 Dings:2002


The effects of matrix sampling on student score comparability in constructed-response and multiple-choice assessments
J. Dings and R. Childs and N. Kingston
(2002)

 Thomas:2002b


Embedding IRT in Structural Equation Models: A Comparison with Regression Based on IRT Scores
D. R. Thomas and I. R. R. Lu and B. D. Zumbo
(2002)

 Thamerus:1996


Fitting a finite mixture distribution to a variable subject to heteroscedastic measurement error
M. Thamerus
(1996)

 Yamamoto:1999


Scaling Methodology and Procedures for the TIMSS Mathematics and Science Scales
K. Yamamoto and E. Kulick
(1999)

 Yeh:2007


Using Trapezoidal Rule for the Area Under a Curve Calculation
S. Yeh
(2007)

 Hardouin:2007b


The SAS Macro-Program %AnaQol to Estimate the Parameters of Item Responses Theory Models
J. Hardouin
Communications in Statistics - Simulation and Computation
36
437-453
(2007)

 Fox:2007a


Multilevel IRT Model Assessment
J. Fox
(2007)

 Fox:2007


Modeling Measurement Error in Structural Multilevel Models
J. Fox and C. A. W. Glas
(2007)

 Ark:2005b


The Effect of Missing Data Imputation on Mokken Scale Analysis
L. A. van der Ark and K. Sijtsma
(2005)

 Ark:2005a


Statistical Models for Categorical Variables
L. A. van der Ark and M. A. Croon and K. Sijtsma
(2005)

 Ark:2002


Hierarchically Related Nonparametric IRT Models, and Practical Data Analysis Methods
L. A. van der Ark and B. T. Hemker and K. Sijtsma
(2002)

 Sijtsma:2001


Progress in NIRT Analysis of Polytomous Item Scores: Dilemmas and Practical Solutions
K. Sijtsma and L. A. van der Ark
(2001)
This paper discusses three open problems in nonparametric polytomous item response theory: (1) theoretically, the latent trait $\theta$ is not stochastically ordered by the observed total score X+; (2) the models do not imply an invariant item ordering; and (3) the regression of an item score on the total score X+ or on the rest-score R is not a monotone nondecreasing function and, as a result, it cannot be used for investigating the monotonicity of the item step response function. Tentative solutions for these problems are discussed. The computer program MSP for nonparametric IRT analysis is based on models which imply neither the stochastic ordering property nor an invariant item ordering. Also, MSP uses item-rest-score regression for investigating item step response functions. It is discussed whether computer programs may be based (temporarily) on models which lack desirable properties and use methods which are not (yet) supported by sound psychometric theory.

 Ark:1999


Contributions to Latent Budget Analysis: A Tool For the Analysis of Compositional Data.
L. A. van der Ark
(1999)

 Ark:1998


Graphical Display of Latent Budget Analysis and Latent Class Analysis, with Special Reference to Correspondence Analysis
L. A. van der Ark and P. G. M. van der Heijden
(1998)

 Heijden:2002


Some Examples of Latent Budget Analysis and its Extensions
P. G. M. van der Heijden and L. A. van der Ark and A. Mooijaart
(2002)

 Thomas:2002a


Applying Item Response Theory Methods to Complex Survey Data
D. R. Thomas and A. Cyr
(2002)

 Carletta:1996


Assessing agreement on classification tasks: the kappa statistic
J. Carletta
Computational Linguistics
22
(1996)

 Zou:2004


Sparse principal component analysis
H. Zou and T. Hastie and R. Tibshirani
(2004)

 Bond:2003b


Measuring Client Satisfaction with Public Education III: Group Effects in Client Satisfaction
T. G. Bond and J. A. King
Journal of Applied Measurement
4
326-334
(2003)

 Bond:2003a


Measuring Client Satisfaction with Public Education II: Comparing Schools with State Benchmarks
T. G. Bond and J. A. King
Journal of Applied Measurement
4
258-268
(2003)

 King:2003


Measuring Client Satisfaction with Public Education I: Meeting Competing Demands in Establishing Statewide Benchmarks
J. A. King and T. G. Bond
Journal of Applied Measurement
4
111-123
(2003)

 Smits:2003a


A Componential IRT Model for Guilt
D. J. M. Smits and P. D. Boeck
Multivariate Behavioral Research
38
161-188
(2003)

 Jehangir:2005


Evaluation of Relations between Scales in an IRT Framework
K. Jehangir
(2005)

 Schumacher:1996


Neural networks and logistic regression. Part I
M. Schumacher and R. Rossner and W. Vach
(1996)

 Tricot:2000


Un modèle de réponses aux items. Propriétés et comparaison de groupes de traitement en épidémiologie
J. Tricot and M. Mesbah
Revue de Statistique Appliquée
48
29-39
(2000)

 Ricker:2003


Setting Cut Scores: Critical Review of Angoff and Modified-Angoff Methods
K. L. Ricker
(2003)
This paper presents a critical review of the Angoff (1971) and Angoff-derived methods,
according to criteria for assessing cut score setting methods originally proposed by Berk
(1986) and further recommendations by Hambleton (2001). The criteria have been
updated to reflect the progress that has been made in standard setting research over the
past 17 years. The paper also discusses the assumptions of the Angoff method, and other current issues surrounding this method. Recommendations for using the Angoff method are made.

 Sheng:2005


Bayesian Analysis of Hierarchical IRT Models: Comparing and Combining the Unidimensional & Multi-unidimensional IRT Models
Y. Sheng
(2005)

 Verstralen:2000


IRT models for subjective weights of options of multiple choice questions
H. H. F. M. Verstralen and N. D. Verhelst
(2000)

 Lauritzen:2007


Exchangeable Rasch Matrices
S. L. Lauritzen
(2007)

 Davidson:2006


Bootstrap Inference in a Linear Equation Estimated by Instrumental Variables
R. Davidson and J. MacKinnon
(2006)

 Festy:2008


Mesures, formes et facteurs de la pauvreté. Approches comparatives
P. Festy and L. Prokofieva
(2008)

 Ward:2008


Presence-only data and the EM algorithm
G. Ward and T. Hastie and S. C. Barry and J. Elith and J. R. Leathwick
Biometrics
(2008)

 Ponocny:2002


On the applicability of some IRT models for repeated measurement designs: Conditions, consequences, and Goodness-of-Fit tests
I. Ponocny
Methods of Psychological Research Online
7
21-40
(2002)

 Rouder:2005


A hierarchical model for estimating response time distributions
J. N. Rouder and J. Lu and P. Speckman and D. Sun and Y. Jiang
Psychonomic Bulletin & Review
12
195-223
(2005)

 Castelloe:2007


Power and Sample Size Determination for Linear Models
J. M. Castelloe and R. G. O'Brien
(2007)

 Zubicaray:2007


Support for an auto-associative model of spoken cued recall: Evidence from fMRI
G. de Zubicaray and K. McMahon and M. Eastburn and A. J. Pringle and L. Lorenz and M. S. Humphreys
Neuropsychologia
45
824-835
(2007)

 Gibbons:2007


The Added Value of Multidimensional IRT Models
R. D. Gibbons and J. C. Immekus and R. D. Bock
(2007)

 Diaz:2006


NAEP-QA FY06 Special Study: 12th Grade Math Trend Estimates
T. E. Diaz and H. A. Le and L. L. Wise
(2006)

 Keerthi:2002


A fast dual algorithm for kernel logistic regression
S. S. Keerthi and K. Duan and S. K. Shevade and A. N. Poo
(2002)

 Bystrom:2007


Task Complexity Affects Information Seeking and Use
K. Byström and K. Järvelin
(2007)

 Saxton:2005


Development of a Short Form of the Severe Impairment Battery
J. Saxton and K. B. Kastango and L. Hugonot-Diener and F. Boller and M. Verny and C. E. Sarles and R. R. Girgis and E. Devouche and P. Mecocci and B. G. Pollock and S. T. DeKosky
American Journal of Geriatric Psychiatry
13
(2005)

 Assaf:2007


A new approach for interexaminer reliability data analysis on dental caries calibration
A. V. Assaf and E. P. da Silva Tagliaferro and M. de Castro Meneghim and C. Tengan and A. C. Pereira and G. M. B. Ambrosano and F. L. Mialhe
Journal of Applied Oral Science
15
(2007)

 Devouche:2003


Les banques d'items. Construction d'une banque pour le Test de Connaissance du Français
E. Devouche
Psychologie et Psychométrie
24
57-88
(2003)

 Postlethwaite:1993


Torsten Husén
T. N. Postlethwaite
Perspectives : revue trimestrielle d'éducation comparée
XXIII
697-707
(1993)

 Roju:1995


IRT-Based Internal Measures of Differential Functioning of Items and Tests
N. S. Raju and W. J. van der Linden and P. F. Fleer
Applied Psychological Measurement
19
353-368
(1995)

 Jacobusse:2006


An interval scale for development of children aged 0-2 years
G. Jacobusse and S. van Buuren and P. H. Verberk
Statistics in Medicine
25
2272-2283
(2006)

 Howell:2005


A model of grounded language acquisition: Sensorimotor features improve lexical and grammatical learning
S. R. Howell and D. Jankowicz and S. Becker
Journal of Memory and Language
53
258-276
(2005)

 Hastedt:2007


Differences between multiple-choice and constructed-response items in PIRLS 2001
D. Hastedt
(2007)

 Devroye:2007


Non-Uniform Random Variate Generation
L. Devroye
(2007)
This chapter provides a survey of the main methods in non-uniform random variate generation, and highlights recent research on the subject. Classical paradigms such as inversion, rejection, guide tables, and transformations are reviewed. We provide information on the expected time complexity of various algorithms, before addressing modern topics such as indirectly specified distributions, random processes, and Markov chain methods.
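The inversion paradigm named in this abstract can be sketched briefly. As a hedged illustration (not from the chapter itself; the exponential distribution and its rate parameter are chosen here only for concreteness): if U is Uniform(0,1) and F is an invertible CDF, then F^{-1}(U) has CDF F.

```python
import math
import random

def exponential_inverse_sample(rate, rng=random.random):
    """Draw one Exp(rate) variate by inversion.

    For Exp(rate), F(x) = 1 - exp(-rate * x), so the inverse CDF is
    F^{-1}(u) = -ln(1 - u) / rate.
    """
    u = rng()  # U ~ Uniform(0, 1)
    return -math.log(1.0 - u) / rate

random.seed(0)
samples = [exponential_inverse_sample(2.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)
# The sample mean should land close to the theoretical mean 1 / rate = 0.5.
```

Inversion is exact whenever F^{-1} is available in closed form; the other paradigms the abstract lists (rejection, guide tables, transformations) cover the cases where it is not.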

 Goldstein:1999


Multilevel statistical models
H. Goldstein
(1999)

 Michailidis:2007


Multilevel Homogeneity Analysis
G. Michailidis
(2007)

 CoE:2005


The Common European Framework
CoE
(2005)

 Meade:2004


Exploratory Measurement Invariance: A New Method Based on Item Response Theory
A. W. Meade and J. K. Ellington and S. B. Craig
(2004)

 Lunz:2007


Examination Development Guidelines
M. E. Lunz
(2007)

 Courville:2004


An empirical comparison of item response theory and classical test theory item/person statistics
T. G. Courville
(2004)

 Keller:2002


Annual College of Education Educational Research Exchange
(2002)

 Stage:2007


A Comparison Between Item Analysis Based on Item Response Theory and Classical Test Theory. A Study of the SweSAT Subtest WORD
C. Stage
(2007)

 Stage:2003


Classical Test Theory or Item Response Theory: The Swedish Experience
C. Stage
(2003)

 Yu:2007


Automation and visualization of distractor analysis using SAS/GRAPH
C. H. Yu
(2007)

 Robinson:2000


Canadian Journal of Education
S. Robinson
25
(2000)

 Mojduszka:2000


Consumer Choice of Food Products and the Implications for Price Competition and Government Labeling Policy
E. M. Mojduszka and J. A. Caswell and J. M. Harris
(2000)

 GarciaPerez:1999


Fitting Logistic IRT Models: Small Wonder
M. A. GarciaPerez
The Spanish Journal of Psychology
2
74-94
(1999)

 Zubairi:2006


Classical and Rasch Analyses of Dichotomously Scored Reading Comprehension Test Items
A. M. Zubairi and N. L. A. Kassim
Malaysian Journal of ELT Research
2
(2006)

 Yamamoto:2002


Estimating PISA students on the IALS prose literacy scale
K. Yamamoto
(2002)

 Stewart:2005


Absolute Identification by Relative Judgment
N. Stewart and G. D. A. Brown and N. Chater
Psychological Review
112
881-911
(2005)

 Cazievel:2000


Estimation for the Rasch Model under a linkage structure: a case study
V. Cazievel
(2000)

 Hochheiser:1999


Performance Benefits of Simultaneous over Sequential Menus As Task Complexity Increases
H. Hochheiser and B. Shneiderman
(1999)

 EVSmith:2006


Book Review: Developing and Validating MultipleChoice Test Items (3rd ed.)
E. V. Smith Jr.
Applied Psychological Measurement
30
69-72
(2006)

 Chen:2006


Verification of Cognitive Attributes Required to Solve the TIMSS-1999 Mathematics Items for Taiwanese Students
Y. Chen and J. Gorin and M. Thompson
(2006)

 Shigemasu:2000


Bayesian hierarchical analysis of polytomous item responses
K. Shigemasu and O. Yoshimura and T. Nakamura
Behaviormetrika
27
51-65
(2000)

 Schwarz:1995


What respondents learn from questionnaires: The survey interview and the logic of conversation
N. Schwarz
International Statistical Review
63
153-177
(1995)

 Bryce:1981


Rasch-Fitting
T. G. K. Bryce
British Educational Research Journal
7
(1981)

 Adams:1997


The Multidimensional Random Coefficients Multinomial Logit Model
R. J. Adams and M. Wilson and W. Wang
Applied Psychological Measurement
21
1-24
(1997)

 Monseur:2007


Equating errors in international surveys in education
C. Monseur and H. Sibberns and D. Hastedt
(2007)

 Brown:2005


The Multidimensional Measure of Conceptual Complexity
N. J. S. Brown
(2005)

 Mitkov:2005


A computer-aided environment for generating multiple-choice test items
R. Mitkov and L. A. Ha and N. Karamanis
Natural Language Engineering
1
117
(2005)

 Wu:2006


Modelling Mathematics Problem Solving Item Responses Using a Multidimensional IRT Model
M. Wu and R. Adams
Mathematics Education Research Journal
18
93-113
(2006)

 Watson:2006


A Longitudinal Study of Student Understanding of Chance and Data
J. Watson and B. Kelly
Mathematics Education Research Journal
18
40-55
(2006)

 Stacey:2006


A Case of the Inapplicability of the Rasch Model: Mapping Conceptual Learning
K. Stacey and V. Steinle
Mathematics Education Research Journal
18
77-92
(2006)

 Grimbeek:2006


Surveying Primary Teachers about Compulsory Numeracy Testing: Combining Factor Analysis with Rasch Analysis
P. Grimbeek and S. Nisbet
Mathematics Education Research Journal
18
27-39
(2006)

 Doig:2006


Easier Analysis and Better Reporting: Modelling Ordinal Data in Mathematics Education Research
B. Doig and S. Groves
Mathematics Education Research Journal
18
56-76
(2006)

 Bradley:2006


Applying the Rasch Rating Scale Model to Gain Insights into Students' Conceptualisation of Quality Mathematics Instruction
K. Bradley and S. Sampson and K. Royal
Mathematics Education Research Journal
18
11-26
(2006)

 Willms:2007


A Manual for Conducting Analyses with Data from TIMSS and PISA
J. D. Willms and T. Smith
(2007)

 Dray:2003


Co-inertia analysis and the linking of ecological data tables
S. Dray and D. Chessel and J. Thioulouse
Ecology
84
3078-3089
(2003)

 Leeuw:1986


Random coefficient models for multilevel analysis
J. de Leeuw and I. Kreft
Journal of Educational Statistics
11
57-85
(1986)

 Benjamini:2002


John W. Tukey's contributions to multiple comparisons
Y. Benjamini and H. Braun
The Annals of Statistics
30
1576-1594
(2002)

 Holmes:2005


Multivariate data analysis: The French way
S. Holmes
(2005)

 Davier:1997


WINMIRA - program description and recent enhancements
M. von Davier
Methods of Psychological Research Online
2
25-28
(1997)

 HugonotDiener:2003


Version abrégée de la severe impairment battery (SIB)
L. Hugonot-Diener and M. Verny and E. Devouche and J. Saxton and P. Mecocci and F. Boller
Psychologie & Neuropsychiatrie du Vieillissement
1
273-283
(2003)

 CJE:2000


Canadian Journal of Education
25
(2000)

 Antonietti:2006


Mesures objectives de traits latents
J. Antonietti
(2006)

 Antonietti:2004


Comment s'assurer de l'alignement d'un ensemble d'items
J. Antonietti
(2004)

 Antonietti:2003b


Designs de testage incomplets et modèle nonparamétrique de la réponse à l'item
J. Antonietti
(2003)

 Antonietti:2003a


Comment mesurer la similarité entre deux structures factorielles latentes
J. Antonietti
(2003)

 Christensen:2003a


SAS macros for Rasch based latent variable modelling
K. B. Christensen and J. B. Bjorner
(2003)

 Christensen:2003


Latent Covariates in Generalized Linear Models
K. B. Christensen and M. L. Nielsen and L. SmithHansen
(2003)

 Walker:2000


Forecasting the political behavior of leaders with the verbs in context system of operational code analysis
S. G. Walker
(2000)

 Camiz:2005


Application de l'analyse factorielle multiple pour le traitement de caractères en échelle dans les enquêtes
S. Camiz and J. Pagès
(2005)

 Claeskens:2007


On local estimating equations in additive multiparameter models
G. Claeskens and M. Aerts
(2007)

 AlKandari:1993


Variable Selection and Principal Component Analysis
N. Al-Kandari
(1993)

 Calvo:2007


A Comparative Study of Principal Component Analysis Techniques
R. A. Calvo and M. Partridge and M. A. Jabri
(2007)

 Goodman:2002


Applied Latent Class Analysis
L. A. Goodman
(2002)

 Balidis:2002


Intraobserver and interobserver reliability of the R/D score for evaluation of iris configuration by ultrasound biomicroscopy, in patients with pigment dispersion syndrome
M. O. Balidis and C. Bunce and K. Boboridis and J. Salzman and R. P. L. Wormald and M. H. Miller
Eye
16
722726
(2002)

 Birkett:1986


Selecting the number of response categories for a Likert-type scale
N. J. Birkett
(1986)

 Bhakta:2005


Using item response theory to explore the psychometric properties of extended matching questions examination in undergraduate medical education
B. Bhakta and A. Tennant and M. Horton and G. Lawton and D. Andrich
BMC Medical Education
5
(2005)

 Kadouri:2007


The improved Clinical Global Impression Scale (iCGI): development and validation in depression
A. Kadouri and E. Corruble and B. Falissard
BMC Psychiatry
7
(2007)

 RevahLevy:2007


The Adolescent Depression Rating Scale (ADRS): a validation study
A. RevahLevy and B. Birmaher and I. Gasquet and B. Falissard
BMC Psychiatry
7
(2007)

 Montanari:2000


Independent Factor Discriminant Analysis
A. Montanari and D. G. Calo and C. Viroli
(2000)

 Schafer:2002


Computational strategies for multivariate linear mixed-effects models with missing values
J. L. Schafer and R. M. Yucel
Journal of Computational and Graphical Statistics
11
437-457
(2002)

 Ackerman:1996


Graphical Representation of Multidimensional Item Response Theory Analyses
T. Ackerman
Applied Psychological Measurement
20
311-329
(1996)

 Stein:2007


Calculation of the Kappa Statistic for Interrater Reliability: The Case Where Raters Can Select Multiple Responses from a Large Number of Categories
C. R. Stein and R. B. Devore and B. E. Wojcik
(2007)

 Leeuw:2007


Statistics and Probability
J. de Leeuw
(2007)

 Graves:1995


The pseudoscience of psychometry and the Bell Curve
J. L. Graves
Journal of Negro Education
64
277
(1995)

 CikrikciDemirtasli:2000


A study of Raven Standard Progressive Matrices test's item measures under classic and item response models: An empirical comparison
N. CikrikciDemirtasli
(2000)

 Stuger:2006


Asymmetric Loss Functions and Sample Size Determination: A Bayesian Approach
H. P. Stüger
Austrian Journal of Statistics
35
57-66
(2006)

 Antonietti:2003


Evaluation des compétences en mathématiques en fin de 2e année primaire
J. Antonietti and N. Guignard and A. Mudry and L. Ntamakiliro and W. Rieben and C. T. Christinat and A. V. der Klink
(2003)

 Charland:1996


Fidélité et validité de la version française du "Children of Alcoholics Screening Test" (CAST)
H. Charland and G. Côté
Revue québécoise de psychologie
17
45-62
(1996)

 Wu:2005


Algorithmes et codes R pour la méthode de la pseudovraisemblance empirique dans les sondages
C. Wu
Techniques d'enquête
31
261-266
(2005)

 Grim:2005


Checking for Nonresponse Bias in Web-Only Surveys of Special Populations using a Mixed-Mode (Web-with-Mail) Design
B. J. Grim and L. M. Semali
(2005)

 Youngstrom:2002


Reliability Generalization of self-report of emotions when using the Differential Emotions Scale
E. A. Youngstrom and K. W. Green
Educational and Psychological Measurement
62
(2002)

 Yin:2000


Assessing the reliability of Beck Depression Inventory scores: Reliability Generalization across studies
P. Yin and X. Fan
Educational and Psychological Measurement
60
201-223
(2000)

 Wallace:2002


Reliability Generalization of the Life Satisfaction Index
K. A. Wallace and A. J. Wheeler
Educational and Psychological Measurement
62
(2002)

 Viswesvaran:2000


Measurement error in "Big Five Factors" personality assessment: Reliability Generalization across studies and measures
C. Viswesvaran and D. Ones
Educational and Psychological Measurement
60
224235
(2000)

 VachaHaase:2001a


Reliability generalization: Exploring reliability variations on MMPI/MMPI-2 Validity scale scores
T. VachaHaase and C. R. Tani and L. R. Kogan and R. A. Woodall and B. Thompson
Assessment
8
391-401
(2001)

 VachaHaase:2001


Reliability generalization: Exploring reliability coefficients of MMPI clinical scales scores
T. VachaHaase and L. Kogan and C. R. Tani and R. A. Woodall
Educational and Psychological Measurement
61
45-59
(2001)

 VachaHaase:2002


Reliability Generalization: Moving toward improved understanding and use of score reliability
T. VachaHaase and R. K. Henson and J. Caruso
Educational and Psychological Measurement
62
(2002)

 VachaHaase:1998


Reliability generalization: Exploring variance in measurement error affecting score reliability across studies
T. VachaHaase
Educational and Psychological Measurement
58
6-20
(1998)

 Thompson:2002b


Stability of the reliability of LibQUAL+TM scores: A "Reliability Generalization" metaanalysis study
B. Thompson and C. Cook
Educational and Psychological Measurement
62
(2002)

 Reese:2002


A Reliability Generalization study of select measures of adult attachment style
R. J. Reese and K. M. Kieffer and B. K. Briggs
Educational and Psychological Measurement
62
(2002)

 Nilsson:2002


Reliability Generalization: An examination of the Career Decision-Making Self-Efficacy Scale
J. E. Nilsson and C. K. Schmidt and W. D. Meek
Educational and Psychological Measurement
62
(2002)

 Lane:2002


Expanding reliability generalization methods with KR-21 estimates: An RG study of the Coopersmith Self-Esteem Inventory
G. G. Lane and A. E. White and R. K. Henson
Educational and Psychological Measurement
62
(2002)

 Kieffer:2002


A Reliability Generalization study of the Geriatric Depression Scale (GDS)
K. M. Kieffer and R. J. Reese
Educational and Psychological Measurement
62
(2002)

 Henson:2001a


Characterizing measurement error in scores across studies: Some recommendations for conducting "Reliability Generalization" (RG) studies
R. K. Henson and B. Thompson
(2001)
Given the potential value of reliability generalization (RG) studies in the development of cumulative psychometric knowledge, the purpose of this paper is to provide a tutorial on how to conduct such studies and to serve as a guide for researchers wishing to use this methodology. After some brief comments on classical test theory, the paper provides a practical framework for structuring an RG study, including: (1) test selection with an eye toward frequency of test use and reporting practices by authors; (2) development of a coding sheet that will capture potential variation in score reliability across studies; (3) procedural recommendations regarding data collection; (4) identification and use of potential dependent variables; and (5) application of general linear model analyses to the data.

 Henson:2001


A reliability generalization study of the Teacher Efficacy Scale and related instruments
R. K. Henson and L. R. Kogan and T. VachaHaase
Educational and Psychological Measurement
61
(2001)

 Henson:2002


Variability and prediction of measurement error in Kolb's Learning Style Inventory scores: A reliability generalization study
R. K. Henson and D. Hwang
Educational and Psychological Measurement
62
(2002)

 Helms:1999


Another meta-analysis of the White Racial Identity Attitude Scale's Cronbach alphas: Implications for validity
J. E. Helms
Measurement and Evaluation in Counseling and Development
32
122-137
(1999)

 Hanson:2002


Reliability Generalization of Working Alliance Inventory scale scores
W. E. Hanson and K. T. Curry and D. L. Bandalos
Educational and Psychological Measurement
62
(2002)

 Dimitrov:2002


Reliability: Arguments for multiple perspectives and potential problems with generalization across studies
D. M. Dimitrov
Educational and Psychological Measurement
62
(2002)

 DeditiusIsland:2002


An examination of the reliability of scores from Zuckerman's Sensation Seeking Scales
H. K. DeditiusIsland and J. C. Caruso
Educational and Psychological Measurement
62
(2002)

 Caruso:2001a


Reliability of scores from the Eysenck Personality Questionnaire: A Reliability Generalization (RG) study
J. C. Caruso and K. Witkiewitz and A. BelcourtDittloff and J. Gottlieb
Educational and Psychological Measurement
61
675-682
(2001)

 Caruso:2001


Reliability Generalization of the Junior Eysenck Personality Questionnaire
J. C. Caruso and S. Edwards
Personality and Individual Differences
31
173-184
(2001)
A reliability generalization was conducted on the Psychoticism (P), Extraversion (E), Neuroticism (N) and Lie (L) scales of the Junior Eysenck Personality Questionnaire (JEPQ). Twenty-three studies provided data on 44 samples of children who had been administered the JEPQ. Score reliability was found to vary significantly both between and within scales. N and L provided the most reliable scores (with median reliabilities of 0.80 and 0.79 respectively), followed by E (median reliability = 0.73) and P (median reliability = 0.68). Scale length was the best predictor of score reliability, but sample gender makeup, language of administration, and the amount of variation in the ages of children in each sample were also significant predictors of reliability for various JEPQ scales. The results highlight the importance of considering reliability to be a property of scores for a particular group, as opposed to a property of a test generally.

 Caruso:2000


Reliability Generalization of the NEO personality scales
J. C. Caruso
Educational and Psychological Measurement
60
236-254
(2000)

 Capraro:2002


Myers-Briggs Type Indicator score reliability across studies: A meta-analytic Reliability Generalization study
R. M. Capraro and M. M. Capraro
Educational and Psychological Measurement
62
659-673
(2002)

 Voelkle:2007


Effect sizes and F ratios < 1.0
M. C. Voelkle and P. L. Ackerman and W. W. Wittmann
Methodology
3
35-46
(2007)
Standard statistics texts indicate that the expected value of the F ratio is 1.0 (more precisely: N/(N-2)) in a completely balanced fixed-effects ANOVA, when the null hypothesis is true. Even though some authors suggest that the null hypothesis is rarely true in practice (e.g., Meehl, 1990), F ratios < 1.0 are reported quite frequently in the literature. However, standard effect size statistics (e.g., Cohen's f) often yield positive values when F < 1.0, which appears to create confusion about the meaningfulness of effect size statistics when the null hypothesis may be true. Given the repeated emphasis on reporting effect sizes, it is shown that in the face of F < 1.0 it is misleading to only report sample effect size estimates as often recommended. Causes of F ratios < 1.0 are reviewed, illustrated by a short simulation study. The calculation and interpretation of corrected and uncorrected effect size statistics under these conditions are discussed. Computing adjusted measures of association strength and incorporating effect size confidence intervals are helpful in an effort to reduce confusion surrounding results when sample sizes are small. Detailed recommendations are directed to authors, journal editors, and reviewers.
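The abstract's claim that the expected F ratio under a true null is about N/(N-2), and that F < 1 occurs often, is easy to check by simulation. A minimal sketch (my own illustration, not the paper's simulation study; the three groups of ten observations and the replication count are arbitrary choices):

```python
import random

def anova_f(groups):
    """One-way ANOVA F ratio: between-group MS over within-group MS."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Simulate a balanced fixed-effects design under a true null hypothesis:
# three groups of ten observations, all drawn from the same N(0, 1).
random.seed(1)
fs = []
for _ in range(5000):
    groups = [[random.gauss(0.0, 1.0) for _ in range(10)] for _ in range(3)]
    fs.append(anova_f(groups))

mean_f = sum(fs) / len(fs)
frac_below_1 = sum(f < 1.0 for f in fs) / len(fs)
# mean_f lands near df2 / (df2 - 2) = 27/25 = 1.08, and well over a third
# of the replications give F < 1, consistent with the abstract's point.
```

Despite the mean being near 1, a majority of individual F ratios in this setup fall below 1, which is why F < 1 is so commonly reported when the null is (nearly) true.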

 Capraro:2001


Measurement error of scores on the Mathematics Anxiety Rating Scale across studies.
M. M. Capraro and R. M. Capraro and R. K. Henson
Educational and Psychological Measurement
61
373-386
(2001)

 Beretvas:2002


Using mixed-effects models in Reliability Generalization studies
S. N. Beretvas and D. A. Pastor
Educational and Psychological Measurement
62
(2002)

 Beretvas:2002a


A Reliability Generalization study of the Marlowe-Crowne Social Desirability Scale
S. N. Beretvas and J. L. Meyers and W. L. Leite
Educational and Psychological Measurement
62
(2002)

 Barnes:2002


Reliability Generalization of scores on the Spielberger State-Trait Anxiety Inventory
L. L. B. Barnes and D. Harp and W. S. Jung
Educational and Psychological Measurement
62
(2002)

 Steiger:1992


R2: A computer program for interval estimation, power calculation, and hypothesis testing for the squared multiple correlation
J. H. Steiger and R. T. Fouladi
Behavior Research Methods, Instruments, and Computers
4
581-582
(1992)

 Cumming:2008


Inference by eye: Confidence intervals, and how to read pictures of data
G. Cumming and S. Finch
American Psychologist
(2008)

 Cumming:2001


A primer on the understanding, use and calculation of confidence intervals that are based on central and noncentral distributions
G. Cumming and S. Finch
Educational and Psychological Measurement
61
532-575
(2001)

 Algina:2003


Approximate confidence intervals for effect sizes
J. Algina and H. J. Keselman
Educational and Psychological Measurement
63
537-553
(2003)

 VachaHaase:2004


How to estimate and interpret various effect sizes
T. VachaHaase and B. Thompson
Journal of Counseling Psychology
51
473-481
(2004)

 Thompson:2008a


Complementary methods for research in education
B. Thompson
(2008)

 Thompson:2008


Research in organizations: Foundational principles, processes, and methods of inquiry
B. Thompson
(2008)

 Thompson:2002a


What future quantitative social science research could look like: Confidence intervals for effect sizes
B. Thompson
Educational Researcher
31
24-31
(2002)

 Thompson:2002


"Statistical," "practical," and "clinical": How many kinds of significance do counselors need to consider?
B. Thompson
Journal of Counseling and Development
80
64-71
(2002)

 Snyder:1993


Evaluating results using corrected and uncorrected effect size estimates
P. Snyder and S. Lawson
Journal of Experimental Education
61
334-349
(1993)

 Rosenthal:1994


The handbook of research synthesis
R. Rosenthal
(1994)

 Olejnik:2000


Measures of effect size for comparative studies: Applications, interpretations, and limitations
S. Olejnik and J. Algina
Contemporary Educational Psychology
25
241-286
(2000)

 Kline:2004


Beyond significance testing: Reforming data analysis methods in behavioral research
R. Kline
(2004)

 Kirk:2003


Handbook of research methods in experimental psychology
R. E. Kirk
83-105
(2003)

 Kirk:1996


Practical significance: A concept whose time has come
R. Kirk
Educational and Psychological Measurement
56
746-759
(1996)

 Hill:2004


Higher education: Handbook of theory and research
C. R. Hill and B. Thompson
19
175-196
(2004)

 Cortina:2000


Effect size for ANOVA designs
J. M. Cortina and H. Nouri
(2000)

 Thompson:1994


The Concept of Statistical Hypothesis Testing
B. Thompson
Measurement Update
4
5-6
(1994)
http://www.coe.tamu.edu/~bthompson/hyptest1.htm

 Thompson:1998a


Five methodology errors in educational research: The pantheon of statistical significance and other faux pas
B. Thompson
(1998)
http://www.coe.tamu.edu/~bthompson/aeraaddr.htm

 Thompson:1999


Common methodology mistakes in educational research, revisited, along with a primer on both effect sizes and the bootstrap
B. Thompson
(1999)
http://www.coe.tamu.edu/~bthompson/aeraad99.htm

 Thompson:1998


Statistical significance and effect size reporting: Portrait of a possible future
B. Thompson
Research in the Schools
5
33-38
(1998)

 Moore:1991


A confirmatory factor analysis of the Threat Index
M. K. Moore and R. A. Neimeyer
Journal of Personality and Social Psychology
60
122-129
(1991)
The Threat Index (TI), a measure of death concern grounded in personal construct theory, was submitted to psychometric refinement. The factorability of the TI using the traditional splitmatch scoring was compared with methods based on Manhattan, Euclidian, standardized Euclidian, and Mahalanobis distance formulas. Statistical and substantive interpretability were enhanced with the standardized Euclidian factor structure. The LISREL VI program was used to determine the best model for the scale in an exploratory factor analysis. A nonhierarchical, G + 3 model met the criterion of goodness of fit >0.9 for the 1st subsample (n = 405). In a confirmatory factor analysis with a 2nd subsample (n = 405), the model was confirmed. Internal consistency and testretest reliability were acceptable for Global Threat and 3 subfactorsThreat to WellBeing, Uncertainty, and Fatalismand all subfactors were found to be independent of social desirability.

 Agresti:2000


Random effects modeling of categorical response data
A. Agresti and J. G. Booth and J. P. Hobert and B. Caffo
(2000)

 From:2006


Estimation of the parameters of the Birnbaum-Saunders distribution
S. G. From and L. Li
Communications in Statistics - Theory and Methods
35
2157-2169
(2006)

 Berg:2007


Variance decomposition using an IRT measurement model
S. M. van den Berg and C. A. W. Glas and D. I. Boomsma
Behavior Genetics
37
604-616
(2007)

 Jackel:2003


A note on multivariate GaussHermite quadrature
P. Jäckel
(2003)

 Boeck:2005


Conceptual and psychometric framework for distinguishing categories and dimensions
P. D. Boeck and M. Wilson and G. S. Acton
Psychological Review
112
129-158
(2005)

 Presnell:1994


Resampling methods for sample surveys
B. Presnell and J. G. Booth
(1994)

 Gonzalez:2006


Numerical integration in logistic-normal models
J. Gonz\'{a}lez and F. Tuerlinckx and P. D. Boeck and R. Cools
Computational Statistics \& Data Analysis
51
1535-1548
(2006)

 Ip:2004


Locally dependent latent trait model for polytomous responses with application to inventory of hostility
E. H. Ip and Y. J. Wang and P. D. Boeck
Psychometrika
69
191-216
(2004)

 Hedeker:2000


Application of item response theory models for longitudinal data
D. Hedeker and R. J. Mermelstein and B. R. Flay
(2000)

 Janssen:1999


Confirmatory analyses of componential test structure using multidimensional item response theory
R. Janssen and P. D. Boeck
Multivariate Behavioral Research
34
245-268
(1999)

 Komarek:2003


Fast robust logistic regression for large sparse datasets with binary outputs
P. R. Komarek and A. W. Moore
(2003)

 Lubke:2000


Factor-analyzing Likert-scale data under the assumption of multivariate normality complicates a meaningful comparison of observed groups or latent classes
G. Lubke and B. Muth\'{e}n
(2000)

 Leenen:2001


Models for ordinal hierarchical classes analysis
I. Leenen and I. V. Mechelen and P. D. Boeck
Psychometrika
66
389-404
(2001)

 Meulders:2003


A taxonomy of latent structure assumptions for probability matrix decomposition models
M. Meulders and P. D. Boeck and I. V. Mechelen
Psychometrika
68
61-77
(2003)

 Meulders:2005


Latent variable models for partially ordered responses and trajectory analysis of angerrelated feelings
M. Meulders and E. H. Ip and P. D. Boeck
British Journal of Mathematical and Statistical Psychology
58
117-143
(2005)

 Lawrence:2000


Bayesian inference for ordinal data using multivariate probit models
E. Lawrence and D. Bingham and C. Liu and V. N. Nair
(2000)

 Wilson:2003


On choosing a model for measuring
M. Wilson
Methods of Psychological Research Online
8
1-22
(2003)

 Wermuth:2000


Analysing social science data with graphical Markov models
N. Wermuth
(2000)

 TayLim:2000


Generating item responses for balanced-incomplete-block (BIB) design using the generalized partial credit model (GPCM)
B. S. TayLim
(2000)

 Revelle:1979


Very Simple Structure: An alternative procedure for estimating the optimal number of interpretable factors
W. Revelle and T. Rocklin
Multivariate Behavioral Research
14
403-414
(1979)

 Rupp:2007


The development, calibration, and inferential validation of standards-based assessments for English as a first foreign language at the IQB
A. A. Rupp and M. Vock and C. Harsch
(2007)

 Thacher:2005


Using patient characteristics and attitudinal data to identify depression treatment preference groups: A latent-class model
J. A. Thacher and E. Morey and W. E. Craighead
(2005)

 Teresi:2004


Differential item functioning and health assessment
J. Teresi
(2004)

 Hardouin:2007a


Nonparametric item response theory with SAS and Stata
J. Hardouin
Journal of Statistical Software
(2007)

 Norquist:2003


Rasch measurement in the assessment of amyotrophic lateral sclerosis patients
J. M. Norquist and R. Fitzpatrick and C. Jenkinson
Journal of Applied Measurement
4
249-257
(2003)

 Groenen:2006


Visions of 70 years of psychometrics: the past, present, and future
P. J. F. Groenen and L. A. van der Ark
Statistica Neerlandica
60
135-144
(2006)

 Martin:2007


On the analysis of Bayesian semiparametric IRT-type models
E. S. Martin and A. Jara and J. Rolin and M. Mouchart
(2007)

 Fox:2005b


Bayesian modification indices for IRT models
J. Fox and C. A. W. Glas
Statistica Neerlandica
59
95-106
(2005)

 GarciaZattera:2005


Conditional independence of multivariate binary data with an application in caries research
M. J. GarciaZattera and A. Jara and E. Lesaffre and D. Declerck
(2005)

 Finkelman:2007


Using person fit in a body of work standard setting
M. Finkelman and W. Kim
(2007)

 Hamon:2000


Modèle de Rasch et validation de questionnaires de qualité de vie
A. Hamon
(2000)

 Hamon:2002


Statistical Methods for Quality of Life Studies. Design, Measurement and Analysis
A. Hamon and M. Mesbah
(2002)

 Thomas:2002


Intégration de la théorie de la réponse aux items aux modèles par équations structurelles: Comparaison avec une régression fondée sur des scores TRI
D. R. Thomas and I. R. R. Lu and B. D. Zumbo
(2002)

 Fox:2001a


Multilevel IRT: A Bayesian perspective on estimating parameters and testing statistical hypotheses
J. Fox
(2001)

 Vermunt:2007


Latent class and finite mixture models for multilevel data sets
J. K. Vermunt
Statistical Methods in Medical Research
(2007)

 Vermunt:2001a


Modeling joint and marginal distributions in the analysis of categorical panel data
J. K. Vermunt and M. F. Rodrigo and M. AtoGarcia
Sociological Methods and Research
30
170-196
(2001)

 Vermunt:2001


The use of restricted latent class models for defining and testing nonparametric and parametric IRT models
J. K. Vermunt
Applied Psychological Measurement
25
283-294
(2001)

 Ark:2005


Stochastic Ordering of the Latent Trait by the Sum Score Under Various Polytomous IRT Models
L. A. van der Ark
Psychometrika
70
283-304
(2005)

 Fox:2005a


Multilevel IRT using dichotomous and polytomous response data
J. Fox
British Journal of Mathematical and Statistical Psychology
58
145-172
(2005)

 Cui:2006


The hierarchy consistency index: A personfit statistic for the attribute hierarchy model
Y. Cui and J. P. Leighton and M. J. Gierl and S. M. Hunka
(2006)

 Zijlstra:2007


Outlier detection in test and questionnaire data
W. P. Zijlstra and L. A. van der Ark and K. Sijtsma
Multivariate Behavioral Research
(2007)

 Ginkel:2007b


Multiple Imputation of Item Scores in Test and Questionnaire Data, and Influence on Psychometric Results
J. R. van Ginkel and L. A. van der Ark and K. Sijtsma
Multivariate Behavioral Research
42
387-414
(2007)
The performance of five simple multiple-imputation methods for dealing with missing data was compared. In addition, random imputation and multivariate normal imputation were used as lower and upper benchmarks, respectively. Test data were simulated and item scores were deleted such that they were either missing completely at random, missing at random, or not missing at random. Cronbach's alpha, Loevinger's scalability coefficient H, and the item cluster solution from Mokken scale analysis of the complete data were compared with the corresponding results based on the data including imputed scores. The multiple-imputation methods, two-way with normally distributed errors, corrected item-mean substitution with normally distributed errors, and response function, produced discrepancies in Cronbach's coefficient alpha, Loevinger's coefficient H, and the cluster solution from Mokken scale analysis that were smaller than the discrepancies produced by the upper benchmark, multivariate normal imputation.

 Ginkel:2007a


Multiple imputation for item scores when test data are factorially complex
J. R. van Ginkel and L. A. van der Ark and K. Sijtsma
British Journal of Mathematical and Statistical Psychology
(2007)
Multiple imputation under a two-way model with error is a simple and effective method that has been used to handle missing item scores in unidimensional test and questionnaire data. Extensions of this method to multidimensional data are proposed. A simulation study is used to investigate whether these extensions produce biased estimates of important statistics in multidimensional data, and to compare them with the lower benchmark of listwise deletion, two-way with error, and multivariate normal imputation. The new methods produce smaller bias in several psychometrically interesting statistics than the existing methods of two-way with error and multivariate normal imputation. One of these new methods is clearly preferable for handling missing item scores in multidimensional test data.

 RabeHesketh:2001b


Multilevel modeling of cognitive function in schizophrenic patients and their first-degree relatives
S. RabeHesketh and T. Toulopoulou and R. M. Murray
Multivariate Behavioral Research
36
279-298
(2001)

 Rossi:2007


Factor analysis of the Dutch-language version of the MCMI-III
G. Rossi and L. A. van der Ark and H. Sloore
Journal of Personality Assessment
88
144-157
(2007)

 Fox:2005


Randomized item response theory models
J. Fox
Journal of Educational and Behavioral Statistics
30
1-24
(2005)

 Jong:2007


Using item response theory to measure extreme response style in marketing research: A global investigation
M. G. D. Jong and J. B. E. M. Steenkamp and J. Fox
Journal of Marketing Research
(2007)

 Petridou:2006


Instability of person misfit and ability estimates subject to assessment modality
A. Petridou and J. Williams
(2006)

 Hardouin:2007


Mathematical methods for survival analysis, reliability and quality of life
J. Hardouin and M. Mesbah
(2007)

 RabeHesketh:2001a


Maximum likelihood estimation of generalized linear model with covariate measurement error
S. RabeHesketh and A. Skrondal and A. Pickles
The Stata Journal
1
(2001)

 Fox:2004a


Modelling Response Error in School Effectiveness Research
J. Fox
Statistica Neerlandica
58
138160
(2004)

 Fox:2004


Applications of Multilevel IRT Modeling
J. Fox
School Effectiveness and School Improvement
15
261-280
(2004)

 Fox:2003


Stochastic EM for Estimating the Parameters of a Multilevel IRT Model
J. Fox
British Journal of Mathematical and Statistical Psychology
56
65-81
(2003)

 Fox:2001


Bayesian Estimation of a Multilevel IRT Model using Gibbs Sampling
J. Fox and C. A. W. Glas
Psychometrika
66
269-286
(2001)

 Fox:2000


Bayesian Modeling of Measurement Error in Predictor Variables Using Item Response Theory
J. Fox and C. A. W. Glas
(2000)

 Jara:2007


A Dirichlet process mixture model for the analysis of correlated binary responses
A. Jara and M. J. GarciaZattera and E. Lesaffre
Computational Statistics \& Data Analysis
51
5402-5415
(2007)
The multivariate probit model is a popular choice for modelling correlated binary responses. It assumes an underlying multivariate normal distribution dichotomized to yield a binary response vector. Other choices for the latent distribution have been suggested, but basically all models assume homogeneity in the correlation structure across the subjects. When interest lies in the association structure, relaxing this homogeneity assumption could be useful. The latent multivariate normal model is replaced by a location and association mixture model defined by a Dirichlet process. Attention is paid to the parameterization of the covariance matrix in order to make the Bayesian computations convenient. The approach is illustrated on a simulated data set and applied to oral health data from the Signal Tandmobiel® study to examine the hypothesis that caries is mainly a spatially local disease.

 King:2001


Analyzing incomplete political science data: An alternative algorithm for multiple imputation
G. King and J. Honaker and A. Joseph and K. Scheve
American Political Science Review
95
49-69
(2001)

 Skrondal:2003


Some applications of generalized linear latent and mixed models in epidemiology: Repeated measures, measurement error and multilevel modeling
A. Skrondal and S. RabeHesketh
Norsk Epidemiologi
13
265-278
(2003)

 Pornel:2004


A new statistic to detect misfitting score vectors
J. B. Pornel and L. S. Sotaridona and A. L. Vallejo
(2004)

 Ginkel:2007


Two-way imputation: A Bayesian method for estimating missing scores in tests and questionnaires, and an accurate approximation
J. R. van Ginkel and L. A. van der Ark and K. Sijtsma and J. K. Vermunt
Computational Statistics \& Data Analysis
51
4013-4027
(2007)

 RabeHesketh:2001


Parametrization of multivariate random effects models for categorical data
S. RabeHesketh and A. Skrondal
Biometrics
57
1256-1264
(2001)

 Abswoude:2004a


Mokken scale analysis using hierarchical clustering procedures
A. A. H. van Abswoude and J. K. Vermunt and B. T. Hemker and L. A. van der Ark
Applied Psychological Measurement
28
332-354
(2004)

 Abswoude:2004


A comparative study of test data dimensionality assessment procedures under nonparametric IRT models
A. A. H. van Abswoude and L. A. van der Ark and K. Sijtsma
Applied Psychological Measurement
28
3-24
(2004)

 Ark:2001


Relationships and properties of polytomous item response theory models
L. A. van der Ark
Applied Psychological Measurement
25
273-282
(2001)

 Noortgate:2003


Cross-classification multilevel logistic models in psychometrics
W. V. den Noortgate and P. D. Boeck and M. Meulders
Journal of Educational and Behavioral Statistics
28
369-386
(2003)

 Rijmen:2005


A relation between a between-item multidimensional IRT model and the mixture Rasch model
F. Rijmen and P. D. Boeck
Psychometrika
70
481-496
(2005)

 Martin:2006


IRT models for ability-based guessing
E. S. Martin and G. del Pino
Applied Psychological Measurement
30
183-203
(2006)

 Smits:2003


Estimation of the MIRID: A program and a SAS-based approach
D. J. M. Smits and P. D. Boeck and N. Verhelst
Behavior Research Methods, Instruments, & Computers
35
537-549
(2003)

 Verguts:2001a


Some Mantel-Haenszel tests of Rasch model assumptions
T. Verguts and P. D. Boeck
British Journal of Mathematical and Statistical Psychology
54
21-37
(2001)

 Verguts:2000a


A note on the Martin-Löf test for unidimensionality
T. Verguts and P. D. Boeck
Methods of Psychological Research Online
5
77-82
(2000)

 Tuerlinckx:2001


Nonmodeled item interactions can lead to distorted discrimination parameters: A case study
F. Tuerlinckx and P. D. Boeck
Methods of Psychological Research Online
6
159-174
(2001)

 Tuerlinckx:2006


Statistical inference in generalized linear mixed models: A review
F. Tuerlinckx and F. Rijmen and G. Verbeke and P. D. Boeck
British Journal of Mathematical and Statistical Psychology
59
225-255
(2006)

 Tuerlinckx:2005


Two interpretations of the discrimination parameter
F. Tuerlinckx and P. D. Boeck
Psychometrika
70
629-650
(2005)
In this paper we propose two interpretations for the discrimination parameter in the two-parameter logistic model (2PLM). The interpretations are based on the relation between the 2PLM and two stochastic models. In the first interpretation, the 2PLM is linked to a diffusion model so that the probability of absorption equals the 2PLM. The discrimination parameter is the distance between the two absorbing boundaries and therefore the amount of information that has to be collected before a response to an item can be given. For the second interpretation, the 2PLM is connected to a specific type of race model. In the race model, the discrimination parameter is inversely related to the dependency of the information used in the decision process. Extended versions of both models with person-to-person variability in the difficulty parameter are considered. When fitted to a data set, it is shown that a generalization of the race model that allows for dependency between choices and response times (RTs) is the best-fitting model.

 Verstralen:2000ab


A Double Hazard Model for Mental Speed
H. H. F. M. Verstralen and N. D. Verhelst and T. M. Bechger
(2000)
The administration of tests via the computer allows the registration of response times along with the actual response. This paper describes a model that combines these two kinds of data to estimate a subject's latent variable, usually called mental speed but more appropriately called mental power. The model implies that the expected item score increases with invested time. Nevertheless, it allows for a decreasing expected item score with response time, which is sometimes found in experiments. This paradox is obtained by assuming that a subject not only stops working on a problem because of time pressure, but also when he has solved the problem. The model builds on a familiar framework of IRT models. An MML estimation procedure is developed, and model fit on the item level is evaluated using Lagrange multiplier tests.

 Verstralen:1998aa


A Latent IRT Model for Options of Multiple Choice Items
H. H. F. M. Verstralen
(1998)
A latent IRT model for the analysis of multiple choice questions is proposed. The incorrect options of an item are associated with a decreasing logistic function that models the probability of being judged correct. It is assumed that the correct option is always recognized as such. According to the model a subject selects randomly from the subset of options considered correct. Like its companion treated in Verstralen (1997) the model can be viewed as a generalization of Nedelsky's (1954) method to determine a pass/fail score. With this other model it has in common that the ML latent variable estimator gains some precision compared to binary scoring. Both models also share some other favorable psychometric properties.

 Verstralen:1998ab


A Logistic Latent Class Model for Multiple Choice Items
H. H. F. M. Verstralen
(1998)
A logistic latent class model for the analysis of options of a class of multiple choice items is presented. For each item a set of latent classes with a chain structure is assumed. The probability of latent class membership is modeled by a logistic function. The conditional probability of the observed response, the selection of an option, given the latent class membership is assumed to be constant. The model can be viewed as a generalization of Nedelsky's (1954) method to determine a pass/fail score. Apart from giving a more detailed model on the process of solving a multiple choice item, an increase in the precision of latent variable estimates in comparison with binary scoring is achieved. The model is shown to possess some favorable psychometric properties.

 Rijn:2000aa


A Selection Procedure for Polytomous Items in Computerized Adaptive Testing
P. W. van Rijn and T. J. H. M. Eggen and B. T. Hemker and P. F. Sanders
(2000)
In the present study, a procedure which was developed to select dichotomous items in computerized adaptive testing was applied to polytomous items. The aim of this procedure is to select the item with maximum weighted information. In a simulation study, the item information function was integrated over a fixed interval of ability values and the item with the maximum area was selected. This maximum interval information item selection procedure was compared to a maximum point information item selection procedure. No substantial differences between the two item selection procedures were found when computerized adaptive tests were evaluated on bias and root mean square of the ability estimate.

 Bechger:2001aa


About the Cluster Kappa Coefficient
T. M. Bechger and B. T. Hemker and G. K. J. Maris
(2001)
The cluster kappa was proposed by Schouten (1982) as a measure of chance-corrected rater agreement suitable for studies where objects are rated on a categorical scale by two or more judges. We discuss a way to calculate the cluster kappa that is suitable even when ratings are missing. Further, we demonstrate how the sampling error of the cluster kappa may be estimated.

 Ruitenburg:2006aa


Algorithms for Parameter Estimation in the Rasch Model
J. van Ruitenburg
(2006)

 Verschoor:2005aa


An Approximation of Cronbach's $\alpha$ and its Use in Test Assembly
A. J. Verschoor
(2005)
In this paper a new approximation of Cronbach's $\alpha$ is presented. It is especially suited in the context of test assembly. Using this approximation, two test assembly models are introduced. Being nonlinear models, they are solved by Genetic Algorithms, as the commonly used Linear Programming methods cannot be used here. A comparison is made with existing test assembly models.

 Maris:2004aa


An Introduction to the DA-T Gibbs Sampler for the Two-Parameter Logistic (2PL) Model and its Application
G. Maris and T. M. Bechger
(2004)

 Maris:2003aa


Are attitude items monotone or single-peaked? An analysis using Bayesian methods
G. Maris
(2003)

 Hickendorff:2005aa


Clustering Nominal Data with Equivalent Categories: a Simulation Study Comparing Restricted GROUPALS and Restricted Latent Class Analysis
M. Hickendorff
(2005)

 Bechger:2003ab


Combining classical test theory and item response theory
T. Bechger and G. Maris and A. Béguin and H. Verstralen
(2003)

 Straetmans:1998aa


Comparison of Test Administration Procedures for Placement Decisions in a Mathematics Course
G. J. J. M. Straetmans and T. J. H. M. Eggen
(1998)
In this study, three different test administration procedures for making placement decisions in adult education were compared: a paper-based test (PBT), a computer-based test (CBT), and a computerized adaptive test (CAT). All tests were prepared from an item response theory calibrated item bank. The subjects were 90 volunteer students from three adult education schools. They were randomly assigned to one of six experimental groups to take two tests which differed in mode of administration. The results indicate that test performance was not differentially affected by the mode of administration and that the CAT always yielded more accurate ability estimates than the two other test administration procedures. The CAT was also found to be capable of making placement decisions with a test that was on average 24% shorter.

 Straetmans:2003aa


Computerized Adaptive Testing: What It Is and How It Works
G. J. J. M. Straetmans and T. J. H. M. Eggen
(2003)

 Maris:2003ac


Concerning the identification of the 3PL model
G. Maris
(2003)

 Beguin:2001aa


Effect of Noncompensatory Multidimensionality on Separate and Concurrent estimation in IRT Observed Score Equating
A. A. Béguin and B. A. Hanson
(2001)
In this article, the results of a simulation study comparing the performance of separate and concurrent estimation of a unidimensional item response theory (IRT) model applied to multidimensional noncompensatory data are reported. Data were simulated according to a two-dimensional noncompensatory IRT model for both equivalent and nonequivalent groups designs. The criteria used were the accuracy of estimating a distribution of observed scores, and the accuracy of IRT observed score equating. In general, unidimensional concurrent estimation resulted in lower or equivalent total error than separate estimation, although there were a few cases where separate estimation resulted in slightly less error than concurrent estimation. Estimates from the correctly specified multidimensional model generally resulted in less error than estimates from the unidimensional model. The results of this study, along with results from a previous study where data were simulated using a compensatory multidimensional model, make clear that multidimensionality of the data affects the relative performance of separate and concurrent estimation, although the degree to which the unidimensional model produces biased results with multidimensional data depends on the type of multidimensionality present.

 Bechger:2000aa


Equivalent Linear Logistic Test Models
T. M. Bechger and H. H. F. M. Verstralen and N. D. Verhelst
(2000)
This paper is about the Linear Logistic Test Model (LLTM). We demonstrate that there are infinitely many equivalent ways to specify a model. An implication is that there may well be many ways to change the specification of a given LLTM and achieve the same improvement in model fit. To illustrate this phenomenon we analyze a real data set using a Lagrange multiplier test for the specification of the model.

 Maris:2003ab


Equivalent MIRID models
G. Maris and T. Bechger
(2003)

 Verhelst:2000aa


Estimating the Reliability of a Test from a Single Test Administration
N. D. Verhelst
(2000)
The article discusses methods of estimating the reliability of a test from a single test administration. In the first part a review of existing indices is given, supplemented with two heuristics to approximate Guttman's $\lambda_4$ and a new similar coefficient. Special attention is given to the greatest lower bound, to its meaning as well as to the problems in computing it. In the second part the relation between Cronbach's $\alpha$ and the reliability is studied by means of a factorial model for the item scores. This part gives some useful formulae to appreciate the amount by which the reliability is underestimated when $\alpha$ is used as its estimator. In the last part, the sampling distribution of the indices is investigated by means of two simulation studies, showing that the indices exhibit severe bias, the direction of which depends partly on the factorial structure of the test. For three indices the bias is modeled. The model describes the bias accurately for all cases studied in the simulation studies. It is shown how this bias correction may be applied in the case of a single data set.

 Verstralen:2006aa


Explorations in recursive designs
H. Verstralen
(2006)
Starting from a set of basic designs, more complex designs are created by recursive application of the basic designs. Properties of these designs, and their effects on the accuracy of Rasch CML parameter estimates, are investigated.

 Maris:2005aa


Fuzzy Set Theory $\subseteq$ Probability Theory?
G. Maris
(2005)

 Bechger:2000ab


Identifiability of NonLinear Logistic Test Models
T. M. Bechger and N. D. Verhelst and H. H. F. M. Verstralen
(2000)
The linear logistic test model (LLTM) specifies the item parameters as a weighted sum of basic parameters. The LLTM is a special case of a more general nonlinear logistic test model (NLTM) where the weights are partially unknown. This paper is about the identifiability of the NLTM. Sufficient and necessary conditions for global identifiability are presented for an NLTM where the weights are linear functions, while conditions for local identifiability are shown to require fewer assumptions. It is also discussed how these conditions are checked using an algorithm due to Bekker, Merckens, and Wansbeek (1994). Several illustrations are given.

 Huitzing:2004aa


Infeasibility in Automated Test Assembly Models: A Comparison Study of Different Methods
H. A. Huitzing and B. P. Veldkamp and A. J. Verschoor
(2004)
Several techniques exist to automatically put together a test meeting a number of specifications. In an item bank, the items are stored with their characteristics. A test is constructed by selecting a set of items that fulfills the specifications set by the test assembler. Test assembly problems are often formulated in terms of a model consisting of restrictions and an objective to be maximized or minimized. A problem arises when it is impossible to construct a test from the item pool that meets all specifications, that is, when the model is not feasible. Several methods exist to handle these infeasibility problems.
In this paper, test assembly models resulting from two practical testing programs were reconstructed to be infeasible. These models were analyzed using methods that either forced a solution (Goal Programming, Multiple-Goal Programming, Greedy Heuristic), analyzed the causes (Relaxed and Ordered Deletion Algorithm, Integer Randomized Deletion Algorithm, Set Covering and Item Sampling), or analyzed the causes and used this information to force a solution (Irreducible-Infeasible-Set Solver). Specialized methods like the Integer Randomized Deletion Algorithm and the Irreducible-Infeasible-Set Solver performed best. Recommendations about the use of different methods are given.

 Verstralen:2000aa


IRT models for subjective weights of options of multiple choice questions.
H. H. F. M. Verstralen and N. D. Verhelst
(2000)
From earlier investigations it was found that the information from Multiple Choice (MC) questions could be increased about fourfold by having the subject indicate the subset of options that he is unable to expose as false. In the present models this approach is generalized by having the subject distribute a number of 'taws' over the options, or draw a line after the options, such that the number of taws given to an option, or the line length, reflects its subjective degree of correctness. It appears that even with values of the relevant parameters that seem modest, the information relative to binary scoring is still in excess of two. This means that with less than half the test length the same accuracy or reliability can be obtained as with binary scoring. With a real data set we found a relative information greater than five. If a few main fallacies can be reflected in the distractors of the items, the model can be applied to identify subjects with one of these fallacies.

 Verschoor:2004aa


IRT Test Assembly Using Genetic Algorithms
A. J. Verschoor
(2004)
This paper introduces a new class of optimisation methods in test assembly: Genetic Algorithms (GAs). In the first part an overview is given of the concepts and principles of GAs; in the second part they are applied to three commonly used test assembly models using Item Response Theory. Simulation studies are performed in order to find conditions under which GAs can be successfully used.

 Eggen:1998ab


Item Selection in Adaptive Testing with the Sequential Probability Ratio Test
T. J. H. M. Eggen
(1998)
Computerized adaptive tests (CATs) were originally developed to obtain an efficient estimate of an examinee's ability. For classification problems, applications of the Sequential Probability Ratio Test (Wald, 1947) have been shown to be a promising alternative to testing algorithms which are based on statistical estimation. However, the method of item selection currently being used in these algorithms, which use statistical testing to make inferences about the examinees, is either random or based on a criterion which is related to optimizing estimates of examinees (maximum (Fisher) information). In this study, an item selection method based on Kullback-Leibler information is presented, which is theoretically more suitable for statistical testing problems and which can improve the testing algorithm for classification problems. Simulation studies were conducted for two- and three-way classification problems, in which item selection based on Fisher information and Kullback-Leibler information were compared. The results of these studies showed that the performance of the testing algorithms with Kullback-Leibler information-based item selection is sometimes better and never worse than algorithms with Fisher information-based item selection.

 Eggen:2004aa


Loss of Information in Estimating Item Parameters in Incomplete Designs
T. J. H. M. Eggen and N. D. Verhelst
(2004)
In this paper, the efficiency of conditional maximum likelihood (CML) and marginal maximum likelihood (MML) estimation of the item parameters of the Rasch model in incomplete designs is studied. The use of the concept of F-information (Eggen, 2000) is generalized to incomplete testing designs. The standardized determinant of the F-information matrix is used as a scalar measure of information in a set of item parameters. In this paper, the relation between the normalization of the Rasch model and this determinant is clarified. It is shown that comparing estimation methods with the defined information efficiency is independent of the chosen normalization. In examples, information comparisons are conducted. It is found that for both CML and MML some information is lost in all incomplete designs compared to complete designs. A general trend is that with increasing test booklet length the efficiency of an incomplete relative to a complete design, and also the efficiency of CML compared to MML, increases. The main difference between CML and MML is seen in relation to the length of the test booklet. It is demonstrated that with very small booklets, there is a substantial loss in information (about 35%) with CML estimation, while this loss is only about 10% in MML estimation. However, with increasing test length, the differences between CML and MML quickly disappear.

 Verhelst:1998aa


Modeling Sums of Binary Responses by the Partial Credit Model
N. D. Verhelst and H. H. F. M. Verstralen
(1998)
The Partial Credit Model (PCM) is sometimes interpreted as a model for stepwise solution of polytomously scored items, where the item parameters are interpreted as difficulties of the steps. It is argued that this interpretation is not justified. A model for stepwise solution is discussed. It is shown that the PCM is suited to
model sums of binary responses which are not assumed to be stochastically independent. As a practical result, a statistical test of stochastic independence in the Rasch model is derived.
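For reference, the PCM discussed in this abstract can be written in its usual parameterization for an item $i$ with $m_i$ ordered score categories as:

```latex
P(X_i = x \mid \theta)
  = \frac{\exp\!\left(\sum_{j=1}^{x}(\theta - \beta_{ij})\right)}
         {\sum_{k=0}^{m_i}\exp\!\left(\sum_{j=1}^{k}(\theta - \beta_{ij})\right)},
  \qquad x = 0, 1, \dots, m_i,
```

where the empty sum for $k = 0$ is defined as zero; the $\beta_{ij}$ are the category parameters whose interpretation as step difficulties the paper argues against.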

 Hemker:2000aa


On Measurement Properties of Continuation Ratio Models
B. T. Hemker and L. A. van der Ark and K. Sijtsma
(2000)
Three classes of polytomous IRT models are distinguished. These classes are the adjacent
category models, the cumulative probability models, and the continuation ratio models. So far, the latter class has received relatively little attention. The class of continuation ratio models
includes logistic models, such as the sequential model (Tutz, 1990), and non-logistic models,
such as the acceleration model (Samejima, 1995) and the nonparametric sequential model (Hemker, 1996). Four measurement properties are discussed. These are monotone likelihood
ratio of the total score, stochastic ordering of the latent trait by the total score, stochastic
ordering of the total score by the latent trait, and invariant item ordering. These properties
have been investigated previously for the adjacent category models and the cumulative
probability models, and for the continuation ratio models this is done here. It is shown that
stochastic ordering of the total score by the latent trait is implied by all continuation ratio
models, while monotone likelihood ratio of the total score and stochastic ordering on the
latent trait by the total score are not implied by any of the continuation ratio models. Only the
sequential rating scale model implies the property of invariant item ordering. Also, we present a Venn diagram showing the relationships between all known polytomous IRT models from all three classes.
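The three model classes distinguished in this abstract differ in which conditional response probability they model for an item $i$ with polytomous score $X_i$ (standard definitions, as in the nonparametric IRT literature):

```latex
\text{adjacent category:}\qquad A_{ix}(\theta) = P(X_i = x \mid X_i \in \{x-1, x\},\, \theta),
\\[4pt]
\text{cumulative probability:}\qquad C_{ix}(\theta) = P(X_i \geq x \mid \theta),
\\[4pt]
\text{continuation ratio:}\qquad M_{ix}(\theta) = P(X_i \geq x \mid X_i \geq x-1,\, \theta).
```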

 Eggen:1998aa


On the Loss of Information in Conditional Maximum Likelihood Estimation of Item Parameters
T. J. H. M. Eggen
(1998)
In item response models of the Rasch type (Fischer & Molenaar, 1995), item
parameters are often estimated by the conditional maximum likelihood (CML)
method. This paper addresses the loss of information in CML estimation by using
the information concept of F-information (Liang, 1983). This concept makes it
possible to specify the conditions for no loss of information and to define a
quantification of information loss. For the dichotomous Rasch model, the
derivations will be given in detail to show the use of the F-information concept
for making efficiency comparisons for different estimation methods. It is shown
that by using CML for item parameter estimation, some information is almost
always lost. But compared to JML (joint maximum likelihood) as well as to MML
(marginal maximum likelihood) the loss is very small. The reported efficiency of
CML to JML and to MML in several comparisons is always larger than 93%, and
in tests with a length of 20 items or more, larger than 99%.

 Eggen:2004ab


Optimal Testing With Easy or Difficult Items in Computerized Adaptive Testing
T. J. H. M. Eggen and A. J. Verschoor
(2004)
Computerized adaptive tests (CATs) are individualized tests which, from a measurement point of view, are optimal for each individual, possibly under some practical conditions. In the
present study it is shown that maximum information item selection in CATs, using an item bank calibrated with the one- or two-parameter logistic model, results in each individual answering about 50% of the items correctly. Two item selection procedures giving easier (or more difficult) tests for students are presented and evaluated. Item selection on probability points of items yields good results only with the 1PL model and not with the 2PL model. An alternative selection procedure, based on maximum information at a shifted ability level, gives satisfactory results with both models.
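The ~50%-correct result has a simple mechanical core, sketched here with illustrative Rasch difficulties (not the authors' code): under the 1PL model, item information $p(1-p)$ peaks where difficulty equals ability, i.e. where the success probability is 0.5, so selecting on maximum information at a shifted ability level yields an easier (or harder) test.

```python
import math

def p_rasch(theta, b):
    """Rasch (1PL) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def info_rasch(theta, b):
    """Rasch item information: p * (1 - p), maximal when b = theta (p = 0.5)."""
    p = p_rasch(theta, b)
    return p * (1.0 - p)

bank = [-1.0, -0.5, 0.0, 0.5, 1.0]   # hypothetical item difficulties
theta = 0.2                          # current ability estimate

# Maximum information selection picks the difficulty closest to theta:
best = max(bank, key=lambda b: info_rasch(theta, b))
# Selecting at a shifted ability theta - s instead gives an easier test:
shift = 1.0
easier = max(bank, key=lambda b: info_rasch(theta - shift, b))
```

The selected `easier` item has a success probability above 0.5 at the true ability, which is the mechanism behind the shifted-ability procedure evaluated in the paper.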

 Eggen:2001aa


Overexposure and underexposure of items in computerized adaptive testing
T. J. H. M. Eggen
(2001)
Computerized adaptive tests (CATs) have been shown to be considerably more efficient than paper-and-pencil tests. This gain is realized by offering each candidate the most informative item from an available item bank on the basis of the results of the items already administered. The item selection methods used to compose an optimum test for each individual do, however, have a number of drawbacks. Although a CAT generally presents each candidate with a different test, it often occurs that some items from the item bank are administered very frequently while others are never or hardly ever used. These two problems, i.e., overexposure and underexposure of items, can be eliminated by adding further restrictions to the item selection methods. However, this exposure control will affect the efficiency of the CAT. This paper presents a solution for both problems. The functioning of these methods is illustrated with the results of simulation research carried out in developing adaptive tests.

 Roelofs:2001aa


Preferences for various learning environments: Teachers' and parents' perceptions
E. C. Roelofs and J. J. C. M. Visser
(2001)
In the last ten years, a number of innovations, mainly inspired by constructivist notions of
learning, have been introduced in various levels of the Dutch educational system. However,
constructivist learning environments are rarely implemented. Teachers tend to stick to
expository and structured learning environments. This consistent finding calls for research to gain insight into teachers' preferences for learning environments and to determine the factors that support and impede the realization of these learning environments. Regarding the influence of social backgrounds on student learning, it is also important to take stock of parental views on learning environments.
This study is focused on teachers' preferences for learning environments, their reported
teaching behavior, and how these match with parents' preferences. Three parallel
questionnaires were developed for teachers (n=281), students (n=952), and parents (n=717), measuring preferences and behavior in different levels of education, for three types of
learning environments: direct instruction, discovery learning, and authentic pedagogy.
The results show that teachers often prefer direct instruction, and seldom promote discovery
learning. While teachers sometimes realize authentic pedagogy, constructive learning tasks
are seldom used. Teachers' reported practice and parents' preferences for their children appear
to correspond reasonably.
Results of multiple regression analyses show that the use of the three types of learning
environments yields different predictors. For the use of discovery learning and authentic
pedagogy, confidence in students' regulative skills is an important predictor. In predicting the
use of direct instruction, the teacher's own conception of learning turns out to be an important
predictor.

 Bechger:2004aa


STRUCTURAL EQUATION MODELLING OF MULTIPLE FACET DATA: EXTENDING MODELS FOR MULTITRAIT-MULTIMETHOD DATA
T. M. Bechger and G. Maris
(2004)
This paper is about the structural equation modelling of quantitative measures that are obtained from a multiple facet design. A facet is simply a set consisting of a finite number of elements. It is assumed that measures are
obtained by combining each element of each facet. Methods and traits are two such facets, and a multitrait-multimethod study is a two-facet design. We extend models that were proposed for multitrait-multimethod data by
Wothke (1984, 1996) and Browne (1984, 1989, 1993), and demonstrate how they can be fitted using standard software for structural equation modelling. Each model is derived from the model for individual measurements in order to clarify the first principles underlying each model.

 Verhelst:2002aa


Testing the unidimensionality assumption of the Rasch model
N. Verhelst
(2002)
Statistical tests especially designed to test the unidimensionality axiom of the Rasch model are scarce. For two of them, the Martin-Löf test
(ML test) and the splitter-item technique, an extensive power analysis has been carried out, showing clearly the superiority of the ML test. The disadvantage of the ML test, however, is that its null distribution deviates strongly from the asymptotic chi-square distribution unless one has huge samples. A new test with one degree of freedom is proposed. Its power is
superior to that of the ML test, and its null distribution converges rapidly to the chi-square.

 Verstralen:2001aa


The Combined Use of Classical Test Theory and Item Response Theory
H. Verstralen and T. Bechger and G. Maris
(2001)
The present paper concerns a number of relations between concepts from classical test theory (CTT), such as reliability, and item response theory (IRT). It is demonstrated that the use of IRT models allows us to extend the range of applications of CTT and to investigate relations among concepts that are central in CTT, such as reliability and item-test correlation.

 Bechger:2003ac


The componential Nedelsky model: A first exploration
T. Bechger and G. Maris
(2003)

 Bechger:2003aa


The Nedelsky model for multiple choice items
T. Bechger and G. Maris and H. Verstralen and N. Verhelst
(2003)

 Maris:2003ad


Two methods for the practical analysis of rating data
G. Maris and T. Bechger
(2003)

 Verguts:2001


Some Mantel-Haenszel tests of Rasch model assumptions
T. Verguts and P. D. Boeck
British Journal of Mathematical and Statistical Psychology
54
21-37
(2001)

 Verguts:2000


A note on the Martin-Löf test for unidimensionality
T. Verguts and P. D. Boeck
Methods of Psychological Research Online
5
(2000)

 Prieto:2003


Classical test theory versus Rasch analysis for quality of life questionnaire reduction
L. Prieto and J. Alonso and R. Lamarca
Health and Quality of Life Outcomes
1
(2003)

 Jiao:2004


Evaluating the dimensionality of the Michigan English Language Assessment Battery
H. Jiao
2
27-52
(2004)

 Rizopoulos:2005


Nonlinear effects in generalized latent variable models
D. Rizopoulos
(2005)

 Smith:2007


A Rasch and factor analysis of the Functional Assessment of Cancer Therapy-General (FACT-G)
A. B. Smith and P. Wright and P. J. Selby and G. Velikova
Health and Quality of Life Outcomes
5
(2007)

 Raiche:2005


Critical eigenvalue sizes in standardized residual principal components analysis
G. Raîche
Rasch Measurement Transactions
19
1012
(2005)

 Smith:2005


Rasch analysis of the dimensional structure of the hospital anxiety and depression scale
A. B. Smith and E. P. Wright and R. Rush and D. P. Stark and G. Velikova and P. J. Selby
Psycho-Oncology
(2005)

 Flieller:1994


Méthodes d'étude de l'adéquation au modèle logistique à un paramètre (modèle de Rasch)
A. Flieller
Mathématiques et Sciences Humaines
127
19-47
(1994)

 Orlando:2000


Critical issues to address when applying Item Response Theory (IRT) models
M. Orlando
(2000)

 Bock:1970


Fitting a response model for n dichotomously scored items
R. D. Bock and M. Lieberman
Psychometrika
35
179-197
(1970)

 Socan:2000


Assessment of reliability when test items are not essentially tauequivalent
G. Socan
(2000)

 Junker:1996


Exploring monotonicity in polytomous item response data
B. W. Junker
(1996)

 Junker:2000a


Nonparametric IRT in Action: An overview of the special issue
B. W. Junker and K. Sijtsma
(2000)

 Linardakis:1996


An approach to multidimensional item response modeling
M. Linardakis and P. Dellaportas
(1996)

 Junker:2000


Monotonicity and conditional independence in models for student assessment and attitude measurement
B. W. Junker
(2000)

 Rudner:2001


Measurement Decision Theory
L. M. Rudner
(2001)

 Bentler:2004


Maximal reliability for unitweighted composites
P. M. Bentler
(2004)

 Sijtsma:1994


A survey of theory and methods of invariant item ordering
K. Sijtsma and B. W. Junker
(1994)

 Mazor:1995


Using logistic regression and the Mantel-Haenszel with multiple ability estimates to detect Differential Item Functioning
K. M. Mazor and A. Kanjee and B. E. Clauser
Journal of Educational Measurement
32
131-144
(1995)

 Zwick:1990


When do item response function and Mantel-Haenszel definitions of Differential Item Functioning coincide?
R. Zwick
Journal of Educational Statistics
15
185-197
(1990)

 Shapiro:2000


The asymptotic bias of minimum trace factor analysis, with applications to the greatest lower bound to reliability
A. Shapiro and J. M. F. ten Berge
Psychometrika
65
413-425
(2000)

 Callender:1979


An empirical comparison of coefficient alpha, Guttman's lambda-2, and MSPLIT maximized split-half reliability estimates
J. C. Callender and H. G. Osburn
Journal of Educational Measurement
16
89
(1979)

 Adema:1989


Algorithms for computerized test construction using classical item parameters
J. J. Adema and W. J. van der Linden
Journal of Educational Statistics
14
279-290
(1989)

 Armstrong:1998


Optimization of classical reliability in test construction
R. D. Armstrong and D. H. Jones
Journal of Educational and Behavioral Statistics
23
1-17
(1998)

 Raiche:2002a


La simulation d'un test adaptatif basé sur le modèle de Rasch
G. Raîche
(2002)

 Kintsch:1999


The role of longterm memory in text comprehension
W. Kintsch and V. L. Patel and K. A. Ericsson
Psychologia
42
186-198
(1999)

 Hermann:1999


Assessing leadership style: A trait analysis
M. G. Hermann
(1999)

 Papadimitriou:1997


Latent semantic indexing: A probabilistic analysis
C. H. Papadimitriou and P. Raghavan and H. Tamaki
(1997)

 Deerwester:1990


Indexing by latent semantic analysis
S. Deerwester and S. T. Dumais and R. Harshman
Journal of the American Society for Information Science
41
391-407
(1990)

 Leeuw:2003


Principal component analysis with binary data. Applications to roll-call analysis
J. de Leeuw
(2003)

 Lawrence:2005


Probabilistic nonlinear principal component analysis with Gaussian process latent variable models
N. Lawrence
Journal of Machine Learning Research
6
1783-1816
(2005)

 Kemkes:2006


Objective scoring for computing competition tasks
G. Kemkes and T. Vasiga and G. Cormack
(2006)

 Eggen:2005


Computerized adaptive testing
T. Eggen
(2005)

 Krzanowski:2006


Sensitivity in metric scaling and analysis of distance
W. J. Krzanowski
Biometrics
62
239-244
(2006)

 Hofmann:1999


Probabilistic latent semantic analysis
T. Hofmann
(1999)

 Farahat:2006


Improving probabilistic latent semantic analysis with principal component analysis
A. Farahat and F. Chen
(2006)

 Allegre:2003


Un système d'observation et d'analyse en direct de séances d'enseignement
E. Allègre and P. Dessus
(2003)

 Rehder:1998


Using latent semantic analysis to assess knowledge: Some technical considerations
B. Rehder and M. E. Schreiner and M. B. W. Wolfe and D. Laham
(1998)

 Wolfe:1998


Learning from text: Matching readers and texts by latent semantic analysis
M. B. W. Wolfe and M. E. Schreiner and B. Rehder and D. Laham
Discourse Processes
25
309-336
(1998)

 Foltz:1998


The measurement of textual coherence with latent semantic analysis
P. W. Foltz and W. Kintsch and T. K. Landauer
Discourse Processes
25
285-307
(1998)

 Landauer:1998


An introduction to latent semantic analysis
T. K. Landauer and P. W. Foltz and D. Laham
Discourse Processes
25
259-284
(1998)

 Huang:2003


Psychometric analyses based on evidence-centered design and cognitive science of learning to explore students' problem-solving in physics
C. Huang
(2003)

 Lazarevska:2005


The distinctive language of terrorists
E. Lazarevska and J. M. Sholl and M. Young
(2005)

 Zumbo:1999


A handbook on the theory and methods of Differential Item Functioning (DIF)
B. D. Zumbo
(1999)

 Agresti:2005


Bayesian inference for categorical data analysis
A. Agresti and D. Hitchcock
Statistical Methods and Applications (Journal of the Italian Statistical Society)
(2005)

 Davis:2002


Strategies for controlling item exposure in computerized adaptive testing with polytomously scored items
L. L. Davis
(2002)

 Landauer:1997


How Well Can Passage Meaning be Derived without Using Word Order? A Comparison of Latent Semantic Analysis and Humans
T. K. Landauer and D. Laham and B. Rehder and M. E. Schreiner
(1997)

 Landauer:2004


From paragraph to graph: latent semantic analysis for information visualization
T. K. Landauer and D. Laham and M. Derr
Proceedings of the National Academy of Sciences USA
101
5214-5219
(2004)
Most techniques for relating textual information rely on intellectually created links such as author-chosen keywords and titles, authority indexing terms, or bibliographic citations. Similarity of the semantic content of whole documents, rather than just titles, abstracts, or overlap of keywords, offers an attractive alternative. Latent semantic analysis provides an effective dimension reduction method for the purpose that reflects synonymy and the sense of arbitrary word combinations. However, latent semantic analysis correlations with human text-to-text similarity judgments are often empirically highest at approximately 300 dimensions. Thus, two- or three-dimensional visualizations are severely limited in what they can show, and the first and/or second automatically discovered principal component, or any three such for that matter, rarely capture all of the relations that might be of interest. It is our conjecture that linguistic meaning is intrinsically and irreducibly very high dimensional. Thus, some method to explore a high dimensional similarity space is needed. But the 2.7 x 10^7 projections and infinite rotations of, for example, a 300-dimensional pattern are impossible to examine. We suggest, however, that the use of a high dimensional dynamic viewer with an effective projection pursuit routine and user control, coupled with the exquisite abilities of the human visual system to extract information about objects and from moving patterns, can often succeed in discovering multiple revealing views that are missed by current computational algorithms. We show some examples of the use of latent semantic analysis to support such visualizations and offer views on future needs.
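The dimension reduction step underlying LSA is a truncated SVD of a term-document matrix; document similarity is then measured in the reduced space. A minimal sketch on a toy matrix (illustrative counts and terms, not the authors' data):

```python
import numpy as np

# Toy term-by-document count matrix; rows are terms, columns are documents.
# Documents 0-1 share vocabulary ("graph", "node"); documents 2-3 share
# a different vocabulary ("gene", "cell").
X = np.array([
    [2, 1, 0, 0],   # "graph"
    [1, 2, 0, 0],   # "node"
    [0, 0, 2, 1],   # "gene"
    [0, 0, 1, 2],   # "cell"
], dtype=float)

# LSA: keep only the k largest singular directions of X = U S V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
docs_k = (np.diag(s[:k]) @ Vt[:k]).T     # documents in the k-dim latent space

def cos(u, v):
    """Cosine similarity between two latent-space vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_01 = cos(docs_k[0], docs_k[1])   # same topic: high similarity
sim_02 = cos(docs_k[0], docs_k[2])   # different topics: near zero
```

In realistic applications the matrix has tens of thousands of terms and, as the abstract notes, the retained dimensionality is typically around 300 rather than 2.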

 Bianco:2005


Modélisation des processus de hiérarchisation et d'application de macrorègles et conception d'un prototype d'aide au résumé
M. Bianco and P. Dessus and B. Lemaire and S. Mandin and P. Mendelsohn
(2005)

 Laham:1997


Latent semantic analysis approaches to categorization
D. Laham
979
(1997)

 Bestgen:2002


L'analyse sémantique latente et l'identification des métaphores
Y. Bestgen and A. Cabiaux
(2002)

 Hernandez:2006


A Procedure for Estimating Intrasubject Behavior Consistency
J. M. Hernández and V. J. Rubio and J. Revuelta and J. Santacreu
Educational and Psychological Measurement
66
417-434
(2006)
Trait psychology implicitly assumes consistency of personal traits. Mischel, however, argued against the idea of a general consistency of human beings. The present article aims to design a statistical procedure, based on an adaptation of the $\pi^*$ statistic, to measure the degree of intraindividual consistency independently of the measure used. Three studies were carried out to test the suitability of the $\pi^*$ statistic and the proportion of subjects who act consistently. Results show the appropriateness of the proposed statistic, and that the percentage of consistent individuals depends on whether the test items can be assumed to be equivalent and on the number of response alternatives they contain. The results suggest that the percentage of consistent subjects is far from 100%, and this percentage decreases when items are equivalent. Moreover, the greater the number of response options, the smaller the percentage of consistent individuals.

 Revuelta:2004


Analysis of distractor difficulty in MultipleChoice items
J. Revuelta
Psychometrika
69
217-234
(2004)
Two psychometric models are presented for evaluating the difficulty of the distractors in multiplechoice items. They are based on the criterion of rising distractor selection ratios, which facilitates interpretation of the subject and item parameters. Statistical inferential tools are developed in a Bayesian framework: modal a posteriori estimation by application of an EM algorithm and model evaluation by monitoring posterior predictive replications of the data matrix. An educational example with real data is included to exemplify the application of the models and compare them with the nominal categories model.

 Wang:1998


An ANOVA-like Rasch analysis of differential item functioning
W. Wang
(1998)

 Blais:2003


Une étude de l'accord et de la fidélité inter juges comparant un modèle de la théorie de la généralisabilité et un modèle de la famille de Rasch
J. Blais and N. Loye
(2003)

 Raiche:2002


Objective measurement: Theory into practice
G. Raîche and J. Blais
6
(2002)

 Bailey:2001


Ideal point estimation with a small number of votes: A randomeffects approach
M. Bailey
Political Analysis
9
192-210
(2001)

 Youness:2004


Contributions à une méthodologie de comparaison de partitions
G. Youness
(2004)

 Way:2006


Practical questions in introducing computerized adaptive testing for K-12 assessments
W. D. Way and L. L. Davis and S. Fitzpatrick
(2006)

 Schein:2003


A generalized linear model for principal component analysis of binary data
A. I. Schein and L. K. Saul and L. H. Ungar
(2003)

 Hardouin:2005


Construction d'échelles d'items unidimensionnelles en qualité de vie
J. Hardouin
(2005)

 Klein:2005


Graphical models for panel studies, illustrated on data from the Framingham Heart Study
J. P. Klein and N. Keiding and S. Kreiner
(2005)

 Partchev:2004


A visual guide to item response theory
I. Partchev
(2004)

 Gruijter:2005


Statistical test theory for education and psychology
D. N. M. de Gruijter and L. J. T. van der Kamp
(2005)

 Boeck:2004


Explanatory Item Response Models: a Generalized Linear and Nonlinear Approach
P. D. Boeck and M. Wilson
(2004)
http://www.springer.com/west/home?SGWID=410222269224280&changeHeader=true&SHORTCUT=www.springer.com/9780387402758

 Rijmen:2003


A nonlinear mixed model framework for item response theory
F. Rijmen and F. Tuerlinckx and P. D. Boeck and P. Kuppens
Psychological Methods
8
185-205
(2003)
Mixed models take the dependency between observations based on the same cluster into account by introducing one or more random effects. Common item response theory (IRT) models introduce latent person variables to model the dependence between responses of the same participant. Assuming a distribution for the latent variables, these IRT models are formally equivalent to nonlinear mixed models. It is shown how a variety of IRT models can be formulated as particular instances of nonlinear mixed models. The unifying framework offers the advantage that relations between different IRT models become explicit and that it is rather straightforward to see how existing IRT models can be adapted and extended. The approach is illustrated with a self-report study on anger.
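As a concrete instance of the equivalence described in this abstract, the Rasch model for person $p$ and item $i$ is a logistic mixed model with a random person intercept and fixed item effects:

```latex
P(Y_{pi} = 1 \mid \theta_p)
  = \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)},
\qquad \theta_p \sim N(0, \sigma^2),
```

i.e., $\operatorname{logit} P(Y_{pi} = 1) = \theta_p - \beta_i$, which fits directly into the generalized linear mixed model framework; models with discrimination parameters, such as the 2PL, require the nonlinear extension.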

 May:2006


A multilevel bayesian item response theory method for scaling socioeconomic status in international studies of education
H. May
Journal of Educational and Behavioral Statistics
31
63-79
(2006)
A new method is presented and implemented for deriving a scale of socioeconomic status (SES) from international survey data using a multilevel Bayesian item response theory (IRT) model. The proposed model incorporates both international anchor items and nationspecific items and is able to (a) produce student family SES scores that are internationally comparable, (b) reduce the influence of irrelevant national differences in culture on the SES scores, and (c) effectively and efficiently deal with the problem of missing data in a manner similar to Rubin's (1987) multiple imputation approach. The results suggest that this model is superior to conventional models in terms of its fit to the data and its ability to use information collected via international surveys.

 Borsboom:2004fj


The concept of validity
D. Borsboom and G. J. Mellenbergh and J. van Heerden
Psychological Review
111
1061-1071
(2004)
http://users.fmg.uva.nl/dborsboom/papers.htm
This article advances a simple conception of test validity: A test is valid for measuring an attribute if (a) the attribute exists and (b) variations in the attribute causally produce variation in the measurement outcomes. This conception is shown to diverge from current validity theory in several respects. In particular, the emphasis in the proposed conception is on ontology, reference, and causality, whereas current validity theory focuses on epistemology, meaning, and correlation. It is argued that the proposed
conception is not only simpler but also theoretically superior to the position taken in the existing literature. Further, it has clear theoretical and practical implications for validation research. Most important, validation research must not be directed at the relation between the measured attribute and other attributes but at the processes that convey the effect of the measured attribute on the test scores.

 Bond:2003


Validity and assessment: a rasch measurement perspective
T. G. Bond
Metodologia de las Ciencias del Comportamiento
5
179-194
(2003)
This paper argues that the Rasch model, unlike the other models generally referred to as IRT models, and those that fall into the tradition of True Score models, encompasses a set of rigorous prescriptions for what scientific measurement would be like if it were to be achieved in the social sciences. As a direct consequence, the Rasch measurement approach to the construction and monitoring of variables is sensitive to the issues raised in Messick's (1995) broader conception of construct validity. The theory / practice dialectic (Bond & Fox, 2001) ensures that validity is foremost in the mind of those developing measures and that genuine scientific measurement is foremost in the minds of those who seek valid outcomes from assessment. Failures of invariance, such as those referred to as DIF, should alert researchers to the need to modify assessment procedures or the substantive theory under investigation, or both.
