The effects of matrix sampling on student score comparability in constructed-response and multiple-choice assessments
J. Dings and R. Childs and N. Kingston

D. R. Thomas and I. R. R. Lu and B. D. Zumbo

Fitting a finite mixture distribution to a variable subject to heteroscedastic measurement error
M. Thamerus

Scaling Methodology and Procedures for the TIMSS Mathematics and Science Scales
K. Yamamoto and E. Kulick

Using Trapezoidal Rule for the Area Under a Curve Calculation
S. Yeh

The SAS Macro-Program %AnaQol to Estimate the Parameters of Item Responses Theory Models
J. Hardouin
Communications in Statistics - Simulation and Computation  36  437-453  (2007)

Multilevel IRT Model Assessment
J. Fox

Modeling Measurement Error in Structural Multilevel Models
J. Fox and C. A. W. Glas

The Effect of Missing Data Imputation on Mokken Scale Analysis
L. A. van der Ark and K. Sijtsma

Statistical Models for Categorical Variables
L. A. van der Ark and M. A. Croon and K. Sijtsma

Hierarchically Related Nonparametric IRT Models, and Practical Data Analysis Methods
L. A. van der Ark and B. T. Hemker and K. Sijtsma

Progress in NIRT Analysis of Polytomous Item Scores: Dilemmas and Practical Solutions
K. Sijtsma and L. A. van der Ark

This paper discusses three open problems in nonparametric polytomous item response theory: (1) theoretically, the latent trait $\theta$ is not stochastically ordered by the observed total score $X_+$; (2) the models do not imply an invariant item ordering; and (3) the regression of an item score on the total score $X_+$ or on the restscore $R$ is not a monotone nondecreasing function and, as a result, it cannot be used for investigating the monotonicity of the item step response function. Tentative solutions for these problems are discussed. The computer program MSP for nonparametric IRT analysis is based on models which imply neither the stochastic ordering property nor an invariant item ordering. Also, MSP uses item-restscore regression for investigating item step response functions. It is discussed whether computer programs may be based (temporarily) on models which lack desirable properties and use methods which are not (yet) supported by sound psychometric theory.
Contributions to Latent Budget Analysis: A Tool For the Analysis of Compositional Data.
L. A. van der Ark

Graphical Display of Latent Budget Analysis and Latent Class Analysis, with Special Reference to Correspondence Analysis
L. A. van der Ark and P. G. M. van der Heijden

Some Examples of Latent Budget Analysis and its Extensions
P. G. M. van der Heijden and L. A. van der Ark and A. Mooijaart

D. R. Thomas and A. Cyr

Assessing agreement on classification tasks: the kappa statistic
J. Carletta
Computational Linguistics  22    (1996)

Sparse principal component analysis
H. Zou and T. Hastie and R. Tibshirani

Measuring Client Satisfaction with Public Education III: Group Effects in Client Satisfaction
T. G. Bond and J. A. King
Journal of Applied Measurement  4  326-334  (2003)

Measuring Client Satisfaction with Public Education II: Comparing Schools with State Benchmarks
T. G. Bond and J. A. King
Journal of Applied Measurement  4  258-268  (2003)

Measuring Client Satisfaction with Public Education I: Meeting Competing Demands in Establishing State-wide Benchmarks
J. A. King and T. G. Bond
Journal of Applied Measurement  4  111-123  (2003)

A Componential IRT Model for Guilt
D. J. M. Smits and P. D. Boeck
Multivariate Behavioral Research  38  161-188  (2003)

Evaluation of Relations between Scales in an IRT Framework
K. Jehangir

Neural network and logistic regression. Part I
M. Schumacher and R. Rossner and W. Vach

Un modèle de réponses aux items. Propriétés et comparaison de groupes de traitement en épidémiologie
J. Tricot and M. Mesbah
Revue de Statistique Appliquée  48  29-39  (2000)

Setting Cut Scores: Critical Review of Angoff and Modified-Angoff Methods
K. L. Ricker

This paper presents a critical review of the Angoff (1971) and Angoff-derived methods, according to criteria for assessing cut score setting methods originally proposed by Berk (1986) and further recommendations by Hambleton (2001). The criteria have been updated to reflect the progress that has been made in standard-setting research over the past 17 years. The paper also discusses the assumptions of the Angoff method, and other current issues surrounding this method. Recommendations for using the Angoff method are made.
Y. Sheng

IRT models for subjective weights of options of multiple choice questions
H. H. F. M. Verstralen and N. D. Verhelst

Exchangeable Rasch Matrices
S. L. Lauritzen

Bootstrap Inference in a Linear Equation Estimated by Instrumental Variables
R. Davidson and J. MacKinnon

P. Festy and L. Prokofieva

Presence-only data and the EM algorithm
G. Ward and T. Hastie and S. C. Barry and J. Elith and J. R. Leathwick
Biometrics      (2008)

On the applicability of some IRT models for repeated measurement designs: Conditions, consequences, and Goodness-of-Fit tests
I. Ponocny
Methods of Psychological Research Online  7  21-40  (2002)

A hierarchical model for estimating response time distributions
J. N. Rouder and J. Lu and P. Speckman and D. Sun and Y. Jiang
Psychonomic Bulletin & Review  12  195-223  (2005)

Power and Sample Size Determination for Linear Models
J. M. Castelloe and R. G. O'Brien

Support for an auto-associative model of spoken cued recall: Evidence from fMRI
G. de Zubicaray and K. McMahon and M. Eastburn and A. J. Pringle and L. Lorenz and M. S. Humphreys
Neuropsychologia  45  824-835  (2007)

The Added Value of Multidimensional IRT Models
R. D. Gibbons and J. C. Immekus and R. D. Bock

NAEP-QA FY06 Special Study: 12th Grade Math Trend Estimates
T. E. Diaz and H. A. Le and L. L. Wise

A fast dual algorithm for kernel logistic regression
S. S. Keerthi and K. Duan and S. K. Shevade and A. N. Poo

K. Byström and K. Järvelin

Development of a Short Form of the Severe Impairment Battery
J. Saxton and K. B. Kastango and L. Hugonot-Diener and F. Boller and M. Verny and C. E. Sarles and R. R. Girgis and E. Devouche and P. Mecocci and B. G. Pollock and S. T. DeKosky
American Journal of Geriatric Psychiatry  13    (2005)

A new approach for interexaminer reliability data analysis on dental caries calibration
A. V. Assaf and E. P. da Silva Tagliaferro and M. de Castro Meneghim and C. Tengan and A. C. Pereira and G. M. B. Ambrosano and F. L. Mialhe
Journal of Applied Oral Science  15    (2007)

Les banques d'items. Construction d'une banque pour le Test de Connaissance du Français
E. Devouche
Psychologie et Psychométrie  24  57-88  (2003)

T. N. Postlethwaite
Perspectives : revue trimestrielle d'éducation comparée  XXIII  697-707  (1993)

IRT-Based Internal Measures of Differential Functioning of Items and Tests
N. S. Raju and W. J. van der Linden and P. F. Fleer
Applied Psychological Measurement  19  353-368  (1995)

An interval scale for development of children aged 0-2 years
G. Jacobusse and S. van Buuren and P. H. Verberk
Statistics in Medicine  25  2272-2283  (2006)

A model of grounded language acquisition: Sensorimotor features improve lexical and grammatical learning
S. R. Howell and D. Jankowicz and S. Becker
Journal of Memory and Language  53  258-276  (2005)

Differences between multiple-choice and constructed response items in PIRLS 2001
D. Hastedt

L. Devroye

This chapter provides a survey of the main methods in non-uniform random variate generation, and highlights recent research on the subject. Classical paradigms such as inversion, rejection, guide tables, and transformations are reviewed. We provide information on the expected time complexity of various algorithms, before addressing modern topics such as indirectly specified distributions, random processes, and Markov chain methods.
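The inversion paradigm surveyed above can be sketched in a few lines. The following is an illustrative example, not code from the chapter, assuming the exponential distribution as the worked case because its CDF inverts in closed form.

```python
# Inversion sampling sketch: if U ~ Uniform(0, 1) and F is the target CDF,
# then F^{-1}(U) has distribution F. For Exponential(lam),
# F(x) = 1 - exp(-lam * x), so F^{-1}(u) = -ln(1 - u) / lam.
import math
import random

def exponential_by_inversion(lam, u=None, rng=random):
    """Draw one Exponential(lam) variate by inverting the CDF."""
    if u is None:
        u = rng.random()              # U ~ Uniform(0, 1)
    return -math.log(1.0 - u) / lam  # x = F^{-1}(u)
```

Passing `u` explicitly makes the quantile transform easy to check: the median of Exponential(lam) is ln(2)/lam, which is exactly what u = 0.5 returns.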
Multilevel statistical models
H. Goldstein

Multilevel Homogeneity Analysis
G. Michailidis

The Common European Framework

Exploratory Measurement Invariance: A New Method Based on Item Response Theory
A. W. Meade and J. K. Ellington and S. B. Craig

Examination Development Guidelines
M. E. Lunz

An empirical comparison of item response theory and classical test theory item/person statistics
T. G. Courville

Annual College of Education Educational Research Exchange


A Comparison Between Item Analysis Based on Item Response Theory and Classical Test Theory. A Study of the SweSAT Subtest WORD
C. Stage

Classical Test Theory or Item Response Theory: The Swedish Experience
C. Stage

Automation and visualization of distractor analysis using SAS/GRAPH
C. H. Yu

Canadian Journal of Education
S. Robinson
  25    (2000)

Consumer Choice of Food Products and the Implications for Price Competition and Government Labeling Policy
E. M. Mojduszka and J. A. Caswell and J. M. Harris

Fitting Logistic IRT Models: Small Wonder
M. A. Garcia-Perez
The Spanish Journal of Psychology  2  74-94  (1999)

Classical And Rasch Analyses Of Dichotomously Scored Reading Comprehension Test Items
A. M. Zubairi and N. L. A. Kassim
Malaysian Journal of ELT Research  2    (2006)

Estimating PISA students on the IALS prose literacy scale
K. Yamamoto

Absolute Identification by Relative Judgment
N. Stewart and G. D. A. Brown and N. Chater
Psychological Review  112  881-911  (2005)

Estimation for the Rasch Model under a linkage structure: a case study
V. Cazievel

Performance Benefits of Simultaneous over Sequential Menus As Task Complexity Increases
H. Hochheiser and B. Shneiderman

Book Review: Developing and Validating Multiple-Choice Test Items (3rd ed.)
E. V. Smith
Applied Psychological Measurement  30  69-72  (2006)

Verification of Cognitive Attributes Required to Solve the TIMSS-1999 Mathematics Items for Taiwanese Students
Y. Chen and J. Gorin and M. Thompson

Bayesian hierarchical analysis of polytomous item responses
K. Shigemasu and O. Yoshimura and T. Nakamura
Behaviormetrika  27  51-65  (2000)

What respondents learn from questionnaires: The survey interview and the logic of conversation
N. Schwarz
International Statistical Review  63  153-177  (1995)

T. G. K. Bryce
British Educational Research Journal  7    (1981)

The Multidimensional Random Coefficients Multinomial Logit Model
R. J. Adams and M. Wilson and W. Wang
Applied Psychological Measurement  21  1-24  (1997)

Equating errors in international surveys in education
C. Monseur and H. Sibberns and D. Hastedt

The Multidimensional Measure of Conceptual Complexity
N. J. S. Brown

A computer-aided environment for generating multiple-choice test items
R. Mitkov and L. A. Ha and N. Karamanis
Natural Language Engineering  1  1-17  (2005)

Modelling Mathematics Problem Solving Item Responses Using a Multidimensional IRT Model
M. Wu and R. Adams
Mathematics Education Research Journal  18  93-113  (2006)

A Longitudinal Study of Student Understanding of Chance and Data
J. Watson and B. Kelly
Mathematics Education Research Journal  18  40-55  (2006)

A Case of the Inapplicability of the Rasch Model: Mapping Conceptual Learning
K. Stacey and V. Steinle
Mathematics Education Research Journal  18  77-92  (2006)

Surveying Primary Teachers about Compulsory Numeracy Testing: Combining Factor Analysis with Rasch Analysis
P. Grimbeek and S. Nisbet
Mathematics Education Research Journal  18  27-39  (2006)

Easier Analysis and Better Reporting: Modelling Ordinal Data in Mathematics Education Research
B. Doig and S. Groves
Mathematics Education Research Journal  18  56-76  (2006)

Applying the Rasch Rating Scale Model to Gain Insights into Students' Conceptualisation of Quality Mathematics Instruction
K. Bradley and S. Sampson and K. Royal
Mathematics Education Research Journal  18  11-26  (2006)

A Manual for Conducting Analyses with Data from TIMSS and PISA
J. D. Willms and T. Smith

Co-inertia analysis and the linking of ecological data tables
S. Dray and D. Chessel and J. Thioulouse
Ecology  84  3078-3089  (2003)

Random coefficient models for multilevel analysis
J. de Leeuw and I. Kreft
Journal of Educational Statistics  11  57-85  (1986)

John W. Tukey's contributions to multiple comparisons
Y. Benjamini and H. Braun
The Annals of Statistics  30  1576-1594  (2002)

Multivariate data analysis: The French way
S. Holmes

WINMIRA -- program description and recent enhancements
M. von Davier
Methods of Psychological Research - Online  2  25-28  (1997)

Version abrégée de la severe impairment battery (SIB)
L. Hugonot-Diener and M. Verny and E. Devouche and J. Saxton and P. Mecocci and F. Boller
Psychologie & Neuropsychiatrie du Vieillissement  1  273-283  (2003)

Mesures objectives de traits latents
J. Antonietti

Comment s'assurer de l'alignement d'un ensemble d'items
J. Antonietti

Designs de testage incomplets et modèle non-paramétrique de la réponse à l'item
J. Antonietti

Comment mesurer la similarité entre deux structures factorielles latentes
J. Antonietti

SAS macros for Rasch based latent variable modelling
K. B. Christensen and J. B. Bjorner

Latent Covariates in Generalized Linear Models
K. B. Christensen and M. L. Nielsen and L. Smith-Hansen

Forecasting the political behavior of leaders with the verbs in context system of operational code analysis
S. G. Walker

Application de l'analyse factorielle multiple pour le traitement de caractères en échelle dans les enquêtes
S. Camiz and J. Pagès

On local estimating equations in additive multiparameter models
G. Claeskens and M. Aerts

Variable Selection and Principal Component Analysis
N. Al-Kandari

A Comparative Study of Principal Component Analysis Techniques
R. A. Calvo and M. Partridge and M. A. Jabri

Applied Latent Class Analysis
L. A. Goodman

Intraobserver and interobserver reliability of the R/D score for evaluation of iris configuration by ultrasound biomicroscopy, in patients with pigment dispersion syndrome
M. O. Balidis and C. Bunce and K. Boboridis and J. Salzman and R. P. L. Wormald and M. H. Miller
Eye  16  722-726  (2002)

Selecting the number of response categories for a Likert-type scale
N. J. Birkett

Using item response theory to explore the psychometric properties of extended matching questions examination in undergraduate medical education
B. Bhakta and A. Tennant and M. Horton and G. Lawton and D. Andrich
BMC Medical Education  5    (2005)

The improved Clinical Global Impression Scale (iCGI): development and validation in depression
A. Kadouri and E. Corruble and B. Falissard
BMC Psychiatry  7    (2007)

The Adolescent Depression Rating Scale (ADRS): a validation study
A. Revah-Levy and B. Birmaher and I. Gasquet and B. Falissard
BMC Psychiatry  7    (2007)

Independent Factor Discriminant Analysis
A. Montanari and D. G. Calo and C. Viroli

Computational strategies for multivariate linear mixed-effects models with missing values
J. L. Schafer and R. M. Yucel
Journal of Computational and Graphical Statistics  11  437-457  (2002)

Graphical Representation of Multidimensional Item Response Theory Analyses
T. Ackerman
Applied Psychological Measurement  20  311-329  (1996)

Calculation of the Kappa Statistic for Inter-rater Reliability: The Case Where Raters Can Select Multiple Responses from a Large Number of Categories
C. R. Stein and R. B. Devore and B. E. Wojcik

Statistics and Probability
J. de Leeuw

The pseudoscience of psychometry and the Bell Curve
J. L. Graves
Journal of Negro Education  64  277-  (1995)

A study of Raven Standard Progressive Matrices test's item measures under classic and item response models: An empirical comparison
N. Cikrikci-Demirtasli

Asymmetric Loss Functions and Sample Size Determination: A Bayesian Approach
H. P. Stüger
Austrian Journal of Statistics  35  57-66  (2006)

Evaluation des compétences en mathématiques en fin de 2e année primaire
J. Antonietti and N. Guignard and A. Mudry and L. Ntamakiliro and W. Rieben and C. T. Christinat and A. V. der Klink

Fidélité et validité de la version française du "Children of Alcoholics Screening Test" (CAST)
H. Charland and G. Côté
Revue québécoise de psychologie  17  45-62  (1996)

Algorithmes et codes R pour la méthode de la pseudo-vraisemblance empirique dans les sondages
C. Wu
Techniques d'enquête  31  261-266  (2005)

Checking for Nonresponse Bias in Web-Only Surveys of Special Populations using a Mixed-Mode (Web-with-Mail) Design
B. J. Grim and L. M. Semali

Reliability Generalization of self-report of emotions when using the Differential Emotions Scale
E. A. Youngstrom and K. W. Green
Educational and Psychological Measurement  62    (2002)

Assessing the reliability of Beck Depression Inventory scores: Reliability Generalization across studies
P. Yin and X. Fan
Educational and Psychological Measurement  60  201-223  (2000)

Reliability Generalization of the Life Satisfaction Index
K. A. Wallace and A. J. Wheeler
Educational and Psychological Measurement  62    (2002)

Measurement error in "Big Five Factors" personality assessment: Reliability Generalization across studies and measures
C. Viswesvaran and D. Ones
Educational and Psychological Measurement  60  224-235  (2000)

Reliability generalization: Exploring reliability variations on MMPI/MMPI-2 Validity scale scores
T. Vacha-Haase and C. R. Tani and L. R. Kogan and R. A. Woodall and B. Thompson
Assessment  8  391-401  (2001)

Reliability generalization: Exploring reliability coefficients of MMPI clinical scales scores
T. Vacha-Haase and L. Kogan and C. R. Tani and R. A. Woodall
Educational and Psychological Measurement  61  45-59  (2001)

Reliability Generalization: Moving toward improved understanding and use of score reliability
T. Vacha-Haase and R. K. Henson and J. Caruso
Educational and Psychological Measurement  62    (2002)

Reliability generalization: Exploring variance in measurement error affecting score reliability across studies
T. Vacha-Haase
Educational and Psychological Measurement  58  6-20  (1998)

Stability of the reliability of LibQUAL+™ scores: A "Reliability Generalization" meta-analysis study
B. Thompson and C. Cook
Educational and Psychological Measurement  62    (2002)

A Reliability Generalization study of select measures of adult attachment style
R. J. Reese and K. M. Kieffer and B. K. Briggs
Educational and Psychological Measurement  62    (2002)

Reliability Generalization: An examination of the Career Decision-making Self-efficacy Scale
J. E. Nilsson and C. K. Schmidt and W. D. Meek
Educational and Psychological Measurement  62    (2002)

Expanding reliability generalization methods with KR-21 estimates: An RG study of the Coopersmith Self-esteem Inventory
G. G. Lane and A. E. White and R. K. Henson
Educational and Psychological Measurement  62    (2002)

A Reliability Generalization study of the Geriatric Depression Scale (GDS)
K. M. Kieffer and R. J. Reese
Educational and Psychological Measurement  62    (2002)

Characterizing measurement error in scores across studies: Some recommendations for conducting "Reliability Generalization" (RG) studies
R. K. Henson and B. Thompson

Given the potential value of reliability generalization (RG) studies in the development of cumulative psychometric knowledge, the purpose of this paper is to provide a tutorial on how to conduct such studies and to serve as a guide for researchers wishing to use this methodology. After some brief comments on classical test theory, the paper provides a practical framework for structuring an RG study, including: (1) test selection with an eye toward frequency of test use and reporting practices by authors; (2) development of a coding sheet that will capture potential variation in score reliability across studies; (3) procedural recommendations regarding data collection; (4) identification and use of potential dependent variables; and (5) application of general linear model analyses to the data.
A reliability generalization study of the Teacher Efficacy Scale and related instruments
R. K. Henson and L. R. Kogan and T. Vacha-Haase
Educational and Psychological Measurement  61    (2001)

Variability and prediction of measurement error in Kolb's Learning Style Inventory scores: A reliability generalization study
R. K. Henson and D. Hwang
Educational and Psychological Measurement  62    (2002)

Another meta-analysis of the White Racial Identity Attitude Scale's Cronbach alphas: Implications for validity
J. E. Helms
Measurement and Evaluation in Counseling and Development  32  122-137  (1999)

Reliability Generalization of Working Alliance Inventory scale scores
W. E. Hanson and K. T. Curry and D. L. Bandalos
Educational and Psychological Measurement  62    (2002)

Reliability: Arguments for multiple perspectives and potential problems with generalization across studies
D. M. Dimitrov
Educational and Psychological Measurement  62    (2002)

An examination of the reliability of scores from Zuckerman's Sensation Seeking Scales
H. K. Deditius-Island and J. C. Caruso
Educational and Psychological Measurement  62    (2002)

Reliability of scores from the Eysenck Personality Questionnaire: A Reliability Generalization (RG) study
J. C. Caruso and K. Witkiewitz and A. Belcourt-Dittloff and J. Gottlieb
Educational and Psychological Measurement  61  675-682  (2001)

Reliability Generalization of the Junior Eysenck Personality Questionnaire
J. C. Caruso and S. Edwards
Personality and Individual Differences  31  173-184  (2001)

A reliability generalization was conducted on the Psychoticism (P), Extraversion (E), Neuroticism (N), and Lie (L) scales of the Junior Eysenck Personality Questionnaire (J-EPQ). Twenty-three studies provided data on 44 samples of children who had been administered the J-EPQ. Score reliability was found to vary significantly both between and within scales. N and L provided the most reliable scores (with median reliabilities of 0.80 and 0.79, respectively), followed by E (median reliability = 0.73) and P (median reliability = 0.68). Scale length was the best predictor of score reliability, but sample gender makeup, language of administration, and the amount of variation in the ages of children in each sample were also significant predictors of reliability for various J-EPQ scales. The results highlight the importance of considering reliability to be a property of scores for a particular group, as opposed to a property of a test generally.
Reliability Generalization of the NEO personality scales
J. C. Caruso
Educational and Psychological Measurement  60  236-254  (2000)

Myers-Briggs Type Indicator score reliability across studies: A meta-analytic Reliability Generalization study
R. M. Capraro and M. M. Capraro
Educational and Psychological Measurement  62  659-673  (2002)

Effect sizes and F ratios < 1.0
M. C. Voelkle and P. L. Ackerman and W. W. Wittmann
Methodology  3  35-46  (2007)

Standard statistics texts indicate that the expected value of the F ratio is 1.0 (more precisely: N/(N-2)) in a completely balanced fixed-effects ANOVA, when the null hypothesis is true. Even though some authors suggest that the null hypothesis is rarely true in practice (e.g., Meehl, 1990), F ratios < 1.0 are reported quite frequently in the literature. However, standard effect size statistics (e.g., Cohen's f) often yield positive values when F < 1.0, which appears to create confusion about the meaningfulness of effect size statistics when the null hypothesis may be true. Given the repeated emphasis on reporting effect sizes, it is shown that in the face of F < 1.0 it is misleading to only report sample effect size estimates as often recommended. Causes of F ratios < 1.0 are reviewed, illustrated by a short simulation study. The calculation and interpretation of corrected and uncorrected effect size statistics under these conditions are discussed. Computing adjusted measures of association strength and incorporating effect size confidence intervals are helpful in an effort to reduce confusion surrounding results when sample sizes are small. Detailed recommendations are directed to authors, journal editors, and reviewers.
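The point above about uncorrected effect sizes can be illustrated numerically. This is a hedged sketch using the standard one-way ANOVA formulas for eta-squared and omega-squared (function and variable names are my own, not from the paper): the uncorrected estimate stays positive when F < 1.0, while the bias-adjusted estimate goes negative.

```python
# One-way ANOVA effect sizes expressed in terms of the F ratio.
# Standard textbook formulas; illustrative sketch only.

def eta_squared(F, df_between, df_within):
    """Uncorrected sample effect size: df_b*F / (df_b*F + df_w)."""
    return (df_between * F) / (df_between * F + df_within)

def omega_squared(F, df_between, n_total):
    """Bias-adjusted effect size: df_b*(F-1) / (df_b*(F-1) + N).
    Negative whenever F < 1, signalling a plausible true null."""
    num = df_between * (F - 1.0)
    return num / (num + n_total)

# e.g. F = 0.5 with 3 and 96 df (N = 100): eta-squared is still
# positive, but omega-squared is negative.
```

The contrast between the two estimates for the same F < 1.0 is exactly the source of confusion the paper addresses.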
Measurement error of scores on the Mathematics Anxiety Rating Scale across studies.
M. M. Capraro and R. M. Capraro and R. K. Henson
Educational and Psychological Measurement  61  373-386  (2001)

Using mixed-effects models in Reliability Generalization studies
S. N. Beretvas and D. A. Pastor
Educational and Psychological Measurement  62    (2002)

A Reliability Generalization study of the Marlowe-Crowne Social Desirability Scale
S. N. Beretvas and J. L. Meyers and W. L. Leite
Educational and Psychological Measurement  62    (2002)

Reliability Generalization of scores on the Spielberger State-Trait Anxiety Inventory
L. L. B. Barnes and D. Harp and W. S. Jung
Educational and Psychological Measurement  62    (2002)

R2: A computer program for interval estimation, power calculation, and hypothesis testing for the squared multiple correlation
J. H. Steiger and R. T. Fouladi
Behavior Research Methods, Instruments, and Computers  4  581-582  (1992)

Inference by eye: Confidence intervals, and how to read pictures of data
G. Cumming and S. Finch
American Psychologist      (2008)

A primer on the understanding, use and calculation of confidence intervals that are based on central and noncentral distributions
G. Cumming and S. Finch
Educational and Psychological Measurement  61  532-575  (2001)

Approximate confidence intervals for effect sizes
J. Algina and H. J. Keselman
Educational and Psychological Measurement  63  537-553  (2003)

How to estimate and interpret various effect sizes
T. Vacha-Haase and B. Thompson
Journal of Counseling Psychology  51  473-481  (2004)

Complementary methods for research in education
B. Thompson

Research in organizations: Foundational principles, processes, and methods of inquiry
B. Thompson

What future quantitative social science research could look like: Confidence intervals for effect sizes
B. Thompson
Educational Researcher  31  24-31  (2002)

"Statistical," "practical," and "clinical": How many kinds of significance do counselors need to consider?
B. Thompson
Journal of Counseling and Development  80  64-71  (2002)

Evaluating results using corrected and uncorrected effect size estimates
P. Snyder and S. Lawson
Journal of Experimental Education  61  334-349  (1993)

The handbook of research synthesis
R. Rosenthal

Measures of effect size for comparative studies: Applications, interpretations, and limitations
S. Olejnik and J. Algina
Contemporary Educational Psychology  25  241-286  (2000)

Beyond significance testing: Reforming data analysis methods in behavioral research
R. Kline

Handbook of research methods in experimental psychology
R. E. Kirk
    83-105  (2003)

Practical significance: A concept whose time has come
R. Kirk
Educational and Psychological Measurement  56  746-759  (1996)

Higher education: Handbook of theory and research
C. R. Hill and B. Thompson
  19  175-196  (2004)

Effect size for ANOVA designs
J. M. Cortina and H. Nouri

The Concept of Statistical Hypothesis Testing
B. Thompson
Measurement Update  4  5-6  (1994)

Five methodology errors in educational research: The pantheon of statistical significance and other faux pas
B. Thompson

Common methodology mistakes in educational research, revisited, along with a primer on both effect sizes and the bootstrap
B. Thompson

Statistical significance and effect size reporting: Portrait of a possible future
B. Thompson
Research in the Schools  5  33-38  (1998)

A confirmatory factor analysis of the Threat Index
M. K. Moore and R. A. Neimeyer
Journal of Personality and Social Psychology  60  122-129  (1991)

The Threat Index (TI), a measure of death concern grounded in personal construct theory, was submitted to psychometric refinement. The factorability of the TI using the traditional split-match scoring was compared with methods based on Manhattan, Euclidean, standardized Euclidean, and Mahalanobis distance formulas. Statistical and substantive interpretability were enhanced with the standardized Euclidean factor structure. The LISREL VI program was used to determine the best model for the scale in an exploratory factor analysis. A nonhierarchical, G + 3 model met the criterion of goodness of fit >0.9 for the 1st subsample (n = 405). In a confirmatory factor analysis with a 2nd subsample (n = 405), the model was confirmed. Internal consistency and test-retest reliability were acceptable for Global Threat and 3 subfactors--Threat to Well-Being, Uncertainty, and Fatalism--and all subfactors were found to be independent of social desirability.
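For reference, the four distance formulas compared in the abstract above can be sketched as follows. This is an illustrative stdlib-only example; the TI scoring details are not reproduced, and the profiles and standard deviations are made up.

```python
# Distance measures between two score profiles x and y.
import math

def manhattan(x, y):
    """City-block distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Ordinary straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def standardized_euclidean(x, y, sd):
    """Each squared difference is scaled by that variable's SD,
    so variables on larger scales do not dominate the distance."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, sd)))
```

The Mahalanobis distance additionally rotates by the inverse covariance matrix; with a diagonal covariance it reduces to the standardized Euclidean form above.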
Random effects modeling of categorical response data
A. Agresti and J. G. Booth and J. P. Hobert and B. Caffo

Estimation of the parameters of the Birnbaum-Saunders distribution
S. G. From and L. Li
Communications in Statistics -- Theory and Methods  35  2157-2169  (2006)

Variance decomposition using an IRT measurement model
S. M. van den Berg and C. A. W. Glas and D. I. Boomsma
Behavior Genetics  37  604-616  (2007)

A note on multivariate Gauss-Hermite quadrature
P. Jäckel

Conceptual and psychometric framework for distinguishing categories and dimensions
P. D. Boeck and M. Wilson and G. S. Acton
Psychological Review  112  129-158  (2005)

Resampling methods for sample surveys
B. Presnell and J. G. Booth

Numerical integration in logistic-normal models
J. González and F. Tuerlinckx and P. D. Boeck and R. Cools
Computational Statistics & Data Analysis  51  1535-1548  (2006)

Locally dependent latent trait model for polytomous responses with application to inventory of hostility
E. H. Ip and Y. J. Wang and P. D. Boeck
Psychometrika  69  191-216  (2004)

Application of item response theory models for longitudinal data
D. Hedeker and R. J. Mermelstein and B. R. Flay

Confirmatory analyses of componential test structure using multidimensional item response theory
R. Janssen and P. D. Boeck
Multivariate Behavioral Research  34  245-268  (1999)

Fast robust logistic regression for large sparse datasets with binary outputs
P. R. Komarek and A. W. Moore

Factor-analyzing Likert-scale data under the assumption of multivariate normality complicates a meaningful comparison of observed groups or latent classes
G. Lubke and B. Muthén

Models for ordinal hierarchical classes analysis
I. Leenen and I. V. Mechelen and P. D. Boeck
Psychometrika  66  389-404  (2001)

A taxonomy of latent structure assumptions for probability matrix decomposition models
M. Meulders and P. D. Boeck and I. V. Mechelen
Psychometrika  68  61-77  (2003)

Latent variable models for partially ordered responses and trajectory analysis of anger-related feelings
M. Meulders and E. H. Ip and P. D. Boeck
British Journal of Mathematical and Statistical Psychology  58  117-143  (2005)

Bayesian inference for ordinal data using multivariate probit models
E. Lawrence and D. Bingham and C. Liu and V. N. Nair

On choosing a model for measuring
M. Wilson
Methods of Psychological Research Online  8  1-22  (2003)

Analysing social science data with graphical Markov models
N. Wermuth

Generating item responses for balanced-incomplete-block (BIB) design using the generalized partial credit model (GPCM)
B. S. Tay-Lim

Very Simple Structure: An alternative procedure for estimating the optimal number of interpretable factors
W. Revelle and T. Rocklin
Multivariate Behavioral Research  14  403-414  (1979)

The development, calibration, and inferential validation of standards-based assessments for English as a first foreign language at the IQB
A. A. Rupp and M. Vock and C. Harsch

Using patient characteristics and attitudinal data to identify depression treatment preference groups: A latent-class model
J. A. Thacher and E. Morey and W. E. Craighead

Differential item functioning and health assessment
J. Teresi

Nonparametric item response theory with SAS and Stata
J. Hardouin
Journal of Statistical Software      (2007)

Rasch measurement in the assessment of amyotrophic lateral sclerosis patients
J. M. Norquist and R. Fitzpatrick and C. Jenkinson
Journal of Applied Measurement  4  249-257  (2003)

Visions of 70 years of psychometrics: the past, present, and future
P. J. F. Groenen and L. A. van der Ark
Statistica Neerlandica  60  135-144  (2006)

On the analysis of Bayesian semiparametric IRT-type models
E. S. Martin and A. Jara and J. Rolin and M. Mouchart

Bayesian modification indices for IRT models
J. Fox and C. A. W. Glas
Statistica Neerlandica  59  95-106  (2005)

Conditional independence of multivariate binary data with an application in caries research
M. J. Garcia-Zattera and A. Jara and E. Lesaffre and D. Declerck

Using person fit in a body of work standard setting
M. Finkelman and W. Kim

Modèle de Rasch et validation de questionnaires de qualité de vie
A. Hamon

Statistical Methods for Quality of Life Studies. Design, Measurement and Analysis
A. Hamon and M. Mesbah

Intégration de la théorie de la réponse aux items aux modèles par équations structurelles: Comparaison avec une régression fondée sur des scores TRI
D. R. Thomas and I. R. R. Lu and B. D. Zumbo

Multilevel IRT: A Bayesian perspective on estimating parameters and testing statistical hypotheses
J. Fox

Latent class and finite mixture models for multilevel data sets
J. K. Vermunt
Statistical Methods in Medical Research      (2007)

Modeling joint and marginal distributions in the analysis of categorical panel data
J. K. Vermunt and M. F. Rodrigo and M. Ato-Garcia
Sociological Methods and Research  30  170-196  (2001)

The use of restricted latent class models for defining and testing nonparametric and parametric IRT models
J. K. Vermunt
Applied Psychological Measurement  25  283-294  (2001)

Stochastic Ordering of the Latent Trait by the Sum Score Under Various Polytomous IRT Models
L. A. van der Ark
Psychometrika  70  283-304  (2005)

Multilevel IRT using dichotomous and polytomous response data
J. Fox
British Journal of Mathematical and Statistical Psychology  58  145-172  (2005)

The hierarchy consistency index: A person-fit statistic for the attribute hierarchy model
Y. Cui and J. P. Leighton and M. J. Gierl and S. M. Hunka

Outlier detection in test and questionnaire data
W. P. Zijlstra and L. A. van der Ark and K. Sijtsma
Multivariate Behavioral Research      (2007)

Multiple Imputation of Item Scores in Test and Questionnaire Data, and Influence on Psychometric Results
J. R. van Ginkel and L. A. van der Ark and K. Sijtsma
Multivariate Behavioral Research  42  387-414  (2007)

The performance of five simple multiple imputation methods for dealing with missing data was compared. In addition, random imputation and multivariate normal imputation were used as lower and upper benchmarks, respectively. Test data were simulated and item scores were deleted such that they were either missing completely at random, missing at random, or not missing at random. Cronbach's alpha, Loevinger's scalability coefficient H, and the item cluster solution from Mokken scale analysis of the complete data were compared with the corresponding results based on the data including imputed scores. The multiple-imputation methods two-way with normally distributed errors, corrected item-mean substitution with normally distributed errors, and response function produced discrepancies in Cronbach's coefficient alpha, Loevinger's coefficient H, and the cluster solution from Mokken scale analysis that were smaller than the discrepancies produced by the upper benchmark, multivariate normal imputation.
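The two-way imputation method compared in these studies replaces a missing score by person mean + item mean − overall mean, optionally plus a normally distributed error. A minimal sketch of that idea, not the authors' implementation (the function name and `error_sd` parameter are ours; no rounding to the score scale or multiple draws):

```python
import numpy as np

def two_way_impute(scores, error_sd=0.0, seed=0):
    """Impute missing item scores (np.nan) with the two-way model:
    imputed(p, i) = person mean(p) + item mean(i) - overall mean,
    optionally plus a normally distributed error term."""
    X = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    person_means = np.nanmean(X, axis=1, keepdims=True)   # (n_persons, 1)
    item_means = np.nanmean(X, axis=0, keepdims=True)     # (1, n_items)
    overall_mean = np.nanmean(X)
    estimate = person_means + item_means - overall_mean   # broadcasts to full matrix
    noise = rng.normal(0.0, error_sd, size=X.shape)
    return np.where(np.isnan(X), estimate + noise, X)
```

With `error_sd=0` this is deterministic two-way imputation; repeated draws with a nonzero `error_sd` give the "two-way with error" multiple-imputation variant the abstracts refer to.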
Multiple imputation for item scores when test data are factorially complex
J. R. van Ginkel and L. A. van der Ark and K. Sijtsma
British Journal of Mathematical and Statistical Psychology      (2007)

Multiple imputation under a two-way model with error is a simple and effective method that has been used to handle missing item scores in unidimensional test and questionnaire data. Extensions of this method to multidimensional data are proposed. A simulation study is used to investigate whether these extensions produce biased estimates of important statistics in multidimensional data, and to compare them with lower benchmark listwise deletion, two-way with error and multivariate normal imputation. The new methods produce smaller bias in several psychometrically interesting statistics than the existing methods of two-way with error and multivariate normal imputation. One of these new methods clearly is preferable for handling missing item scores in multidimensional test data.
Multilevel modeling of cognitive function in schizophrenic patients and their first degree relatives
S. Rabe-Hesketh and T. Toulopoulou and R. M. Murray
Multivariate Behavioral Research  36  279-298  (2001)

Factor analysis of the Dutch-language version of the MCMI-III
G. Rossi and L. A. van der Ark and H. Sloore
Journal of Personality Assessment  88  144-157  (2007)

Randomized item response theory models
J. Fox
Journal of Educational and Behavioral Statistics  30  1-24  (2005)

Using item response theory to measure extreme response style in marketing research: A global investigation
M. G. D. Jong and J. B. E. M. Steenkamp and J. Fox
Journal of Marketing Research      (2007)

Instability of person misfit and ability estimates subject to assessment modality
A. Petridou and J. Williams

Mathematical methods for survival analysis, reliability and quality of life
J. Hardouin and M. Mesbah

Maximum likelihood estimation of generalized linear model with covariate measurement error
S. Rabe-Hesketh and A. Skrondal and A. Pickles
The Stata Journal  1    (2001)

Modelling Response Error in School Effectiveness Research
J. Fox
Statistica Neerlandica  58  138-160  (2004)

Applications of Multilevel IRT Modeling
J. Fox
School Effectiveness and School Improvement  15  261-280  (2004)

Stochastic EM for Estimating the Parameters of a Multilevel IRT Model
J. Fox
British Journal of Mathematical and Statistical Psychology  56  65-81  (2003)

Bayesian Estimation of a Multilevel IRT Model using Gibbs Sampling
J. Fox and C. A. W. Glas
Psychometrika  66  269-286  (2001)

Bayesian Modeling of Measurement Error in Predictor Variables Using Item Response Theory
J. Fox and C. A. W. Glas

A Dirichlet process mixture model for the analysis of correlated binary responses
A. Jara and M. J. Garcia-Zattera and E. Lesaffre
Computational Statistics \& Data Analysis  51  5402-5415  (2007)

The multivariate probit model is a popular choice for modelling correlated binary responses. It assumes an underlying multivariate normal distribution dichotomized to yield a binary response vector. Other choices for the latent distribution have been suggested, but basically all models assume homogeneity in the correlation structure across the subjects. When interest lies in the association structure, relaxing this homogeneity assumption could be useful. The latent multivariate normal model is replaced by a location and association mixture model defined by a Dirichlet process. Attention is paid to the parameterization of the covariance matrix in order to make the Bayesian computations convenient. The approach is illustrated on a simulated data set and applied to oral health data from the Signal Tandmobiel® study to examine the hypothesis that caries is mainly a spatially local disease.
Analyzing incomplete political science data: An alternative algorithm for multiple imputation
G. King and J. Honaker and A. Joseph and K. Scheve
American Political Science Review  95  49-69  (2001)

Some applications of generalized linear latent and mixed models in epidemiology: Repeated measures, measurement error and multilevel modeling
A. Skrondal and S. Rabe-Hesketh
Norsk Epidemiologi  13  265-278  (2003)

A new statistic to detect misfitting score vectors
J. B. Pornel and L. S. Sotaridona and A. L. Vallejo

Two-way imputation: A Bayesian method for estimating missing scores in tests and questionnaires, and an accurate approximation
J. R. van Ginkel and L. A. van der Ark and K. Sijtsma and J. K. Vermunt
Computational Statistics \& Data Analysis  51  4013-4027  (2007)

Parametrization of multivariate random effects models for categorical data
S. Rabe-Hesketh and A. Skrondal
Biometrics  57  1256-1264  (2001)

Mokken scale analysis using hierarchical clustering procedures
A. A. H. van Abswoude and J. K. Vermunt and B. T. Hemker and L. A. van der Ark
Applied Psychological Measurement  28  332-354  (2004)

A comparative study of test data dimensionality assessment procedures under nonparametric IRT models
A. A. H. van Abswoude and L. A. van der Ark and K. Sijtsma
Applied Psychological Measurement  28  3-24  (2004)

Relationships and properties of polytomous item response theory models
L. A. van der Ark
Applied Psychological Measurement  25  273-282  (2001)

Cross-classification multilevel logistic models in psychometrics
W. V. den Noortgate and P. D. Boeck and M. Meulders
Journal of Educational and Behavioral Statistics  28  369-386  (2003)

A relation between a between-item multidimensional IRT model and the mixture-Rasch model
F. Rijmen and P. D. Boeck
Psychometrika  70  481-496  (2005)

IRT models for ability-based guessing
E. S. Martin and G. del Pino
Applied Psychological Measurement  30  183-203  (2006)

Estimation of the MIRID: A program and a SAS based approach
D. J. M. Smits and P. D. Boeck and N. Verhelst
Behavior Research Methods, Instruments, & Computers  35  537-549  (2003)

Some Mantel-Haenszel tests of Rasch model assumptions
T. Verguts and P. D. Boeck
British Journal of Mathematical & Statistical Psychology  54  21-37  (2001)

A note on the Martin-Löf test for unidimensionality
T. Verguts and P. D. Boeck
Methods of Psychological Research - Online  5  77-82  (2000)

Non-modeled item interactions can lead to distorted discrimination parameters: A case study
F. Tuerlinckx and P. D. Boeck
Methods of Psychological Research - Online  6  159-174  (2001)

Statistical inference in generalized linear mixed models: A review
F. Tuerlinckx and F. Rijmen and G. Verbeke and P. D. Boeck
British Journal of Mathematical & Statistical Psychology  59  225-255  (2006)

Two interpretations of the discrimination parameter
F. Tuerlinckx and P. D. Boeck
Psychometrika  70  629-650  (2005)

In this paper we propose two interpretations for the discrimination parameter in the two-parameter logistic model (2PLM). The interpretations are based on the relation between the 2PLM and two stochastic models. In the first interpretation, the 2PLM is linked to a diffusion model so that the probability of absorption equals the 2PLM. The discrimination parameter is the distance between the two absorbing boundaries and therefore the amount of information that has to be collected before a response to an item can be given. For the second interpretation, the 2PLM is connected to a specific type of race model. In the race model, the discrimination parameter is inversely related to the dependency of the information used in the decision process. Extended versions of both models with person-to-person variability in the difficulty parameter are considered. When fitted to a data set, it is shown that a generalization of the race model that allows for dependency between choices and response times (RTs) is the best-fitting model.
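For reference, the 2PLM that both interpretations attach to is the standard item response function P(θ) = 1 / (1 + exp(−a(θ − b))). A small sketch (conventional parameter names, not this paper's notation):

```python
import math

def p_2pl(theta, a, b):
    """Two-parameter logistic model: probability of a correct response,
    with discrimination parameter a and difficulty parameter b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

A larger a makes the curve steeper around θ = b; the diffusion-model and race-model accounts above are two process-level readings of that steepness.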
H. H. F. M. Verstralen and N. D. Verhelst and T. M. Bechger

The administration of tests via the computer allows the registration of response times along with the actual response. This paper describes a model that combines these two kinds of data to estimate a latent subject variable usually called mental speed, but more appropriately called mental power. The model implies that the expected item score increases with invested time. Nevertheless, it allows for a decreasing expected item score with response time, which is sometimes found in experiments. This paradox is obtained by assuming that a subject not only stops working on a problem because of time pressure, but also when he has solved the problem. The model builds on a familiar framework of IRT models. An MML estimation procedure is developed, and model fit at the item level is evaluated using Lagrange multiplier tests.
A Latent IRT Model for Options of Multiple Choice Items
H. H. F. M. Verstralen

A latent IRT model for the analysis of multiple choice questions is proposed. The incorrect options of an item are associated with a decreasing logistic function that models the probability of being judged correct. It is assumed that the correct option is always recognized as such. According to the model, a subject selects randomly from the subset of options considered correct. Like its companion model treated in Verstralen (1997), this model can be viewed as a generalization of Nedelsky's (1954) method to determine a pass/fail score. It also shares with that model the property that the ML latent variable estimator gains some precision compared to binary scoring, as well as some other favorable psychometric properties.
A Logistic Latent Class Model for Multiple Choice Items
H. H. F. M. Verstralen

A logistic latent class model for the analysis of options of a class of multiple choice items is presented. For each item a set of latent classes with a chain structure is assumed. The probability of latent class membership is modeled by a logistic function. The conditional probability of the observed response, the selection of an option, given the latent class membership is assumed to be constant. The model can be viewed as a generalization of Nedelsky's (1954) method to determine a pass/fail score. Apart from giving a more detailed model on the process of solving a multiple choice item an increase in the precision of latent variable estimates in comparison with binary scoring is achieved. The model is shown to possess some favorable psychometric properties.
A Selection Procedure for Polytomous Items in Computerized Adaptive Testing
P. W. van Rijn and T. J. H. M. Eggen and B. T. Hemker and P. F. Sanders

In the present study, a procedure which was developed to select dichotomous items in computerized adaptive testing was applied to polytomous items. The aim of this procedure is to select the item with maximum weighted information. In a simulation study, the item information function was integrated over a fixed interval of ability values and the item with the maximum area was selected. This maximum interval information item selection procedure was compared to a maximum point information item selection procedure. No substantial differences between the two item selection procedures were found when computerized adaptive tests were evaluated on bias and root mean square of the ability estimate.
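The maximum-interval-information rule in this abstract can be sketched for the dichotomous 2PL case (the study itself used polytomous items; the item parameters, grid size, and explicit trapezoidal integration below are our simplification):

```python
import numpy as np

def item_info(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P(theta) * (1 - P(theta))."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def select_interval_info(items, lo, hi, n_grid=101):
    """Pick the item whose information, integrated over [lo, hi] with the
    trapezoidal rule, is largest. items: list of (a, b) tuples."""
    grid = np.linspace(lo, hi, n_grid)
    best, best_area = -1, -np.inf
    for k, (a, b) in enumerate(items):
        y = item_info(grid, a, b)
        area = float(np.sum((y[1:] + y[:-1]) * np.diff(grid) / 2.0))  # trapezoids
        if area > best_area:
            best, best_area = k, area
    return best
```

Maximum point information would instead evaluate `item_info` at the current ability estimate only; the abstract reports that the two rules performed comparably on bias and root mean square.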
About the Cluster Kappa Coefficient
T. M. Bechger and B. T. Hemker and G. K. J. Maris

The cluster kappa was proposed by Schouten (1982) as a measure of chance-corrected rater agreement suitable for studies where objects are rated on a categorical scale by two or more judges. We discuss a way to calculate the cluster kappa that is suitable even when ratings are missing. Further, we demonstrate how the sampling error of the cluster kappa may be estimated.
J. van Ruitenburg

An Approximation of Cronbach's α and its Use in Test Assembly
A. J. Verschoor

In this paper a new approximation of Cronbach's α is presented. It is especially suited to the context of test assembly. Using this approximation, two test assembly models are introduced. Being non-linear models, they are solved by Genetic Algorithms, as the commonly used Linear Programming methods cannot be used here. A comparison is made with existing test assembly models.
G. Maris and T. M. Bechger

Are attitude items monotone or single peaked? An analysis using Bayesian methods
G. Maris

Clustering Nominal Data with Equivalent Categories: a Simulation Study Comparing Restricted GROUPALS and Restricted Latent Class Analysis
M. Hickendorff

Combining classical test theory and item response theory
T. Bechger and G. Maris and A. Béguin and H. Verstralen

Comparison of Test Administration Procedures for Placement Decisions in a Mathematics Course
G. J. J. M. Straetmans and T. J. H. M. Eggen

In this study, three different test administration procedures for making placement decisions in adult education were compared: a paper-based test (PBT), a computer-based test (CBT), and a computerized adaptive test (CAT). All tests were prepared from an item response theory calibrated item bank. The subjects were 90 volunteer students from three adult education schools. They were randomly assigned to one of six experimental groups to take two tests which differed in mode of administration. The results indicate that test performance was not differentially affected by the mode of administration and that the CAT always yielded more accurate ability estimates than the two other test administration procedures. The CAT was also found to be capable of making placement decisions with a test that was on average 24% shorter.
Computerized Adaptive Testing: What It Is and How It Works
G. J. J. M. Straetmans and T. J. H. M. Eggen

Concerning the identification of the 3PL model
G. Maris

Effect of Noncompensatory Multidimensionality on Separate and Concurrent Estimation in IRT Observed Score Equating
A. A. Béguin and B. A. Hanson

In this article, the results of a simulation study comparing the performance of separate and concurrent estimation of a unidimensional item response theory (IRT) model applied to multidimensional noncompensatory data are reported. Data were simulated according to a two-dimensional noncompensatory IRT model for both equivalent and nonequivalent groups designs. The criteria used were the accuracy of estimating a distribution of observed scores, and the accuracy of IRT observed score equating. In general, unidimensional concurrent estimation resulted in lower or equivalent total error than separate estimation, although there were a few cases where separate estimation resulted in slightly less error than concurrent estimation. Estimates from the correctly specified multidimensional model generally resulted in less error than estimates from the unidimensional model. The results of this study, along with results from a previous study where data were simulated using a compensatory multidimensional model, make clear that multidimensionality of the data affects the relative performance of separate and concurrent estimation, although the degree to which the unidimensional model produces biased results with multidimensional data depends on the type of multidimensionality present.
Equivalent Linear Logistic Test Models
T. M. Bechger and H. H. F. M. Verstralen and N. D. Verhelst

This paper is about the Linear Logistic Test Model (LLTM). We demonstrate that there are infinitely many equivalent ways to specify a model. An implication is that there may well be many ways to change the specification of a given LLTM and achieve the same improvement in model fit. To illustrate this phenomenon we analyze a real data set using a Lagrange multiplier test for the specification of the model.
Equivalent MIRID models
G. Maris and T. Bechger

Estimating the Reliability of a Test from a Single Test Administration
N. D. Verhelst

The article discusses methods of estimating the reliability of a test from a single test administration. In the first part a review of existing indices is given, supplemented with two heuristics to approximate Guttman's λ4 and a new similar coefficient. Special attention is given to the greatest lower bound, to its meaning as well as to the problems in computing it. In the second part the relation between Cronbach's α and the reliability is studied by means of a factorial model for the item scores. This part gives some useful formulae to appreciate the amount by which the reliability is underestimated when α is used as its estimate. In the last part, the sampling distribution of the indices is investigated by means of two simulation studies, showing that the indices exhibit severe bias, the direction of which depends partly on the factorial structure of the test. For three indices the bias is modeled. The model describes the bias accurately for all cases studied in the simulation studies. It is shown how this bias correction may be applied in the case of a single data set.
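The coefficient α discussed throughout this abstract has the familiar sample formula α = k/(k−1) · (1 − Σᵢ s²ᵢ / s²ₓ), with s²ᵢ the item score variances and s²ₓ the variance of the total score. A minimal sketch of that standard formula (not the article's bias correction):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a persons x items matrix of item scores."""
    X = np.asarray(scores, dtype=float)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1).sum()   # sum of item score variances
    total_var = X.sum(axis=1).var(ddof=1)     # variance of the total score
    return k / (k - 1) * (1.0 - item_vars / total_var)
```

As the article stresses, this sample statistic generally underestimates the reliability and can be severely biased in small samples, which is what the factorial model in the second part quantifies.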
Explorations in recursive designs
H. Verstralen

Starting from a set of basic designs, more complex designs are created by recursive application of the basic designs. Properties of these designs, and their effects on the accuracy of Rasch CML-parameter estimates are investigated.
G. Maris

Identifiability of Non-Linear Logistic Test Models
T. M. Bechger and N. D. Verhelst and H. H. F. M. Verstralen

The linear logistic test model (LLTM) specifies the item parameters as a weighted sum of basic parameters. The LLTM is a special case of a more general non-linear logistic test model (NLTM) where the weights are partially unknown. This paper is about the identifiability of the NLTM. Sufficient and necessary conditions for global identifiability are presented for a NLTM where the weights are linear functions, while conditions for local identifiability are shown to require less assumptions. It is also discussed how these conditions are checked using an algorithm due to Bekker, Merckens, and Wansbeek (1994). Several illustrations are given.
Infeasibility in Automated Test Assembly Models: A Comparison Study of Different Methods
H. A. Huitzing and B. P. Veldkamp and A. J. Verschoor

Several techniques exist to automatically put together a test meeting a number of specifications. In an item bank, the items are stored with their characteristics. A test is constructed by selecting a set of items that fulfills the specifications set by the test assembler. Test assembly problems are often formulated in terms of a model consisting of restrictions and an objective to be maximized or minimized. A problem arises when it is impossible to construct a test from the item pool that meets all specifications, that is, when the model is not feasible. Several methods exist to handle these infeasibility problems. In this paper, test assembly models resulting from two practical testing programs were reconstructed to be infeasible. These models were analyzed using methods that either forced a solution (Goal programming, Multiple-Goal programming, Greedy Heuristic), that analyzed the causes (Relaxed and Ordered Deletion Algorithm, Integer Randomized Deletion Algorithm, Set Covering and Item Sampling), or that analyzed the causes and used this information to force a solution (Irreducible-Infeasible-Set Solver). Specialized methods like the Integer Randomized Deletion Algorithm, and the Irreducible-Infeasible-Set-Solver performed best. Recommendations about the use of different methods are given.
IRT models for subjective weights of options of multiple choice questions.
H. H. F. M. Verstralen and N. D. Verhelst

From earlier investigations it was found that the information from Multiple Choice (MC) questions could be increased about fourfold by having the subject indicate the subset of options that he is unable to expose as false. In the present models this approach is generalized by having the subject distribute a number of 'taws' over the options, or draw a line after the options, such that the number of taws given to an option, or the line length, reflects its subjective degree of correctness. It appears that even with values of the relevant parameters that seem modest, the information relative to binary scoring is still in excess of two. This means that with less than half the test length the same accuracy or reliability can be obtained as with binary scoring. With a real data set we found a relative information greater than five. If a few main fallacies can be reflected in the distractors of the items, the model can be applied to identify subjects with one of these fallacies.
IRT Test Assembly Using Genetic Algorithms
A. J. Verschoor

This paper introduces a new class of optimisation methods in test assembly: Genetic Algorithms (GAs). In the first part an overview is given of the concepts and principles of GAs; in the second part they are applied to three commonly used test assembly models using Item Response Theory. Simulation studies are performed in order to find conditions under which GAs can be successfully used.
Item Selection in Adaptive Testing with the Sequential Probability Ratio Test
T. J. H. M. Eggen

Computerized adaptive tests (CATs) were originally developed to obtain an efficient estimate of an examinee's ability. For classification problems, applications of the Sequential Probability Ratio Test (Wald, 1947) have been shown to be a promising alternative to testing algorithms based on statistical estimation. However, the item selection methods currently used in these algorithms, which draw inferences about examinees through statistical testing, are either random or based on a criterion related to optimizing estimates of examinees (maximum (Fisher) information). In this study, an item selection method based on Kullback-Leibler information is presented, which is theoretically more suitable for statistical testing problems and which can improve the testing algorithm for classification problems. Simulation studies were conducted for two- and three-way classification problems, in which item selection based on Fisher information and Kullback-Leibler information were compared. The results of these studies showed that the performance of the testing algorithms with Kullback-Leibler information-based item selection is sometimes better and never worse than that of algorithms with Fisher information-based item selection.
Loss of Information in Estimating Item Parameters in Incomplete Designs
T. J. H. M. Eggen and N. D. Verhelst

In this paper, the efficiency of conditional maximum likelihood (CML) and marginal maximum likelihood (MML) estimation of the item parameters of the Rasch model in incomplete designs is studied. The use of the concept of F-information (Eggen, 2000) is generalized to incomplete testing designs. The standardized determinant of the F-information matrix is used as a scalar measure of information in a set of item parameters. In this paper, the relation between the normalization of the Rasch model and this determinant is clarified. It is shown that comparing estimation methods with the defined information efficiency is independent of the chosen normalization. In examples, information comparisons are conducted. It is found that for both CML and MML some information is lost in all incomplete designs compared to complete designs. A general trend is that, with increasing test booklet length, the efficiency of an incomplete design relative to a complete design, and also the efficiency of CML compared to MML, increases. The main difference between CML and MML is seen in relation to the length of the test booklet. It will be demonstrated that with very small booklets, there is a substantial loss in information (about 35%) with CML estimation, while this loss is only about 10% in MML estimation. However, with increasing test length, the differences between CML and MML quickly disappear.
Modeling Sums of Binary Responses by the Partial Credit Model
N. D. Verhelst and H. H. F. M. Verstralen

The Partial Credit Model (PCM) is sometimes interpreted as a model for stepwise solution of polytomously scored items, where the item parameters are interpreted as difficulties of the steps. It is argued that this interpretation is not justified. A model for stepwise solution is discussed. It is shown that the PCM is suited to model sums of binary responses which are not supposed to be stochastically independent. As a practical result, a statistical test of stochastic independence in the Rasch model is derived.
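The PCM under discussion assigns category probabilities P(X = x | θ) ∝ exp(Σ_{j≤x} (θ − δ_j)), with the empty sum taken as 0. A small sketch of that standard formula (conventional notation, not this paper's):

```python
import math

def pcm_probs(theta, deltas):
    """Partial credit model category probabilities for one item.
    deltas: step parameters delta_1..delta_m; returns [P(X=0), ..., P(X=m)]."""
    logits = [0.0]                        # cumulative sums of (theta - delta_j)
    for d in deltas:
        logits.append(logits[-1] + (theta - d))
    denom = sum(math.exp(v) for v in logits)
    return [math.exp(v) / denom for v in logits]
```

With a single step parameter the model reduces to the dichotomous Rasch model, which fits the abstract's reading of the PCM as a model for sums of (possibly dependent) binary responses rather than for sequential steps.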
On Measurement Properties of Continuation Ratio Models
B. T. Hemker and L. A. van der Ark and K. Sijtsma

Three classes of polytomous IRT models are distinguished. These classes are the adjacent category models, the cumulative probability models, and the continuation ratio models. So far, the latter class has received relatively little attention. The class of continuation ratio models includes logistic models, such as the sequential model (Tutz, 1990), and non-logistic models, such as the acceleration model (Samejima, 1995) and the nonparametric sequential model (Hemker, 1996). Four measurement properties are discussed. These are monotone likelihood ratio of the total score, stochastic ordering of the latent trait by the total score, stochastic ordering of the total score by the latent trait, and invariant item ordering. These properties have been investigated previously for the adjacent category models and the cumulative probability models, and for the continuation ratio models this is done here. It is shown that stochastic ordering of the total score by the latent trait is implied by all continuation ratio models, while monotone likelihood ratio of the total score and stochastic ordering on the latent trait by the total score are not implied by any of the continuation ratio models. Only the sequential rating scale model implies the property of invariant item ordering. Also, we present a Venn-diagram showing the relationships between all known polytomous IRT models from all three classes.
On the Loss of Information in Conditional Maximum Likelihood Estimation of Item Parameters
T. J. H. M. Eggen

In item response models of the Rasch type (Fischer & Molenaar, 1995), item parameters are often estimated by the conditional maximum likelihood (CML) method. This paper addresses the loss of information in CML estimation by using the information concept of F-information (Liang, 1983). This concept makes it possible to specify the conditions for no loss of information and to define a quantification of information loss. For the dichotomous Rasch model, the derivations will be given in detail to show the use of the F-information concept for making efficiency comparisons for different estimation methods. It is shown that by using CML for item parameter estimation, some information is almost always lost. But compared to JML (joint maximum likelihood) as well as to MML (marginal maximum likelihood) the loss is very small. The reported efficiency of CML to JML and to MML in several comparisons is always larger than 93%, and in tests with a length of 20 items or more, larger than 99%.
Optimal Testing With Easy or Difficult Items in Computerized Adaptive Testing
T. J. H. M. Eggen and A. J. Verschoor

Computerized adaptive tests (CATs) are individualized tests which, from a measurement point of view, are optimal for each individual, possibly under some practical conditions. In the present study it is shown that maximum information item selection in CATs, using an item bank which is calibrated with the one- or the two-parameter logistic model, results in each individual answering about 50% of the items correctly. Two item selection procedures giving easier (or more difficult) tests for students are presented and evaluated. Item selection on probability points of items yields good results only with the 1PL model and not with the 2PL model. An alternative selection procedure, based on maximum information at a shifted ability level, gives satisfactory results with both models.
Overexposure and underexposure of items in computerized adaptive testing
T. J. H. M. Eggen

Computerized adaptive tests (CATs) have been shown to be considerably more efficient than paper-and-pencil tests. This gain is realized by offering each candidate the most informative item from an available item bank on the basis of the results of items that have already been administered. The item selection methods that are used to compose an optimum test for each individual do, however, have a number of drawbacks. Though a CAT generally presents each candidate with a different test, it often occurs that some items from the item bank are administered very frequently while others are never or hardly ever used. These two problems, i.e., overexposure and underexposure of items, can be eliminated by adding further restrictions to the item selection methods. However, this exposure control will affect the efficiency of the CAT. This paper presents a solution for both problems. The functioning of these methods will be illustrated with the results of simulation research that has been carried out to develop adaptive tests.
Preferences for various learning environments: Teachers' and parents' perceptions
E. C. Roelofs and J. J. C. M. Visser

In the last ten years, a number of innovations, mainly inspired by constructivist notions of learning, have been introduced at various levels of the Dutch educational system. However, constructivist learning environments are rarely implemented: teachers tend to stick to expository and structured learning environments. This consistent finding calls for research into teachers' preferences for learning environments and into the factors that support and impede the realization of these learning environments. Given the influence of social backgrounds on student learning, it is also important to take stock of parental views on learning environments. This study focuses on teachers' preferences for learning environments, their reported teaching behavior, and how these match parents' preferences. Three parallel questionnaires were developed for teachers (n=281), students (n=952), and parents (n=717), measuring preferences and behavior at different levels of education for three types of learning environments: direct instruction, discovery learning, and authentic pedagogy. The results show that teachers often prefer direct instruction and seldom promote discovery learning. While teachers sometimes realize authentic pedagogy, constructive learning tasks are seldom used. Teachers' reported practice and parents' preferences for their children appear to correspond reasonably well. Results of multiple regression analyses show that the use of the three types of learning environments yields different predictors. For the use of discovery learning and authentic pedagogy, confidence in students' regulative skills is an important predictor. In predicting the use of direct instruction, the teacher's own conception of learning turns out to be an important predictor.
T. M. Bechger and G. Maris

This paper is about the structural equation modelling of quantitative measures that are obtained from a multiple facet design. A facet is simply a set consisting of a finite number of elements. It is assumed that measures are obtained by combining each element of each facet. Methods and traits are two such facets, and a multitrait-multimethod study is a two-facet design. We extend models that were proposed for multitrait-multimethod data by Wothke (1984;1996) and Browne (1984, 1989, 1993), and demonstrate how they can be fitted using standard software for structural equation modelling. Each model is derived from the model for individual measurements in order to clarify the first principles underlying each model.
Testing the unidimensionality assumption of the Rasch model
N. Verhelst

Statistical tests especially designed to test the unidimensionality axiom of the Rasch model are scarce. For two of them, the Martin-Löf test (ML-test) and the splitter-item technique, an extensive power analysis has been carried out, clearly showing the superiority of the ML-test. The disadvantage of the ML-test, however, is that its null distribution deviates strongly from the asymptotic chi-square distribution unless one has huge samples. A new test with one degree of freedom is proposed. Its power is superior to that of the ML-test, and its null distribution converges rapidly to the chi-square distribution.
The Combined Use of Classical Test Theory and Item Response Theory
H. Verstralen and T. Bechger and G. Maris

The present paper concerns a number of relations between concepts from classical test theory (CTT), such as reliability, and concepts from item response theory (IRT). It is demonstrated that the use of IRT models allows us to extend the range of applications of CTT and to investigate relations among concepts that are central to CTT, such as reliability and item-test correlation.
The componential Nedelsky model: A first exploration
T. Bechger and G. Maris

The Nedelsky model for multiple choice items
T. Bechger and G. Maris and H. Verstralen and N. Verhelst

Two methods for the practical analysis of rating data
G. Maris and T. Bechger

Some Mantel-Haenszel tests of Rasch model assumptions
T. Verguts and P. D. Boeck
British Journal of Mathematical and Statistical Psychology  54  21-37  (2001)

A note on the Martin-Löf test for unidimensionality
T. Verguts and P. D. Boeck
Methods of Psychological Research Online  5    (2000)

Classical test theory versus Rasch analysis for quality of life questionnaire reduction
L. Prieto and J. Alonso and R. Lamarca
Health and Quality of Life Outcomes  1    (2003)

Evaluating the dimensionality of the Michigan English Language Assessment Battery
H. Jiao
  2  27-52  (2004)

Nonlinear effects in generalized latent variable models
D. Rizopoulos

A Rasch and factor analysis of the Functional Assessment of Cancer Therapy-General (FACT-G)
A. B. Smith and P. Wright and P. J. Selby and G. Velikova
Health and Quality of Life Outcomes  5    (2007)

Critical eigenvalue sizes in standardized residual principal components analysis
G. Raîche
Rasch Measurement Transactions  19  1012  (2005)

Rasch analysis of the dimensional structure of the hospital anxiety and depression scale
A. B. Smith and E. P. Wright and R. Rush and D. P. Stark and G. Velikova and P. J. Selby
Psycho-Oncology      (2005)

Méthodes d'étude de l'adéquation au modèle logistique à un paramètre (modèle de Rasch)
A. Flieller
Mathématiques et Sciences Humaines  127  19-47  (1994)

Critical issues to address when applying Item Response Theory (IRT) models
M. Orlando

Fitting a response model for n dichotomously scored items
R. D. Bock and M. Lieberman
Psychometrika  35  179-197  (1970)

Assessment of reliability when test items are not essentially tau-equivalent
G. Socan

Exploring monotonicity in polytomous item response data
B. W. Junker

Nonparametric IRT in Action: An overview of the special issue
B. W. Junker and K. Sijtsma

An approach to multidimensional item response modeling
M. Linardakis and P. Dellaportas

Monotonicity and conditional independence in models for student assessment and attitude measurement
B. W. Junker

Measurement Decision Theory
L. M. Rudner

Maximal reliability for unit-weighted composites
P. M. Bentler

A survey of theory and methods of invariant item ordering
K. Sijtsma and B. W. Junker

Using logistic regression and the Mantel-Haenszel with multiple ability estimates to detect Differential Item Functioning
K. M. Mazor and A. Kanjee and B. E. Clauser
Journal of Educational Measurement  32  131-144  (1995)

When do item response function and Mantel-Haenszel definitions of Differential Item Functioning coincide?
R. Zwick
Journal of Educational Statistics  15  185-197  (1990)

The asymptotic bias of minimum trace factor analysis, with applications to the greatest lower bound to reliability
A. Shapiro and J. M. F. T. Berge
Psychometrika  65  413-425  (2000)

An empirical comparison of coefficient alpha, Guttman's lambda-2, and MSPLIT maximized split-half reliability estimates
J. C. Callender and H. G. Osburn
Journal of Educational Measurement  16  89  (1979)

Algorithms for computerized test construction using classical item parameters
J. J. Adema and W. J. van der Linden
Journal of Educational Statistics  14  279-290  (1989)

Optimization of classical reliability in test construction
R. D. Armstrong and D. H. Jones
Journal of Educational and Behavioral Statistics  23  1-17  (1998)

La simulation d'un test adaptatif basé sur le modèle de Rasch
G. Raîche

The role of long-term memory in text comprehension
W. Kintsch and V. L. Patel and K. A. Ericsson
Psychologia  42  186-198  (1999)

Assessing leadership style: A trait analysis
M. G. Hermann

Latent semantic indexing: A probabilistic analysis
C. H. Papadimitriou and P. Raghavan and H. Tamaki and S. Vempala

Indexing by latent semantic analysis
S. Deerwester and S. T. Dumais and G. W. Furnas and T. K. Landauer and R. Harshman
Journal of the American Society for Information Science  41  391-407  (1990)

Principal component analysis with binary data. Applications to roll-call analysis
J. de Leeuw

Probabilistic non-linear principal component analysis with Gaussian process latent variable models
N. Lawrence
Journal of Machine Learning Research  6  1783-1816  (2005)

Objective scoring for computing competition tasks
G. Kemkes and T. Vasiga and G. Cormack

Computerized adaptive testing
T. Eggen

Sensitivity in metric scaling and analysis of distance
W. J. Krzanowski
Biometrics  62  239-244  (2006)

Probabilistic latent semantic analysis
T. Hofmann

Improving probabilistic latent semantic analysis with principal component analysis
A. Farahat and F. Chen

Un système d'observation et d'analyse en direct de séances d'enseignement
E. Allègre and P. Dessus

Using latent semantic analysis to assess knowledge: Some technical considerations
B. Rehder and M. E. Schreiner and M. B. W. Wolfe and D. Laham

Learning from text: Matching readers and texts by latent semantic analysis
M. B. W. Wolfe and M. E. Schreiner and B. Rehder and D. Laham
Discourse Processes  25  309-336  (1998)

The measurement of textual coherence with latent semantic analysis
P. W. Foltz and W. Kintsch and T. K. Landauer
Discourse Processes  25  285-307  (1998)

An introduction to latent semantic analysis
T. K. Landauer and P. W. Foltz and D. Laham
Discourse Processes  25  259-284  (1998)

Psychometric analyses based on evidence-centered design and cognitive science of learning to explore students' problem-solving in physics
C. Huang

The distinctive language of terrorists
E. Lazarevska and J. M. Sholl and M. Young

A handbook on the theory and methods of Differential Item Functioning (DIF)
B. D. Zumbo

Bayesian inference for categorical data analysis
A. Agresti and D. Hitchcock
Statistical Methods and Applications (Journal of the Italian Statistical Society)      (2005)

Strategies for controlling item exposure in computerized adaptive testing with polytomously scored items
L. L. Davis

How Well Can Passage Meaning be Derived without Using Word Order? A Comparison of Latent Semantic Analysis and Humans
T. K. Landauer and D. Laham and B. Rehder and M. E. Schreiner

From paragraph to graph: latent semantic analysis for information visualization
T. K. Landauer and D. Laham and M. Derr
Proceedings of the National Academy of Sciences USA  101  5214-5219  (2004)

Most techniques for relating textual information rely on intellectually created links such as author-chosen keywords and titles, authority indexing terms, or bibliographic citations. Similarity of the semantic content of whole documents, rather than just titles, abstracts, or overlap of keywords, offers an attractive alternative. Latent semantic analysis provides an effective dimension reduction method for the purpose that reflects synonymy and the sense of arbitrary word combinations. However, latent semantic analysis correlations with human text-to-text similarity judgments are often empirically highest at approximately 300 dimensions. Thus, two- or three-dimensional visualizations are severely limited in what they can show, and the first and/or second automatically discovered principal component, or any three such for that matter, rarely capture all of the relations that might be of interest. It is our conjecture that linguistic meaning is intrinsically and irreducibly very high dimensional. Thus, some method to explore a high dimensional similarity space is needed. But the $2.7 \times 10^7$ projections and infinite rotations of, for example, a 300-dimensional pattern are impossible to examine. We suggest, however, that the use of a high dimensional dynamic viewer with an effective projection pursuit routine and user control, coupled with the exquisite abilities of the human visual system to extract information about objects and from moving patterns, can often succeed in discovering multiple revealing views that are missed by current computational algorithms. We show some examples of the use of latent semantic analysis to support such visualizations and offer views on future needs.
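The dimension-reduction step this abstract builds on can be sketched minimally with a truncated SVD. The toy term-document matrix and function names below are invented for illustration; real applications use large corpora and, as the abstract notes, on the order of 300 dimensions.

```python
import numpy as np

def lsa_space(term_doc, k):
    """Project documents of a term-document count matrix into a
    k-dimensional latent semantic space via truncated SVD."""
    u, s, vt = np.linalg.svd(np.asarray(term_doc, dtype=float),
                             full_matrices=False)
    # Keep the k largest singular values; each row is one document's coordinates.
    return (s[:k, None] * vt[:k, :]).T

def cosine(x, y):
    """Cosine similarity, the usual text-to-text comparison in LSA."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
```

Documents sharing vocabulary land close together in the reduced space, while documents with disjoint vocabularies come out orthogonal; with richer corpora, LSA also places synonym-sharing documents near each other even without literal word overlap.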
Modélisation des processus de hiérarchisation et d'application de macrorègles et conception d'un prototype d'aide au résumé
M. Bianco and P. Dessus and B. Lemaire and S. Mandin and P. Mendelsohn

Latent semantic analysis approaches to categorization
D. Laham
    979  (1997)

L'analyse sémantique latente et l'identification des métaphores
Y. Bestgen and A. Cabiaux

A Procedure for Estimating Intrasubject Behavior Consistency
J. M. Hernández and V. J. Rubio and J. Revuelta and J. Santacreu
Educational and Psychological Measurement  66  417-434  (2006)

Trait psychology implicitly assumes consistency of personal traits. Mischel, however, argued against the idea of a general consistency of human beings. The present article aims to design a statistical procedure, based on an adaptation of the $\pi^*$ statistic, to measure the degree of intraindividual consistency independently of the measure used. Three studies were carried out to test the suitability of the $\pi^*$ statistic and the proportion of subjects who act consistently. Results showed the appropriateness of the proposed statistic, and that the percentage of consistent individuals depends on whether the test items can be assumed to be equivalent and on the number of response alternatives they contain. The results suggest that the percentage of consistent subjects is far from 100%, and that this percentage decreases when items are equivalent. Moreover, the greater the number of response options, the smaller the percentage of consistent individuals.
Analysis of distractor difficulty in Multiple-Choice items
J. Revuelta
Psychometrika  69  217-234  (2004)

Two psychometric models are presented for evaluating the difficulty of the distractors in multiple-choice items. They are based on the criterion of rising distractor selection ratios, which facilitates interpretation of the subject and item parameters. Statistical inferential tools are developed in a Bayesian framework: modal a posteriori estimation by application of an EM algorithm and model evaluation by monitoring posterior predictive replications of the data matrix. An educational example with real data is included to exemplify the application of the models and compare them with the nominal categories model.
An ANOVA-like Rasch analysis of differential item functioning
W. Wang

Une étude de l'accord et de la fidélité inter juges comparant un modèle de la théorie de la généralisabilité et un modèle de la famille de Rasch
J. Blais and N. Loye

Objective measurement, Theory into practice
G. Raîche and J. Blais
  6    (2002)

Ideal point estimation with a small number of votes: A random-effects approach
M. Bailey
Political Analysis  9  192-210  (2001)

Contributions à une méthodologie de comparaison de partitions
G. Youness

Practical questions in introducing computerized adaptive testing for K-12 assessments
W. D. Way and L. L. Davis and S. Fitzpatrick

A generalized linear model for principal component analysis of binary data
A. I. Schein and L. K. Saul and L. H. Ungar

Construction d'échelles d'items unidimensionnelles en qualité de vie
J. Hardouin

Graphical models for panel studies, illustrated on data from the framingham heart study
J. P. Klein and N. Keiding and S. Kreiner

A visual guide to item response theory
I. Partchev

Statistical test theory for education and psychology
D. N. M. de Gruijter and L. J. T. van der Kamp

Explanatory Item Response Models: a Generalized Linear and Nonlinear Approach
P. D. Boeck and M. Wilson

A nonlinear mixed model framework for item response theory
F. Rijmen and F. Tuerlinckx and P. D. Boeck and P. Kuppens
Psychological Methods  8  185-205  (2003)

Mixed models take the dependency between observations based on the same cluster into account by introducing one or more random effects. Common item response theory (IRT) models introduce latent person variables to model the dependence between responses of the same participant. Assuming a distribution for the latent variables, these IRT models are formally equivalent to nonlinear mixed models. It is shown how a variety of IRT models can be formulated as particular instances of nonlinear mixed models. The unifying framework offers the advantage that relations between different IRT models become explicit and that it is rather straightforward to see how existing IRT models can be adapted and extended. The approach is illustrated with a self-report study on anger.
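The equivalence this abstract describes can be made concrete in the simplest case: the Rasch model is exactly a logistic mixed model with one random person intercept and fixed item effects (the notation below is the standard one, not taken from the paper).

```latex
% Rasch model as a logistic mixed model: theta_p is a random person
% effect, beta_i a fixed item effect. Models with item discriminations
% make the predictor nonlinear in the parameters.
\Pr(Y_{pi} = 1 \mid \theta_p)
  = \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)},
\qquad \theta_p \sim N(0, \sigma^2).
```

More complex IRT models add discriminations or person and item covariates within the same mixed-model framework.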
A multilevel bayesian item response theory method for scaling socioeconomic status in international studies of education
H. May
Journal of Educational and Behavioral Statistics  31  63-79  (2006)

A new method is presented and implemented for deriving a scale of socioeconomic status (SES) from international survey data using a multilevel Bayesian item response theory (IRT) model. The proposed model incorporates both international anchor items and nation-specific items and is able to (a) produce student family SES scores that are internationally comparable, (b) reduce the influence of irrelevant national differences in culture on the SES scores, and (c) effectively and efficiently deal with the problem of missing data in a manner similar to Rubin's (1987) multiple imputation approach. The results suggest that this model is superior to conventional models in terms of its fit to the data and its ability to use information collected via international surveys.
The concept of validity
D. Borsboom and G. J. Mellenbergh and J. van Heerden
Psychological Review  111  1061-1071  (2004)

This article advances a simple conception of test validity: A test is valid for measuring an attribute if (a) the attribute exists and (b) variations in the attribute causally produce variation in the measurement outcomes. This conception is shown to diverge from current validity theory in several respects. In particular, the emphasis in the proposed conception is on ontology, reference, and causality, whereas current validity theory focuses on epistemology, meaning, and correlation. It is argued that the proposed conception is not only simpler but also theoretically superior to the position taken in the existing literature. Further, it has clear theoretical and practical implications for validation research. Most important, validation research must not be directed at the relation between the measured attribute and other attributes but at the processes that convey the effect of the measured attribute on the test scores.
Validity and assessment: a rasch measurement perspective
T. G. Bond
Metodologia de las Ciencias del Comportamiento  5  179-194  (2003)

This paper argues that the Rasch model, unlike the other models generally referred to as IRT models, and those that fall into the tradition of True Score models, encompasses a set of rigorous prescriptions for what scientific measurement would be like if it were to be achieved in the social sciences. As a direct consequence, the Rasch measurement approach to the construction and monitoring of variables is sensitive to the issues raised in Messick's (1995) broader conception of construct validity. The theory/practice dialectic (Bond & Fox, 2001) ensures that validity is foremost in the mind of those developing measures and that genuine scientific measurement is foremost in the minds of those who seek valid outcomes from assessment. Failures of invariance, such as those referred to as DIF, should alert researchers to the need to modify assessment procedures or the substantive theory under investigation, or both.