# psychometrics.bib

Dings:2002
The effects of matrix sampling on student score comparability in constructed-response and multiple-choice assessments

(2002)

Thomas:2002b
EMBEDDING IRT IN STRUCTURAL EQUATION MODELS: A COMPARISON WITH REGRESSION BASED ON IRT SCORES

(2002)

Thamerus:1996
Fitting a finite mixture distribution to a variable subject to heteroscedastic measurement error

(1996)

Yamamoto:1999
Scaling Methodology and Procedures for the TIMSS Mathematics and Science Scales

(1999)

Yeh:2007
Using Trapezoidal Rule for the Area Under a Curve Calculation

(2007)

Hardouin:2007b
The SAS Macro-Program %AnaQol to Estimate the Parameters of Item Responses Theory Models

Communications in Statistics - Simulation and Computation  36  437-453  (2007)

Fox:2007a
Multilevel IRT Model Assessment

(2007)

Fox:2007
Modeling Measurement Error in Structural Multilevel Models

(2007)

Ark:2005b
The Effect of Missing Data Imputation on Mokken Scale Analysis

(2005)

Ark:2005a
Statistical Models for Categorical Variables

(2005)

Ark:2002
Hierarchically Related Nonparametric IRT Models, and Practical Data Analysis Methods

(2002)

Sijtsma:2001
Progress in NIRT Analysis of Polytomous Item Scores: Dilemmas and Practical Solutions

(2001)

This paper discusses three open problems in nonparametric polytomous item response theory: (1) theoretically, the latent trait $\theta$ is not stochastically ordered by the observed total score X+; (2) the models do not imply an invariant item ordering; and (3) the regression of an item score on the total score X+ or on the restscore R is not a monotone nondecreasing function and, as a result, it cannot be used for investigating the monotonicity of the item step response function. Tentative solutions for these problems are discussed. The computer program MSP for nonparametric IRT analysis is based on models which neither imply the stochastic ordering property nor an invariant item ordering. Also, MSP uses item-restscore regression for investigating item step response functions. It is discussed whether computer programs may be based (temporarily) on models which lack desirable properties and use methods which are not (yet) supported by sound psychometric theory.
Ark:1999
Contributions to Latent Budget Analysis: A Tool For the Analysis of Compositional Data.

(1999)

Ark:1998
Graphical Display of Latent Budget Analysis and Latent Class Analysis, with Special Reference to Correspondence Analysis

(1998)

Heijden:2002
Some Examples of Latent Budget Analysis and its Extensions

(2002)

Thomas:2002a
APPLYING ITEM RESPONSE THEORY METHODS TO COMPLEX SURVEY DATA

(2002)

Carletta:1996
Assessing agreement on classification tasks: the kappa statistic

Computational Linguistics  22    (1996)

Zou:2004
Sparse principal component analysis

(2004)

Bond:2003b
Measuring Client Satisfaction with Public Education III: Group Effects in Client Satisfaction

Journal of Applied Measurement  4  326-334  (2003)

Bond:2003a
Measuring Client Satisfaction with Public Education II: Comparing Schools with State Benchmarks

Journal of Applied Measurement  4  258-268  (2003)

King:2003
Measuring Client Satisfaction with Public Education I: Meeting Competing Demands in Establishing State-wide Benchmarks

Journal of Applied Measurement  4  111-123  (2003)

Smits:2003a
A Componential IRT Model for Guilt

Multivariate Behavioral Research  38  161-188  (2003)

Jehangir:2005
Evaluation of Relations between Scales in an IRT Framework

(2005)

Schumacher:1996
Neural network and logistic regression. Part I

(1996)

Tricot:2000
Un modèle de réponses aux items. Propriétés et comparaison de groupes de traitement en épidémiologie

Revue de Statistique Appliquée  48  29-39  (2000)

Ricker:2003
Setting Cut Scores: Critical Review of Angoff and Modified-Angoff Methods

(2003)

This paper presents a critical review of the Angoff (1971) and Angoff derived methods, according to criteria for assessing cut score setting methods originally proposed by Berk (1986) and further recommendations by Hambleton (2001). The criteria have been updated to reflect the progress that has been made in standard setting research over the past 17 years. The paper also discusses the assumptions of the Angoff method, and other current issues surrounding this method. Recommendations for using the Angoff method are made.
Sheng:2005
BAYESIAN ANALYSIS OF HIERARCHICAL IRT MODELS: COMPARING AND COMBINING THE UNIDIMENSIONAL & MULTI-UNIDIMENSIONAL IRT MODELS

(2005)

Verstralen:2000
IRT models for subjective weights of options of multiple choice questions

(2000)

Lauritzen:2007
Exchangeable Rasch Matrices

(2007)

Davidson:2006
Bootstrap Inference in a Linear Equation Estimated by Instrumental Variables

(2006)

Festy:2008
MESURES, FORMES ET FACTEURS DE LA PAUVRETÉ. APPROCHES COMPARATIVES

(2008)

Ward:2008
Presence-only data and the EM algorithm

Biometrics      (2008)

Ponocny:2002
On the applicability of some IRT models for repeated measurement designs: Conditions, consequences, and Goodness-of-Fit tests

Methods of Psychological Research Online  7  21-40  (2002)

Rouder:2005
A hierarchical model for estimating response time distributions

Psychonomic Bulletin & Review  12  195-223  (2005)

Castelloe:2007
Power and Sample Size Determination for Linear Models

(2007)

Zubicaray:2007
Support for an auto-associative model of spoken cued recall: Evidence from fMRI

Neuropsychologia  45  824-835  (2007)

Gibbons:2007
The Added Value of Multidimensional IRT Models

(2007)

Diaz:2006
NAEP-QA FY06 Special Study: 12th Grade Math Trend Estimates

(2006)

Keerthi:2002
A fast dual algorithm for kernel logistic regression

(2002)

Bystrom:2007
TASK COMPLEXITY AFFECTS INFORMATION SEEKING AND USE

(2007)

Saxton:2005
Development of a Short Form of the Severe Impairment Battery

American Journal of Geriatric Psychiatry  13    (2005)

Assaf:2007
A new approach for interexaminer reliability data analysis on dental caries calibration

Journal of Applied Oral Science  15    (2007)

Devouche:2003
Les banques d'items. Construction d'une banque pour le Test de Connaissance du Français

Psychologie et Psychométrie  24  57-88  (2003)

Postlethwaite:1993
TORSTEN HUSÉN

Perspectives : revue trimestrielle d'éducation comparée  XXIII  697-707  (1993)

Roju:1995
IRT-Based Internal Measures of Differential Functioning of Items and Tests

Applied Psychological Measurement  19  353-368  (1995)

Jacobusse:2006
An interval scale for development of children aged 0--2 years

Statistics in Medicine  25  2272-2283  (2006)

Howell:2005
A model of grounded language acquisition: Sensorimotor features improve lexical and grammatical learning

Journal of Memory and Language  53  258-276  (2005)

Hastedt:2007
Differences between multiple-choice and constructed response items in PIRLS 2001

(2007)

Devroye:2007
NON-UNIFORM RANDOM VARIATE GENERATION

(2007)

This chapter provides a survey of the main methods in non-uniform random variate generation, and highlights recent research on the subject. Classical paradigms such as inversion, rejection, guide tables, and transformations are reviewed. We provide information on the expected time complexity of various algorithms, before addressing modern topics such as indirectly specified distributions, random processes, and Markov chain methods.
Goldstein:1999
Multilevel statistical models

(1999)

Michailidis:2007
Multilevel Homogeneity Analysis

(2007)

CoE:2005
The Common European Framework

(2005)

Exploratory Measurement Invariance: A New Method Based on Item Response Theory

(2004)

Lunz:2007
Examination Development Guidelines

(2007)

Courville:2004
An empirical comparison of item response theory and classical test theory item/person statistics

(2004)

Keller:2002
Annual College of Education Educational Research Exchange

(2002)

Stage:2007
A Comparison Between Item Analysis Based on Item Response Theory and Classical Test Theory. A Study of the SweSAT Subtest WORD

(2007)

Stage:2003
Classical Test Theory or Item Response Theory: The Swedish Experience

(2003)

Yu:2007
Automation and visualization of distractor analysis using SAS/GRAPH

(2007)

Robinson:2000

25    (2000)

Mojduszka:2000
Consumer Choice of Food Products and the Implications for Price Competition and Government Labeling Policy

(2000)

Garcia-Perez:1999
Fitting Logistic IRT Models: Small Wonder

The Spanish Journal of Psychology  2  74-94  (1999)

Zubairi:2006
Classical And Rasch Analyses Of Dichotomously Scored Reading Comprehension Test Items

Malaysian Journal of ELT Research  2    (2006)

Yamamoto:2002
Estimating PISA students on the IALS prose literacy scale

(2002)

Stewart:2005
Absolute Identification by Relative Judgment

Psychological Review  112  881-911  (2005)

Cazievel:2000
Estimation for the Rasch Model under a linkage structure: a case study

(2000)

Hochheiser:1999

(1999)

E-V-Smith:2006
Book Review: Developing and Validating Multiple-Choice Test Items (3rd ed.)

Applied Psychological Measurement  30  69-72  (2006)

Chen:2006
Verification of Cognitive Attributes Required to Solve the TIMSS-1999 Mathematics Items for Taiwanese Students

(2006)

Shigemasu:2000
Bayesian hierarchical analysis of polytomous item responses

Behaviormetrika  27  51-65  (2000)

Schwarz:1995
What respondents learn from questionnaires: The survey interview and the logic of conversation

International Statistical Review  63  153-177  (1995)

Bryce:1981
Rasch-Fitting

British Educational Research Journal  7    (1981)

The Multidimensional Random Coefficients Multinomial Logit Model

Applied Psychological Measurement  21  1-24  (1997)

Monseur:2007
Equating errors in international surveys in education

(2007)

Brown:2005
The Multidimensional Measure of Conceptual Complexity

(2005)

Mitkov:2005
A computer-aided environment for generating multiple-choice test items

Natural Language Engineering  1  1-17  (2005)

Wu:2006
Modelling Mathematics Problem Solving Item Responses Using a Multidimensional IRT Model

Mathematics Education Research Journal  18  93-113  (2006)

Watson:2006
A Longitudinal Study of Student Understanding of Chance and Data

Mathematics Education Research Journal  18  40-55  (2006)

Stacey:2006
A Case of the Inapplicability of the Rasch Model: Mapping Conceptual Learning

Mathematics Education Research Journal  18  77-92  (2006)

Grimbeek:2006
Surveying Primary Teachers about Compulsory Numeracy Testing: Combining Factor Analysis with Rasch Analysis

Mathematics Education Research Journal  18  27-39  (2006)

Doig:2006
Easier Analysis and Better Reporting: Modelling Ordinal Data in Mathematics Education Research

Mathematics Education Research Journal  18  56-76  (2006)

Applying the Rasch Rating Scale Model to Gain Insights into Students' Conceptualisation of Quality Mathematics Instruction

Mathematics Education Research Journal  18  11-26  (2006)

Willms:2007
A Manual for Conducting Analyses with Data from TIMSS and PISA

(2007)

Dray:2003
Co-inertia analysis and the linking of ecological data tables

Ecology  84  3078-3089  (2003)

Leeuw:1986
Random coefficient models for multilevel analysis

Journal of Educational Statistics  11  57-85  (1986)

Benjamini:2002
John W. Tukey's contributions to multiple comparisons

The Annals of Statistics  30  1576-1594  (2002)

Holmes:2005
Multivariate data analysis: The French way

(2005)

Davier:1997
WINMIRA -- program description and recent enhancements

Methods of Psychological Research - Online  2  25-28  (1997)

Hugonot-Diener:2003
Version abrégée de la severe impairment battery (SIB)

Psychologie & Neuropsychiatrie du Vieillissement  1  273-283  (2003)

CJE:2000

25    (2000)

Antonietti:2006
Mesures objectives de traits latents

(2006)

Antonietti:2004
Comment s'assurer de l'alignement d'un ensemble d'items

(2004)

Antonietti:2003b
Designs de testage incomplets et modèle non-paramétrique de la réponse à l'item

(2003)

Antonietti:2003a
Comment mesurer la similarité entre deux structures factorielles latentes

(2003)

Christensen:2003a
SAS macros for Rasch based latent variable modelling

(2003)

Christensen:2003
Latent Covariates in Generalized Linear Models

(2003)

Walker:2000
Forecasting the political behavior of leaders with the verbs in context system of operational code analysis

(2000)

Camiz:2005
Application de l'analyse factorielle multiple pour le traitement de caractères en échelle dans les enquêtes

(2005)

Claeskens:2007
On local estimating equations in additive multiparameter models

(2007)

Al-Kandari:1993
Variable Selection and Principal Component Analysis

(1993)

Calvo:2007
A Comparative Study of Principal Component Analysis Techniques

(2007)

Goodman:2002
Applied Latent Class Analysis

(2002)

Balidis:2002
Intraobserver and interobserver reliability of the R/D score for evaluation of iris configuration by ultrasound biomicroscopy, in patients with pigment dispersion syndrome

Eye  16  722-726  (2002)

Birkett:1986
Selecting the number of response categories for a Likert-type scale

(1986)

Bhakta:2005
Using item response theory to explore the psychometric properties of extended matching questions examination in undergraduate medical education

BMC Medical Education  5    (2005)

The improved Clinical Global Impression Scale (iCGI): development and validation in depression

BMC Psychiatry  7    (2007)

Revah-Levy:2007

BMC Psychiatry  7    (2007)

Montanari:2000
Independent Factor Discriminant Analysis

(2000)

Schafer:2002
Computational strategies for multivariate linear mixed-effects models with missing values

Journal of Computational and Graphical Statistics  11  437-457  (2002)

Ackerman:1996
Graphical Representation of Multidimensional Item Response Theory Analyses

Applied Psychological Measurement  20  311-329  (1996)

Stein:2007
Calculation of the Kappa Statistic for Inter-rater Reliability: The Case Where Raters Can Select Multiple Responses from a Large Number of Categories

(2007)

Leeuw:2007
Statistics and Probability

(2007)

Graves:1995
The pseudoscience of psychometry and the Bell Curve

Journal of Negro Education  64  277-  (1995)

Cikrikci-Demirtasli:2000
A study of Raven Standard Progressive Matrices test's item measures under classic and item response models: An empirical comparison

(2000)

Stuger:2006
Asymmetric Loss Functions and Sample Size Determination: A Bayesian Approach

Austrian Journal of Statistics  35  57-66  (2006)

Antonietti:2003
Evaluation des compétences en mathématiques en fin de 2e année primaire

(2003)

Charland:1996
Fidélité et validité de la version française du "Children of Alcoholics Screening Test" (CAST)

Revue québécoise de psychologie  17  45-62  (1996)

Wu:2005
Algorithmes et codes R pour la méthode de la pseudo-vraisemblance empirique dans les sondages

Techniques d'enquête  31  261-266  (2005)

Grim:2005
Checking for Nonresponse Bias in Web-Only Surveys of Special Populations using a Mixed-Mode (Web-with-Mail) Design

(2005)

Youngstrom:2002
Reliability Generalization of self-report of emotions when using the Differential Emotions Scale

Educational and Psychological Measurement  62    (2002)

Yin:2000
Assessing the reliability of Beck Depression Inventory scores: Reliability Generalization across studies

Educational and Psychological Measurement  60  201-223  (2000)

Wallace:2002
Reliability Generalization of the Life Satisfaction Index

Educational and Psychological Measurement  62    (2002)

Viswesvaran:2000
Measurement error in "Big Five Factors" personality assessment: Reliability Generalization across studies and measures

Educational and Psychological Measurement  60  224-235  (2000)

Vacha-Haase:2001a
Reliability generalization: Exploring reliability variations on MMPI/MMPI-2 Validity scale scores

Assessment  8  391-401  (2001)

Vacha-Haase:2001
Reliability generalization: Exploring reliability coefficients of MMPI clinical scales scores

Educational and Psychological Measurement  61  45-59  (2001)

Vacha-Haase:2002
Reliability Generalization: Moving toward improved understanding and use of score reliability

Educational and Psychological Measurement  62    (2002)

Vacha-Haase:1998
Reliability generalization: Exploring variance in measurement error affecting score reliability across studies

Educational and Psychological Measurement  58  6-20  (1998)

Thompson:2002b
Stability of the reliability of LibQUAL+™ scores: A "Reliability Generalization" meta-analysis study

Educational and Psychological Measurement  62    (2002)

Reese:2002
A Reliability Generalization study of select measures of adult attachment style

Educational and Psychological Measurement  62    (2002)

Nilsson:2002
Reliability Generalization: An examination of the Career Decision-making Self-efficacy Scale

Educational and Psychological Measurement  62    (2002)

Lane:2002
Expanding reliability generalization methods with KR-21 estimates: An RG study of the Coopersmith Self-esteem Inventory

Educational and Psychological Measurement  62    (2002)

Kieffer:2002
A Reliability Generalization study of the Geriatric Depression Scale (GDS)

Educational and Psychological Measurement  62    (2002)

Henson:2001a
Characterizing measurement error in scores across studies: Some recommendations for conducting "Reliability Generalization" (RG) studies

(2001)

Given the potential value of reliability generalization (RG) studies in the development of cumulative psychometric knowledge, the purpose of this paper is to provide a tutorial on how to conduct such studies and to serve as a guide for researchers wishing to use this methodology. After some brief comments on classical test theory, the paper provides a practical framework for structuring an RG study, including: (1) test selection with an eye toward frequency of test use and reporting practices by authors; (2) development of a coding sheet that will capture potential variation in score reliability across studies; (3) procedural recommendations regarding data collection; (4) identification and use of potential dependent variables; and (5) application of general linear model analyses to the data.
Henson:2001
A reliability generalization study of the Teacher Efficacy Scale and related instruments

Educational and Psychological Measurement  61    (2001)

Henson:2002
Variability and prediction of measurement error in Kolb's Learning Style Inventory scores: A reliability generalization study

Educational and Psychological Measurement  62    (2002)

Helms:1999
Another meta-analysis of the White Racial Identity Attitude Scale's Cronbach alphas: Implications for validity

Measurement and Evaluation in Counseling and Development  32  122-137  (1999)

Hanson:2002
Reliability Generalization of Working Alliance Inventory scale scores

Educational and Psychological Measurement  62    (2002)

Dimitrov:2002
Reliability: Arguments for multiple perspectives and potential problems with generalization across studies

Educational and Psychological Measurement  62    (2002)

Deditius-Island:2002
An examination of the reliability of scores from Zuckerman's Sensation Seeking Scales

Educational and Psychological Measurement  62    (2002)

Caruso:2001a
Reliability of scores from the Eysenck Personality Questionnaire: A Reliability Generalization (RG) study

Educational and Psychological Measurement  61  675-682  (2001)

Caruso:2001
Reliability Generalization of the Junior Eysenck Personality Questionnaire

Personality and Individual Differences  31  173-184  (2001)

A reliability generalization was conducted on the Psychoticism (P), Extraversion (E), Neuroticism (N) and Lie (L) scales of the Junior Eysenck Personality Questionnaire (J-EPQ). Twenty-three studies provided data on 44 samples of children who had been administered the J-EPQ. Score reliability was found to vary significantly both between and within scales. N and L provided the most reliable scores (with median reliabilities of 0.80 and 0.79 respectively) followed by E (median RELIABILITY=0.73) and P (median RELIABILITY=0.68). Scale length was the best predictor of score reliability, but sample gender makeup, language of administration, and the amount of variation in the ages of children in each sample were also significant predictors of reliability for various J-EPQ scales. The results highlight the importance of considering reliability to be a property of scores for a particular group, as opposed to a property of a test generally.
Caruso:2000
Reliability Generalization of the NEO personality scales

Educational and Psychological Measurement  60  236-254  (2000)

Capraro:2002
Myers-Briggs Type Indicator score reliability across studies: A meta-analytic Reliability Generalization study

Educational and Psychological Measurement  62  659-673  (2002)

Voelkle:2007
Effect sizes and F ratios < 1.0

Methodology  3  35-46  (2007)

Standard statistics texts indicate that the expected value of the F ratio is 1.0 (more precisely: N/(N-2)) in a completely balanced fixed-effects ANOVA, when the null hypothesis is true. Even though some authors suggest that the null hypothesis is rarely true in practice (e.g., Meehl, 1990), F ratios < 1.0 are reported quite frequently in the literature. However, standard effect size statistics (e.g., Cohen's f) often yield positive values when F < 1.0, which appears to create confusion about the meaningfulness of effect size statistics when the null hypothesis may be true. Given the repeated emphasis on reporting effect sizes, it is shown that in the face of F < 1.0 it is misleading to only report sample effect size estimates as often recommended. Causes of F ratios < 1.0 are reviewed, illustrated by a short simulation study. The calculation and interpretation of corrected and uncorrected effect size statistics under these conditions are discussed. Computing adjusted measures of association strength and incorporating effect size confidence intervals are helpful in an effort to reduce confusion surrounding results when sample sizes are small. Detailed recommendations are directed to authors, journal editors, and reviewers.
Capraro:2001
Measurement error of scores on the Mathematics Anxiety Rating Scale across studies.

Educational and Psychological Measurement  61  373-386  (2001)

Beretvas:2002
Using mixed-effects models in Reliability Generalization studies

Educational and Psychological Measurement  62    (2002)

Beretvas:2002a
A Reliability Generalization study of the Marlowe-Crowne Social Desirability Scale

Educational and Psychological Measurement  62    (2002)

Barnes:2002
Reliability Generalization of scores on the Spielberger State-trait Anxiety Inventory

Educational and Psychological Measurement  62    (2002)

Steiger:1992
R2: A computer program for interval estimation, power calculation, and hypothesis testing for the squared multiple correlation

Behavior Research Methods, Instruments, and Computers  4  581-582  (1992)

Cumming:2008
Inference by eye: Confidence intervals, and how to read pictures of data

American Psychologist      (2008)

Cumming:2001
A primer on the understanding, use and calculation of confidence intervals that are based on central and noncentral distributions

Educational and Psychological Measurement  61  532-575  (2001)

Algina:2003
Approximate confidence intervals for effect sizes

Educational and Psychological Measurement  63  537-553  (2003)

Vacha-Haase:2004
How to estimate and interpret various effect sizes

Counseling Psychology  51  473-481  (2004)

Thompson:2008a
Complementary methods for research in education

(2008)

Thompson:2008
Research in organizations: Foundational principles, processes, and methods of inquiry

(2008)

Thompson:2002a
What future quantitative social science research could look like: Confidence intervals for effect sizes

Educational Researcher  31  24-31  (2002)

Thompson:2002
"Statistical," "practical," and "clinical": How many kinds of significance do counselors need to consider?

Journal of Counseling and Development  80  64-71  (2002)

Snyder:1993
Evaluating results using corrected and uncorrected effect size estimates

Journal of Experimental Education  61  334-349  (1993)

Rosenthal:1994
The handbook of research synthesis

(1994)

Olejnik:2000
Measures of effect size for comparative studies: Applications, interpretations, and limitations

Contemporary Educational Psychology  25  241-286  (2000)

Kline:2004
Beyond significance testing: Reforming data analysis methods in behavioral research

(2004)

Kirk:2003
Handbook of research methods in experimental psychology

83-105  (2003)

Kirk:1996
Practical significance: A concept whose time has come

Educational and Psychological Measurement  56  746-759  (1996)

Hill:2004
Higher education: Handbook of theory and research

19  175-196  (2004)

Cortina:2000
Effect size for ANOVA designs

(2000)

Thompson:1994
The Concept of Statistical Hypothesis Testing

Measurement Update  4  5-6  (1994)
http://www.coe.tamu.edu/~bthompson/hyptest1.htm
Thompson:1998a
Five methodology errors in educational research: The pantheon of statistical significance and other faux pas

(1998)
Thompson:1999
Common methodology mistakes in educational research, revisited, along with a primer on both effect sizes and the bootstrap

(1999)
Thompson:1998
Statistical significance and effect size reporting: Portrait of a possible future

Research in the Schools  5  33-38  (1998)

Moore:1991
A confirmatory factor analysis of the Threat Index

Journal of Personality and Social Psychology  60  122-129  (1991)

The Threat Index (TI), a measure of death concern grounded in personal construct theory, was submitted to psychometric refinement. The factorability of the TI using the traditional split-match scoring was compared with methods based on Manhattan, Euclidian, standardized Euclidian, and Mahalanobis distance formulas. Statistical and substantive interpretability were enhanced with the standardized Euclidian factor structure. The LISREL VI program was used to determine the best model for the scale in an exploratory factor analysis. A nonhierarchical, G + 3 model met the criterion of goodness of fit >0.9 for the 1st subsample (n = 405). In a confirmatory factor analysis with a 2nd subsample (n = 405), the model was confirmed. Internal consistency and test-retest reliability were acceptable for Global Threat and 3 subfactors--Threat to Well-Being, Uncertainty, and Fatalism--and all subfactors were found to be independent of social desirability.
Agresti:2000
Random effects modeling of categorical response data

(2000)

From:2006
Estimation of the parameters of the Birnbaum-Saunders distribution

Communications in Statistics -- Theory and Methods  35  2157-2169  (2006)

Berg:2007
Variance decomposition using an IRT measurement model

Behavioral Genetics  37  604-616  (2007)

Jackel:2003
A note on multivariate Gauss-Hermite quadrature

(2003)

Boeck:2005
Conceptual and psychometric framework for distinguishing categories and dimensions

Psychological Review  112  129-158  (2005)

Presnell:1994
Resampling methods for sample survey

(1994)

Gonzalez:2006
Numerical integration in logistic-normal models

Computational Statistics & Data Analysis  51  1535-1548  (2006)

Ip:2004
Locally dependent latent trait model for polytomous responses with application to inventory of hostility

Psychometrika  69  191-216  (2004)

Hedeker:2000
Application of item response theory models for longitudinal data

(2000)

Janssen:1999
Confirmatory analyses of componential test structure using multidimensional item response theory

Multivariate Behavioral Research  34  245-268  (1999)

Komarek:2003
Fast robust logistic regression for large sparse datasets with binary outputs

(2003)

Lubke:2000
Factor-analyzing Likert-scale data under the assumption of multivariate normality complicates a meaningful comparison of observed groups or latent classes

(2000)

Leenen:2001
Models for ordinal hierarchical classes analysis

Psychometrika  66  389-404  (2001)

Meulders:2003
A taxonomy of latent structure assumptions for probability matrix decomposition models

Psychometrika  68  61-77  (2003)

Meulders:2005
Latent variable models for partially ordered responses and trajectory analysis of anger-related feelings

British Journal of Mathematical and Statistical Psychology  58  117-143  (2005)

Lawrence:2000
Bayesian inference for ordinal data using multivariate probit models

(2000)

Wilson:2003
On choosing a model for measuring

Methods of Psychological Research Online  8  1-22  (2003)

Wermuth:2000
Analysing social science data with graphical Markov models

(2000)

Tay-Lim:2000
Generating item responses for balanced-incomplete-block (BIB) design using the generalized partial credit model (GPCM)

(2000)

Revelle:1979
Very Simple Structure: An alternative procedure for estimating the optimal number of interpretable factors

Multivariate Behavioral Research  14  403-414  (1979)

Rupp:2007
The development, calibration, and inferential validation of standards-based assessments for English as a first foreign language at the IQB

(2007)

Thacher:2005
Using patient characteristics and attitudinal data to identify depression treatment preference groups: A latent-class model

(2005)

Teresi:2004
Differential item functioning and health assessment

(2004)

Hardouin:2007a
Non parametric item response theory with SAS and Stata

Journal of Statistical Software      (2007)

Norquist:2003
Rasch measurement in the assessment of amyotrophic lateral sclerosis patients

Journal of Applied Measurement  4  249-257  (2003)

Groenen:2006
Visions of 70 years of psychometrics: the past, present, and future

Statistica Neerlandica  60  135-144  (2006)

Martin:2007
On the analysis of Bayesian semiparametric IRT-type models

(2007)

Fox:2005b
Bayesian modification indices for IRT models

Statistica Neerlandica  59  95-106  (2005)

Garcia-Zattera:2005
Conditional independence of multivariate binary data with an application in caries research

(2005)

Finkelman:2007
Using person fit in a body of work standard setting

(2007)

Hamon:2000
Modèle de Rasch et validation de questionnaires de qualité de vie

(2000)

Hamon:2002
Statistical Methods for Quality of Life Studies. Design, Measurement and Analysis

(2002)

Thomas:2002
Intégration de la théorie de la réponse aux items aux modèles par équations structurelles: Comparaison avec une régression fondée sur des scores TRI

(2002)

Fox:2001a
Multilevel IRT: A Bayesian perspective on estimating parameters and testing statistical hypotheses

(2001)

Vermunt:2007
Latent class and finite mixture models for multilevel data sets

Statistical Methods in Medical Research      (2007)

Vermunt:2001a
Modeling joint and marginal distributions in the analysis of categorical panel data

Sociological Methods and Research  30  170-196  (2001)

Vermunt:2001
The use of restricted latent class models for defining and testing nonparametric and parametric IRT models

Applied Psychological Measurement  25  283-294  (2001)

Ark:2005
Stochastic Ordering of the Latent Trait by the Sum Score Under Various Polytomous IRT Models

Psychometrika  70  283-304  (2005)

Fox:2005a
Multilevel IRT using dichotomous and polytomous response data

British Journal of Mathematical and Statistical Psychology  58  145-172  (2005)

Cui:2006
The hierarchy consistency index: A person-fit statistic for the attribute hierarchy model

(2006)

Zijlstra:2007
Outlier detection in test and questionnaire data

Multivariate Behavioral Research      (2007)

Ginkel:2007b
Multiple Imputation of Item Scores in Test and Questionnaire Data, and Influence on Psychometric Results

Multivariate Behavioral Research  42  387-414  (2007)

The performance of five simple multiple imputation methods for dealing with missing data were compared. In addition, random imputation and multivariate normal imputation were used as lower and upper benchmark, respectively. Test data were simulated and item scores were deleted such that they were either missing completely at random, missing at random, or not missing at random. Cronbach's alpha, Loevinger's scalability coefficient H, and the item cluster solution from Mokken scale analysis of the complete data were compared with the corresponding results based on the data including imputed scores. The multiple-imputation methods, two-way with normally distributed errors, corrected item-mean substitution with normally distributed errors, and response function, produced discrepancies in Cronbach's coefficient alpha, Loevinger's coefficient H, and the cluster solution from Mokken scale analysis, that were smaller than the discrepancies in upper benchmark multivariate normal imputation.
Ginkel:2007a
Multiple imputation for item scores when test data are factorially complex

British Journal of Mathematical and Statistical Psychology      (2007)

Multiple imputation under a two-way model with error is a simple and effective method that has been used to handle missing item scores in unidimensional test and questionnaire data. Extensions of this method to multidimensional data are proposed. A simulation study is used to investigate whether these extensions produce biased estimates of important statistics in multidimensional data, and to compare them with lower benchmark listwise deletion, two-way with error and multivariate normal imputation. The new methods produce smaller bias in several psychometrically interesting statistics than the existing methods of two-way with error and multivariate normal imputation. One of these new methods clearly is preferable for handling missing item scores in multidimensional test data.
Rabe-Hesketh:2001b
Multilevel modeling of cognitive function in schizophrenic patients and their first degree relatives

Multivariate Behavioral Research  36  279-298  (2001)

Rossi:2007
Factor analysis of the Dutch-Language version of the MCMI-III

Journal of Personality Assessment  88  144-157  (2007)

Fox:2005
Randomized item response theory models

Journal of Educational and Behavioral Statistics  30  1-24  (2005)

Jong:2007
Using item response theory to measure extreme response style in marketing research: A global investigation

Journal of Marketing Research      (2007)

Petridou:2006
Instability of person misfit and ability estimates subject to assessment modality

(2006)

Hardouin:2007
Mathematical methods for survival analysis, reliability and Quality of life

(2007)

Rabe-Hesketh:2001a
Maximum likelihood estimation of generalized linear model with covariate measurement error

The Stata Journal  1    (2001)

Fox:2004a
Modelling Response Error in School Effectiveness Research

Statistica Neerlandica  58  138-160  (2004)

Fox:2004
Applications of Multilevel IRT Modeling

School Effectiveness and School Improvement  15  261-280  (2004)

Fox:2003
Stochastic EM for Estimating the Parameters of a Multilevel IRT Model

British Journal of Mathematical and Statistical Psychology  56  65-81  (2003)

Fox:2001
Bayesian Estimation of a Multilevel IRT Model using Gibbs Sampling

Psychometrika  66  269-286  (2001)

Fox:2000
Bayesian Modeling of Measurement Error in Predictor Variables Using Item Response Theory

(2000)

Jara:2007
A Dirichlet process mixture model for the analysis of correlated binary responses

Computational Statistics & Data Analysis  51  5402-5415  (2007)

The multivariate probit model is a popular choice for modelling correlated binary responses. It assumes an underlying multivariate normal distribution dichotomized to yield a binary response vector. Other choices for the latent distribution have been suggested, but basically all models assume homogeneity in the correlation structure across the subjects. When interest lies in the association structure, relaxing this homogeneity assumption could be useful. The latent multivariate normal model is replaced by a location and association mixture model defined by a Dirichlet process. Attention is paid to the parameterization of the covariance matrix in order to make the Bayesian computations convenient. The approach is illustrated on a simulated data set and applied to oral health data from the Signal Tandmobiel® study to examine the hypothesis that caries is mainly a spatially local disease.
King:2001
Analyzing incomplete political science data: An alternative algorithm for multiple imputation

American Political Science Review  95  49-69  (2001)

Skrondal:2003
Some applications of generalized linear latent and mixed models in epidemiology: Repeated measures, measurement error and multilevel modeling

Norsk Epidemiologi  13  265-278  (2003)

Pornel:2004
A new statistic to detect misfitting score vector

(2004)

Ginkel:2007
Two-way imputation: A Bayesian method for estimating missing scores in tests and questionnaires, and an accurate approximation

Computational Statistics & Data Analysis  51  4013-4027  (2007)

Rabe-Hesketh:2001
Parametrization of multivariate random effects models for categorical data

Biometrics  57  1256-1264  (2001)

Abswoude:2004a
Mokken scale analysis using hierarchical clustering procedures

Applied Psychological Measurement  28  332-354  (2004)

Abswoude:2004
A comparative study of test data dimensionality assessment procedures under nonparametric IRT models

Applied Psychological Measurement  28  3-24  (2004)

Ark:2001
Relationships and properties of polytomous item response theory models

Applied Psychological Measurement  25  273-282  (2001)

Noortgate:2003
Cross-classification multilevel logistic models in psychometrics

Journal of Educational and Behavioral Statistics  28  369-386  (2003)

Rijmen:2005
A relation between a between-item multidimensional IRT model and the mixture-Rasch model

Psychometrika  70  481-496  (2005)

Martin:2006
IRT models for ability-based guessing

Applied Psychological Measurement  30  183-203  (2006)

Smits:2003
Estimation of the MIRID: A program and a SAS based approach

Behavior Research Methods, Instruments, & Computers  35  537-549  (2003)

Verguts:2001a
Some Mantel-Haenszel tests of Rasch model assumptions

Journal of Mathematical & Statistical Psychology  54  21-37  (2001)

Verguts:2000a
A note on the Martin-Löf test for unidimensionality

Methods of Psychological Research - Online  5  77-82  (2000)

Tuerlinckx:2001
Non-modeled item interactions can lead to distorted discrimination parameters: A case study

Methods of Psychological Research - Online  6  159-174  (2001)

Tuerlinckx:2006
Statistical inference in generalized linear mixed models: A review

British Journal of Mathematical & Statistical Psychology  59  225-255  (2006)

Tuerlinckx:2005
Two interpretations of the discrimination parameter

Psychometrika  70  629-650  (2005)

In this paper we propose two interpretations for the discrimination parameter in the two-parameter logistic model (2PLM). The interpretations are based on the relation between the 2PLM and two stochastic models. In the first interpretation, the 2PLM is linked to a diffusion model so that the probability of absorption equals the 2PLM. The discrimination parameter is the distance between the two absorbing boundaries and therefore the amount of information that has to be collected before a response to an item can be given. For the second interpretation, the 2PLM is connected to a specific type of race model. In the race model, the discrimination parameter is inversely related to the dependency of the information used in the decision process. Extended versions of both models with person-to-person variability in the difficulty parameter are considered. When fitted to a data set, it is shown that a generalization of the race model that allows for dependency between choices and response times (RTs) is the best-fitting model.
Verstralen:2000ab
A DOUBLE HAZARD MODEL FOR MENTAL SPEED

(2000)

The administration of tests via the computer allows the registration of response times along with the actual response. This paper describes a model that combines these two kinds of data to estimate a subject latent variable usually called mental speed, but more appropriately called mental power. The model implies that the expected item score increases with invested time. Nevertheless, it allows for a decreasing expected item score with response time, which is sometimes found in experiments. This paradox is obtained by assuming that a subject not only stops working on a problem because of time pressure, but also when he has solved the problem. The model builds on a familiar framework of IRT models. An MML estimation procedure is developed, and model fit on the item level is evaluated using Lagrange multiplier tests.
Verstralen:1998aa
A Latent IRT Model for Options of Multiple Choice Items

(1998)

A latent IRT model for the analysis of multiple choice questions is proposed. The incorrect options of an item are associated with a decreasing logistic function that models the probability of being judged correct. It is assumed that the correct option is always recognized as such. According to the model a subject selects randomly from the subset of options considered correct. Like its companion treated in Verstralen (1997) the model can be viewed as a generalization of Nedelsky's (1954) method to determine a pass/fail score. With this other model it has in common that the ML latent variable estimator gains some precision compared to binary scoring. Both models also share some other favorable psychometric properties.
Verstralen:1998ab
A Logistic Latent Class Model for Multiple Choice Items

(1998)

A logistic latent class model for the analysis of options of a class of multiple choice items is presented. For each item a set of latent classes with a chain structure is assumed. The probability of latent class membership is modeled by a logistic function. The conditional probability of the observed response, the selection of an option, given the latent class membership is assumed to be constant. The model can be viewed as a generalization of Nedelsky's (1954) method to determine a pass/fail score. Apart from giving a more detailed model on the process of solving a multiple choice item an increase in the precision of latent variable estimates in comparison with binary scoring is achieved. The model is shown to possess some favorable psychometric properties.
Rijn:2000aa
A Selection Procedure for Polytomous Items in Computerized Adaptive Testing

(2000)

In the present study, a procedure which was developed to select dichotomous items in computerized adaptive testing was applied to polytomous items. The aim of this procedure is to select the item with maximum weighted information. In a simulation study, the item information function was integrated over a fixed interval of ability values and the item with the maximum area was selected. This maximum interval information item selection procedure was compared to a maximum point information item selection procedure. No substantial differences between the two item selection procedures were found when computerized adaptive tests were evaluated on bias and root mean square of the ability estimate.
Bechger:2001aa

(2001)

The cluster kappa was proposed by Schouten (1982) as a measure of chance-corrected rater agreement suitable for studies where objects are rated on a categorical scale by two or more judges. We discuss a way to calculate the cluster kappa which is suited even if ratings are missing. Further, we demonstrate how the sampling error of the cluster kappa may be estimated.
Ruitenburg:2006aa
ALGORITHMS FOR PARAMETER ESTIMATION IN THE RASCH MODEL

(2006)

Verschoor:2005aa
An Approximation of Cronbach's α and its Use in Test Assembly

(2005)

In this paper a new approximation of Cronbach's α is presented. It is especially suited in the context of test assembly. Using this approximation, two test assembly models are introduced. Being non-linear models, they are solved by Genetic Algorithms as the commonly used Linear Programming methods cannot be used here. A comparison is made with existing test assembly models.
Maris:2004aa
AN INTRODUCTION TO THE DA-T GIBBS SAMPLER FOR THE TWO-PARAMETER LOGISTIC (2PL) MODEL AND ITS APPLICATION

(2004)

Maris:2003aa
Are attitude items monotone or single peaked? An analysis using Bayesian methods

(2003)

Hickendorff:2005aa
Clustering Nominal Data with Equivalent Categories: a Simulation Study Comparing Restricted GROUPALS and Restricted Latent Class Analysis

(2005)

Bechger:2003ab
Combining classical test theory and item response theory

(2003)

Straetmans:1998aa
Comparison of Test Administration Procedures for Placement Decisions in a Mathematics Course

(1998)

In this study, three different test administration procedures for making placement decisions in adult education were compared: a paper-based test (PBT), a computer-based test (CBT), and a computerized adaptive test (CAT). All tests were prepared from an item response theory calibrated item bank. The subjects were 90 volunteer students from three adult education schools. They were randomly assigned to one of six experimental groups to take two tests which differed in mode of administration. The results indicate that test performance was not differentially affected by the mode of administration and that the CAT always yielded more accurate ability estimates than the two other test administration procedures. The CAT was also found to be capable of making placement decisions with a test that was on average 24% shorter.
Straetmans:2003aa
Computerized Adaptive Testing: What It Is and How It Works

(2003)

Maris:2003ac
Concerning the identification of the 3PL model

(2003)

Beguin:2001aa
Effect of Noncompensatory Multidimensionality on Separate and Concurrent estimation in IRT Observed Score Equating

(2001)

In this article, the results of a simulation study comparing the performance of separate and concurrent estimation of a unidimensional item response theory (IRT) model applied to multidimensional noncompensatory data are reported. Data were simulated according to a two-dimensional noncompensatory IRT model for both equivalent and nonequivalent groups designs. The criteria used were the accuracy of estimating a distribution of observed scores, and the accuracy of IRT observed score equating. In general, unidimensional concurrent estimation resulted in lower or equivalent total error than separate estimation, although there were a few cases where separate estimation resulted in slightly less error than concurrent estimation. Estimates from the correctly specified multidimensional model generally resulted in less error than estimates from the unidimensional model. The results of this study, along with results from a previous study where data were simulated using a compensatory multidimensional model, make clear that multidimensionality of the data affects the relative performance of separate and concurrent estimation, although the degree to which the unidimensional model produces biased results with multidimensional data depends on the type of multidimensionality present.
Bechger:2000aa
Equivalent Linear Logistic Test Models

(2000)

This paper is about the Linear Logistic Test Model (LLTM). We demonstrate that there are infinitely many equivalent ways to specify a model. An implication is that there may well be many ways to change the specification of a given LLTM and achieve the same improvement in model fit. To illustrate this phenomenon we analyze a real data set using a Lagrange multiplier test for the specification of the model.
Maris:2003ab
Equivalent MIRID models

(2003)

Verhelst:2000aa
Estimating the Reliability of a Test from a Single Test Administration

(2000)

The article discusses methods of estimating the reliability of a test from a single test administration. In the first part a review of existing indices is given, supplemented with two heuristics to approximate Guttman's λ4 and a new similar coefficient. Special attention is given to the greatest lower bound, to its meaning as well as to the problems in computing it. In the second part the relation between Cronbach's α and the reliability is studied by means of a factorial model for the item scores. This part gives some useful formulae to appreciate the amount with which the reliability is underestimated when α is used as its estimand. In the last part, the sampling distribution of the indices is investigated by means of two simulation studies, showing that the indices exhibit severe bias, the direction of which depends partly on the factorial structure of the test. For three indices the bias is modeled. The model describes the bias accurately for all cases studied in the simulation studies. It is shown how this bias correction may be applied in the case of a single data set.
Verstralen:2006aa
Explorations in recursive designs

(2006)

Starting from a set of basic designs, more complex designs are created by recursive application of the basic designs. Properties of these designs, and their effects on the accuracy of Rasch CML-parameter estimates are investigated.
Maris:2005aa
FUZZY SET THEORY ⊆ PROBABILITY THEORY?

(2005)

Bechger:2000ab
Identifiability of Non-Linear Logistic Test Models

(2000)

The linear logistic test model (LLTM) specifies the item parameters as a weighted sum of basic parameters. The LLTM is a special case of a more general non-linear logistic test model (NLTM) where the weights are partially unknown. This paper is about the identifiability of the NLTM. Sufficient and necessary conditions for global identifiability are presented for a NLTM where the weights are linear functions, while conditions for local identifiability are shown to require fewer assumptions. It is also discussed how these conditions are checked using an algorithm due to Bekker, Merckens, and Wansbeek (1994). Several illustrations are given.
Huitzing:2004aa
Infeasibility in Automated Test Assembly Models: A Comparison Study of Different Methods

(2004)

Several techniques exist to automatically put together a test meeting a number of specifications. In an item bank, the items are stored with their characteristics. A test is constructed by selecting a set of items that fulfills the specifications set by the test assembler. Test assembly problems are often formulated in terms of a model consisting of restrictions and an objective to be maximized or minimized. A problem arises when it is impossible to construct a test from the item pool that meets all specifications, that is, when the model is not feasible. Several methods exist to handle these infeasibility problems. In this paper, test assembly models resulting from two practical testing programs were reconstructed to be infeasible. These models were analyzed using methods that either forced a solution (Goal programming, Multiple-Goal programming, Greedy Heuristic), that analyzed the causes (Relaxed and Ordered Deletion Algorithm, Integer Randomized Deletion Algorithm, Set Covering and Item Sampling), or that analyzed the causes and used this information to force a solution (Irreducible-Infeasible-Set Solver). Specialized methods like the Integer Randomized Deletion Algorithm, and the Irreducible-Infeasible-Set-Solver performed best. Recommendations about the use of different methods are given.
Verstralen:2000aa
IRT models for subjective weights of options of multiple choice questions.

(2000)

From earlier investigations it was found that the information from Multiple Choice (MC) questions could be increased about fourfold by having the subject indicate the subset of options that he is unable to expose as false. In the present models this approach is generalized by having the subject distribute a number of 'taws' over the options, or draw a line after the options, such that the number of taws given to an option, or the line length, reflects its subjective degree of correctness. It appears that even with values of the relevant parameters that seem modest, the information relative to binary scoring still is in excess of two. This means that with less than half the test length the same accuracy or reliability can be obtained as with binary scoring. With a real data set we found a relative information greater than five. If a few main fallacies can be reflected in the distractors of the items, the model can be applied to identify subjects with one of these fallacies.
Verschoor:2004aa
IRT Test Assembly Using Genetic Algorithms

(2004)

This paper introduces a new class of optimisation methods in test assembly: Genetic Algorithms (GAs). In the first part an overview is given of the concepts and principles of GAs, in the second part they are applied to three commonly used test assembly models using Item Response Theory. Simulation studies are performed in order to find conditions under which GAs can be successfully used.
Eggen:1998ab
Item Selection in Adaptive Testing with the Sequential Probability Ratio Test

(1998)

Computerized adaptive tests (CATs) were originally developed to obtain an efficient estimate of an examinee's ability. For classification problems, applications of the Sequential Probability Ratio Test (Wald, 1947) have been shown to be a promising alternative for testing algorithms which are based on statistical estimation. However, the method of item selection currently being used in these algorithms, which use statistical testing to infer on the examinees, is either random or based on a criterion which is related to optimizing estimates of examinees (maximum (Fisher) information). In this study, an item selection method based on Kullback-Leibler information is presented, which is theoretically more suitable for statistical testing problems and which can improve the testing algorithm for classification problems. Simulation studies were conducted for two- and three-way classification problems, in which item selection based on Fisher information and Kullback-Leibler information were compared. The results of these studies showed that the performance of the testing algorithms with Kullback-Leibler information-based item selection is sometimes better and never worse than that of algorithms with Fisher information-based item selection.
Eggen:2004aa
Loss of Information in Estimating Item Parameters in Incomplete Designs

(2004)

In this paper, the efficiency of conditional maximum likelihood (CML) and marginal maximum likelihood (MML) estimation of the item parameters of the Rasch model in incomplete designs is studied. The use of the concept of F-information (Eggen, 2000) is generalized to incomplete testing designs. The standardized determinant of the F-information matrix is used for a scalar measure of information in a set of item parameters. In this paper, the relation between the normalization of the Rasch model and this determinant is clarified. It is shown that comparing estimation methods with the defined information efficiency is independent of the chosen normalization. In examples, information comparisons are conducted. It is found that for both CML and MML some information is lost in all incomplete designs compared to complete designs. A general trend is that with increasing test booklet length the efficiency of an incomplete to a complete design and also the efficiency of CML compared to MML is increasing. The main difference between CML and MML is seen in relation to the length of the test booklet. It will be demonstrated that with very small booklets, there is a substantial loss in information (about 35%) with CML estimation, while this loss is only about 10% in MML estimation. However, with increasing test length, the differences between CML and MML quickly disappear.
Verhelst:1998aa
Modeling Sums of Binary Responses by the Partial Credit Model

(1998)

The Partial Credit Model (PCM) is sometimes interpreted as a model for stepwise solution of polytomously scored items, where the item parameters are interpreted as difficulties of the steps. It is argued that this interpretation is not justified. A model for stepwise solution is discussed. It is shown that the PCM is suited to model sums of binary responses which are not supposed to be stochastically independent. As a practical result, a statistical test of stochastic independence in the Rasch model is derived.
Hemker:2000aa
On Measurement Properties of Continuation Ratio Models

(2000)

Three classes of polytomous IRT models are distinguished. These classes are the adjacent category models, the cumulative probability models, and the continuation ratio models. So far, the latter class has received relatively little attention. The class of continuation ratio models includes logistic models, such as the sequential model (Tutz, 1990), and non-logistic models, such as the acceleration model (Samejima, 1995) and the nonparametric sequential model (Hemker, 1996). Four measurement properties are discussed. These are monotone likelihood ratio of the total score, stochastic ordering of the latent trait by the total score, stochastic ordering of the total score by the latent trait, and invariant item ordering. These properties have been investigated previously for the adjacent category models and the cumulative probability models, and for the continuation ratio models this is done here. It is shown that stochastic ordering of the total score by the latent trait is implied by all continuation ratio models, while monotone likelihood ratio of the total score and stochastic ordering on the latent trait by the total score are not implied by any of the continuation ratio models. Only the sequential rating scale model implies the property of invariant item ordering. Also, we present a Venn-diagram showing the relationships between all known polytomous IRT models from all three classes.
Eggen:1998aa
On the Loss of Information in Conditional Maximum Likelihood Estimation of Item Parameters

(1998)

In item response models of the Rasch type (Fischer & Molenaar, 1995), item parameters are often estimated by the conditional maximum likelihood (CML) method. This paper addresses the loss of information in CML estimation by using the information concept of F-information (Liang, 1983). This concept makes it possible to specify the conditions for no loss of information and to define a quantification of information loss. For the dichotomous Rasch model, the derivations will be given in detail to show the use of the F-information concept for making efficiency comparisons for different estimation methods. It is shown that by using CML for item parameter estimation, some information is almost always lost. But compared to JML (joint maximum likelihood) as well as to MML (marginal maximum likelihood) the loss is very small. The reported efficiency of CML to JML and to MML in several comparisons is always larger than 93%, and in tests with a length of 20 items or more, larger than 99%.
Eggen:2004ab
Optimal Testing With Easy or Difficult Items in Computerized Adaptive Testing

(2004)

Computerized adaptive tests (CATs) are individualized tests which, from a measurement point of view, are optimal for each individual, possibly under some practical conditions. In the present study it is shown that maximum information item selection in CATs using an item bank which is calibrated with the one- or the two-parameter logistic model, results in each individual answering about 50% of the items correctly. Two item selection procedures giving easier (or more difficult) tests for students are presented and evaluated. Item selection on probability points of items yields good results only with the 1PL model and not with the 2PL model. An alternative selection procedure, based on maximum information at a shifted ability level, gives satisfactory results with both models.
Eggen:2001aa
Overexposure and underexposure of items in computerized adaptive testing

(2001)

Computerized adaptive tests (CATs) have been shown to be considerably more efficient than paper-and-pencil tests. This gain is realized by offering each candidate the most informative item from an available item bank on the basis of the results of items that have already been administered. The item selection methods that are used to compose an optimum test for each individual do, however, have a number of drawbacks. Though a CAT generally presents each candidate with a different test, it often occurs that some items from the item bank are administered very frequently while others are never or hardly ever used. These two problems, i.e., overexposure and underexposure of items, can be eliminated by adding further restrictions to the item selection methods. However, this exposure control will affect the efficiency of the CAT. This paper presents a solution for both problems. The functioning of these methods will be illustrated with the results of simulation research that has been carried out to develop adaptive tests.
Roelofs:2001aa
Preferences for various learning environments: Teachers' and parents' perceptions

(2001)

In the last ten years, a number of innovations, mainly inspired by constructivist notions of learning, have been introduced in various levels of the Dutch educational system. However, constructivist learning environments are rarely implemented. Teachers tend to stick to expository and structured learning environments. This consistent finding requires research in order to gain insight into teachers' preferences for learning environments and to determine the factors that support and impede the realization of these learning environments. Regarding the influence of social backgrounds on student learning, it is also important to take stock of parental views on learning environments. This study is focused on teachers' preferences for learning environments, their reported teaching behavior, and how these match with parents' preferences. Three parallel questionnaires were developed for teachers (n=281), students (n=952), and parents (n=717), measuring preferences and behavior in different levels of education, for three types of learning environments: direct instruction, discovery learning, and authentic pedagogy. The results show that teachers often prefer direct instruction, and seldom promote discovery learning. While teachers sometimes realize authentic pedagogy, constructive learning tasks are seldom used. Teachers' reported practice and parents' preferences for their children appear to correspond reasonably. Results of multiple regression analyses show that the use of the three types of learning environments yield different predictors. For the use of discovery learning and authentic pedagogy, confidence in students' regulative skills is an important predictor. In predicting the use of direct instruction, the teacher's own conception of learning turns out to be an important predictor.
Bechger:2004aa
STRUCTURAL EQUATION MODELLING OF MULTIPLE FACET DATA: EXTENDING MODELS FOR MULTITRAIT-MULTIMETHOD DATA

(2004)

This paper is about the structural equation modelling of quantitative measures that are obtained from a multiple facet design. A facet is simply a set consisting of a finite number of elements. It is assumed that measures are obtained by combining each element of each facet. Methods and traits are two such facets, and a multitrait-multimethod study is a two-facet design. We extend models that were proposed for multitrait-multimethod data by Wothke (1984;1996) and Browne (1984, 1989, 1993), and demonstrate how they can be fitted using standard software for structural equation modelling. Each model is derived from the model for individual measurements in order to clarify the first principles underlying each model.
Verhelst:2002aa
Testing the unidimensionality assumption of the Rasch model

(2002)

Statistical tests especially designed to test the unidimensionality axiom of the Rasch model are scarce. For two of them, the Martin-Löf test (ML-test) and the splitter-item-technique, an extensive power analysis has been carried out, showing clearly the superiority of the ML-test. The disadvantage of the ML-test, however, is that its null distribution deviates strongly from the asymptotic chi-square distribution unless one has huge samples. A new test with one degree of freedom is proposed. Its power is superior to that of the ML-test, and its null distribution converges rapidly to the chi-square.
Verstralen:2001aa
The Combined Use of Classical Test Theory and Item Response Theory

(2001)

The present paper is about a number of relations between concepts of models from classical test theory (CTT), such as reliability, and item response theory (IRT). It is demonstrated that the use of IRT models allows us to extend the range of applications of CTT, and investigate relations among concepts that are central in CTT such as reliability and item-test correlation.
Bechger:2003ac
The componential Nedelsky model: A first exploration

(2003)

Bechger:2003aa
The Nedelsky model for multiple choice items

(2003)

Two methods for the practical analysis of rating data

(2003)

Verguts:2001
Some Mantel-Haenszel tests of Rasch model assumptions

British Journal of Mathematical and Statistical Psychology  54  21-37  (2001)

Verguts:2000
A note on the Martin-Löf test for unidimensionality

Methods of Psychological Research Online  5    (2000)

Prieto:2003
Classical test theory versus Rasch analysis for quality of life questionnaire reduction

Health and Quality of Life Outcomes  1    (2003)

Jiao:2004
Evaluating the dimensionality of the Michigan English Language Assessment Battery

2  27-52  (2004)

Rizopoulos:2005
Nonlinear effects in generalized latent variable models

(2005)

Smith:2007
A Rasch and factor analysis of the Functional Assessment of Cancer Therapy-General (FACT-G)

Health and Quality of Life Outcomes  5    (2007)

Raiche:2005
Critical eigenvalue sizes in standardized residual principal components analysis

Rasch Measurement Transactions  19  1012  (2005)

Smith:2005
Rasch analysis of the dimensional structure of the hospital anxiety and depression scale

Psycho-Oncology      (2005)

Flieller:1994
Méthodes d'étude de l'adéquation au modèle logistique à un paramètre (modèle de Rasch)

Mathématiques et Sciences Humaines  127  19-47  (1994)

Orlando:2000
Critical issues to address when applying Item Response Theory (IRT) models

(2000)

Bock:1970
Fitting a response model for n dichotomously scored items

Psychometrika  35  179-197  (1970)

Socan:2000
Assessment of reliability when test items are not essentially tau-equivalent

(2000)

Junker:1996
Exploring monotonicity in polytomous item response data

(1996)

Junker:2000a
Nonparametric IRT in Action: An overview of the special issue

(2000)

Linardakis:1996
An approach to multidimensional item response modeling

(1996)

Junker:2000
Monotonicity and conditional independence in models for student assessment and attitude measurement

(2000)

Rudner:2001
Measurement Decision Theory

(2001)

Bentler:2004
Maximal reliability for unit-weighted composites

(2004)

Sijtsma:1994
A survey of theory and methods of invariant item ordering

(1994)

Mazor:1995
Using logistic regression and the Mantel-Haenszel with multiple ability estimates to detect Differential Item Functioning

Journal of Educational Measurement  32  131-144  (1995)

Zwick:1990
When do item response function and Mantel-Haenszel definitions of Differential Item Functioning coincide?

Journal of Educational Statistics  15  185-197  (1990)

Shapiro:2000
The asymptotic bias of minimum trace factor analysis, with applications to the greatest lower bound to reliability

Psychometrika  65  413-425  (2000)

Callender:1979
An empirical comparison of coefficient alpha, Guttman's lambda-2, and MSPLIT maximized split-half reliability estimates

Journal of Educational Measurement  16  89  (1979)

Algorithms for computerized test construction using classical item parameters

Journal of Educational Statistics  14  279-290  (1989)

Armstrong:1998
Optimization of classical reliability in test construction

Journal of Educational and Behavioral Statistics  23  1-17  (1998)

Raiche:2002a
La simulation d'un test adaptatif basé sur le modèle de Rasch

(2002)

Kintsch:1999
The role of long-term memory in text comprehension

Psychologia  42  186-198  (1999)

Hermann:1999
Assessing leadership style: A trait analysis

(1999)

Latent semantic indexing: A probabilistic analysis

(1997)

Deerwester:1990
Indexing by latent semantic analysis

Journal of the American Society for Information Science  41  391-407  (1990)

Leeuw:2003
Principal component analysis with binary data. Applications to roll-call analysis

(2003)

Lawrence:2005
Probabilistic non-linear principal component analysis with Gaussian process latent variable models

Journal of Machine Learning Research  6  1783-1816  (2005)

Kemkes:2006
Objective scoring for computing competition tasks

(2006)

Eggen:2005

(2005)

Krzanowski:2006
Sensitivity in metric scaling and analysis of distance

Biometrics  62  239-244  (2006)

Hofmann:1999
Probabilistic latent semantic analysis

(1999)

Farahat:2006
Improving probabilistic latent semantic analysis with principal component analysis

(2006)

Allegre:2003
Un système d'observation et d'analyse en direct de séances d'enseignement

(2003)

Rehder:1998
Using latent semantic analysis to assess knowledge: Some technical considerations

(1998)

Wolfe:1998
Learning from text: Matching readers and texts by latent semantic analysis

Discourse Processes  25  309-336  (1998)

Foltz:1998
The measurement of textual coherence with latent semantic analysis

Discourse Processes  25  285-307  (1998)

Landauer:1998
An introduction to latent semantic analysis

Discourse Processes  25  259-284  (1998)

Huang:2003
Psychometric analyses based on evidence-centered design and cognitive science of learning to explore students' problem-solving in physics

(2003)

Lazarevska:2005
The distinctive language of terrorists

(2005)

Zumbo:1999
A handbook on the theory and methods of Differential Item Functioning (DIF)

(1999)

Agresti:2005
Bayesian inference for categorical data analysis

Statistical Methods and Applications (Journal of the Italian Statistical Society)      (2005)

Davis:2002
Strategies for controlling item exposure in computerized adaptive testing with polytomously scored items

(2002)

Landauer:1997
How Well Can Passage Meaning be Derived without Using Word Order? A Comparison of Latent Semantic Analysis and Humans

(1997)

Landauer:2004
From paragraph to graph: latent semantic analysis for information visualization

Proceedings of the National Academy of Sciences USA  101  5214-5219  (2004)

Most techniques for relating textual information rely on intellectually created links such as author-chosen keywords and titles, authority indexing terms, or bibliographic citations. Similarity of the semantic content of whole documents, rather than just titles, abstracts, or overlap of keywords, offers an attractive alternative. Latent semantic analysis provides an effective dimension reduction method for the purpose that reflects synonymy and the sense of arbitrary word combinations. However, latent semantic analysis correlations with human text-to-text similarity judgments are often empirically highest at approximately 300 dimensions. Thus, two- or three-dimensional visualizations are severely limited in what they can show, and the first and/or second automatically discovered principal component, or any three such for that matter, rarely capture all of the relations that might be of interest. It is our conjecture that linguistic meaning is intrinsically and irreducibly very high dimensional. Thus, some method to explore a high dimensional similarity space is needed. But the $2.7 \times 10^7$ projections and infinite rotations of, for example, a 300-dimensional pattern are impossible to examine. We suggest, however, that the use of a high dimensional dynamic viewer with an effective projection pursuit routine and user control, coupled with the exquisite abilities of the human visual system to extract information about objects and from moving patterns, can often succeed in discovering multiple revealing views that are missed by current computational algorithms. We show some examples of the use of latent semantic analysis to support such visualizations and offer views on future needs.
Bianco:2005
Modélisation des processus de hiérarchisation et d'application de macrorègles et conception d'un prototype d'aide au résumé

(2005)

Laham:1997
Latent semantic analysis approaches to categorization

979  (1997)

Bestgen:2002
L'analyse sémantique latente et l'identification des métaphores

(2002)

Hernandez:2006
A Procedure for Estimating Intrasubject Behavior Consistency

Educational and Psychological Measurement  66  417-434  (2006)

Trait psychology implicitly assumes consistency of the personal traits. Mischel, however, argued against the idea of a general consistency of human beings. The present article aims to design a statistical procedure based on an adaptation of the $\pi^*$ statistic to measure the degree of intraindividual consistency independently of the measure used. Three studies were carried out for testing the suitability of the $\pi^*$ statistic and the proportion of subjects who act consistently. Results have shown the appropriateness of the statistic proposed and that the percentage of consistent individuals depends on whether test items can be assumed as equivalents and the number of response alternatives they contained. The results suggest that the percentage of consistent subjects is far from 100%, and this percentage decreases when items are equivalent. Moreover, the greater the number of response options, the lesser the percentage of consistent individuals.
Revuelta:2004
Analysis of distractor difficulty in Multiple-Choice items

Psychometrika  69  217-234  (2004)

Two psychometric models are presented for evaluating the difficulty of the distractors in multiple-choice items. They are based on the criterion of rising distractor selection ratios, which facilitates interpretation of the subject and item parameters. Statistical inferential tools are developed in a Bayesian framework: modal a posteriori estimation by application of an EM algorithm and model evaluation by monitoring posterior predictive replications of the data matrix. An educational example with real data is included to exemplify the application of the models and compare them with the nominal categories model.
Wang:1998
An ANOVA-like Rasch analysis of differential item functioning

(1998)

Blais:2003
Une étude de l'accord et de la fidélité inter juges comparant un modèle de la théorie de la généralisabilité et un modèle de la famille de Rasch

(2003)

Raiche:2002
Objective measurement, Theory into practice

6    (2002)

Bailey:2001
Ideal point estimation with a small number of votes: A random-effects approach

Political Analysis  9  192-210  (2001)

Youness:2004
Contributions à une méthodologie de comparaison de partitions

(2004)

Way:2006
Practical questions in introducing computerized adaptive testing for K-12 assessments

(2006)

Schein:2003
A generalized linear model for principal component analysis of binary data

(2003)

Hardouin:2005
Construction d'échelles d'items unidimensionnelles en qualité de vie

(2005)

Klein:2005
Graphical models for panel studies, illustrated on data from the Framingham Heart Study

(2005)

Partchev:2004
A visual guide to item response theory

(2004)

Gruijter:2005
Statistical test theory for education and psychology

(2005)

Boeck:2004
Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach

(2004)
Rijmen:2003
A nonlinear mixed model framework for item response theory

Psychological Methods  8  185-205  (2003)

Mixed models take the dependency between observations based on the same cluster into account by introducing 1 or more random effects. Common item response theory (IRT) models introduce latent person variables to model the dependence between responses of the same participant. Assuming a distribution for the latent variables, these IRT models are formally equivalent with nonlinear mixed models. It is shown how a variety of IRT models can be formulated as particular instances of nonlinear mixed models. The unifying framework offers the advantage that relations between different IRT models become explicit and that it is rather straightforward to see how existing IRT models can be adapted and extended. The approach is illustrated with a self-report study on anger.
May:2006
A multilevel Bayesian item response theory method for scaling socioeconomic status in international studies of education

Journal of Educational and Behavioral Statistics  31  63-79  (2006)

A new method is presented and implemented for deriving a scale of socioeconomic status (SES) from international survey data using a multilevel Bayesian item response theory (IRT) model. The proposed model incorporates both international anchor items and nation-specific items and is able to (a) produce student family SES scores that are internationally comparable, (b) reduce the influence of irrelevant national differences in culture on the SES scores, and (c) effectively and efficiently deal with the problem of missing data in a manner similar to Rubin's (1987) multiple imputation approach. The results suggest that this model is superior to conventional models in terms of its fit to the data and its ability to use information collected via international surveys.
Borsboom:2004fj
The concept of validity

Psychological Review  111  1061-1071  (2004)
http://users.fmg.uva.nl/dborsboom/papers.htm
This article advances a simple conception of test validity: A test is valid for measuring an attribute if (a) the attribute exists and (b) variations in the attribute causally produce variation in the measurement outcomes. This conception is shown to diverge from current validity theory in several respects. In particular, the emphasis in the proposed conception is on ontology, reference, and causality, whereas current validity theory focuses on epistemology, meaning, and correlation. It is argued that the proposed conception is not only simpler but also theoretically superior to the position taken in the existing literature. Further, it has clear theoretical and practical implications for validation research. Most important, validation research must not be directed at the relation between the measured attribute and other attributes but at the processes that convey the effect of the measured attribute on the test scores.
Bond:2003
Validity and assessment: a Rasch measurement perspective

Metodologia de las Ciencias del Comportamiento  5  179-194  (2003)

This paper argues that the Rasch model, unlike the other models generally referred to as IRT models, and those that fall into the tradition of True Score models, encompasses a set of rigorous prescriptions for what scientific measurement would be like if it were to be achieved in the social sciences. As a direct consequence, the Rasch measurement approach to the construction and monitoring of variables is sensitive to the issues raised in Messick's (1995) broader conception of construct validity. The theory / practice dialectic (Bond & Fox, 2001) ensures that validity is foremost in the mind of those developing measures and that genuine scientific measurement is foremost in the minds of those who seek valid outcomes from assessment. Failures of invariance, such as those referred to as DIF, should alert researchers to the need to modify assessment procedures or the substantive theory under investigation, or both.