What question did this study set out to answer?

This research aims to evaluate the effectiveness of e-values in controlling the false discovery rate in differential item functioning detection.

April 18, 2026

Controlling the False Discovery Rate in DIF Detection With e-Values: Evidence From Multidimensional and Testlet Simulations

Key Points

The study evaluates how well e-values control the false discovery rate (FDR) in differential item functioning (DIF) detection.
Two simulation studies were conducted, one under multidimensional contamination and one under testlet-based local dependence.
e-BH procedures were applied using K-fold and Multisplit likelihood-ratio e-values.
e-BH performance was compared against classical procedures such as Benjamini–Hochberg (BH) and Holm.
e-BH consistently provided stronger control of Type I error and FDR than the classical methods.
Classical p-value methods showed inflated Type I error as sample size increased.
An empirical application to PIRLS data showed that e-BH produced a more defensible set of DIF flags than traditional approaches.

Abstract

This study presents the first application of e-value–based false discovery rate (FDR) control to Differential Item Functioning (DIF) detection, addressing long-standing limitations of p-value-based approaches when model assumptions are violated, for example under multidimensionality, local item dependence, or extreme sample sizes. Two comprehensive simulation studies were conducted to evaluate e-BH procedures (the e-value analogue of the Benjamini–Hochberg [BH] procedure), using K-fold and Multisplit likelihood-ratio e-values, under (a) multidimensional contamination and (b) testlet-based local dependence. Across both scenarios, e-BH consistently provided stronger and more stable control of Type I error, FDR, and family-wise error rate (FWER) than classical procedures such as BH and Holm. Even under severe model misspecification, e-BH maintained substantially lower false-positive rates while remaining relatively competitive in terms of Type II error. A key finding concerns sample size: classical p-value methods exhibited inflation of Type I error as N increased, whereas e-BH preserved stable error control due to its model-agnostic calibration. An empirical application using Progress in International Reading Literacy Study (PIRLS) data further demonstrated that e-BH produces a more defensible and operationally sustainable set of DIF flags than traditional approaches. Together, these results establish e-values as a powerful and robust evidential tool for DIF detection in modern assessment contexts.
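
To make the comparison in the abstract concrete, the following is a minimal Python sketch of the generic e-BH and BH decision rules. It is an illustration under stated assumptions, not the authors' implementation: the function names `e_bh` and `bh` are hypothetical, the item-level e-values and p-values are invented for the example, and the construction of the K-fold or Multisplit likelihood-ratio e-values is not shown. With m items and level alpha, e-BH flags the k* items with the largest e-values, where k* is the largest k such that the k-th largest e-value is at least m / (alpha * k).

```python
def e_bh(e_values, alpha=0.05):
    """Generic e-BH rule: returns one boolean flag per hypothesis."""
    m = len(e_values)
    # Item indices ordered from largest to smallest e-value.
    order = sorted(range(m), key=lambda i: e_values[i], reverse=True)
    k_star = 0
    for k, idx in enumerate(order, start=1):
        # The k-th largest e-value must reach the threshold m / (alpha * k).
        if e_values[idx] >= m / (alpha * k):
            k_star = k
    flagged = set(order[:k_star])
    return [i in flagged for i in range(m)]


def bh(p_values, alpha=0.05):
    """Classical Benjamini-Hochberg step-up rule, shown for comparison."""
    m = len(p_values)
    # Item indices ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_star = 0
    for k, idx in enumerate(order, start=1):
        # The k-th smallest p-value must fall below alpha * k / m.
        if p_values[idx] <= alpha * k / m:
            k_star = k
    flagged = set(order[:k_star])
    return [i in flagged for i in range(m)]


# Invented values for five hypothetical items (not data from the study).
example_e = [200.0, 60.0, 1.5, 0.8, 30.0]
example_p = [0.001, 0.012, 0.6, 0.9, 0.025]

print(e_bh(example_e))  # [True, True, False, False, False]
print(bh(example_p))    # [True, True, False, False, True]
```

In this toy example the e-BH thresholds at alpha = 0.05 are 100, 50, 33.3, 25, and 20, so only the two largest e-values are flagged, while BH also flags the fifth item. The numbers are chosen only to show how the two rules operate and say nothing about their relative error rates.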

Authors

Shan Huang

Chungwoon University

David Goretzko

Goethe University Frankfurt

Journals

Educational and Psychological Measurement

Institutions

Goethe University Frankfurt

Chungwoon University
