This paper applies neural Item Response Theory (IRT) models to the automatic detection of biased, misfitting, and redundant items in Large Language Model (LLM) benchmarks, comparing them against classical psychometric methods on synthetic datasets. The results show that classical methods remain highly effective for automated benchmark curation.
Key findings
Classical psychometric methods achieve high AUC scores for detecting biased items.
Mokken coefficients offer near-instantaneous detection, while 2PL IRT achieves the highest accuracy.
Neural IRT shows promise but, without extensive hyperparameter tuning, underperforms the classical methods.
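The near-instantaneous Mokken check mentioned above can be sketched as per-item Loevinger H coefficients computed directly from a binary response matrix. The function name and the toy data below are illustrative, not from the paper; an item with a low or negative H_i conflicts with the item ordering implied by the rest of the scale, which is the signal used for flagging.

```python
def item_loevinger_h(X):
    """Per-item Loevinger H (Mokken scalability) for a binary response
    matrix X (rows = respondents, columns = items).

    H_i = sum_j cov(X_i, X_j) / sum_j covmax(X_i, X_j) over j != i,
    where covmax = min(p_i, p_j) - p_i * p_j is the largest covariance
    two Bernoulli variables with those marginals can attain.
    """
    n, k = len(X), len(X[0])
    p = [sum(row[j] for row in X) / n for j in range(k)]  # item popularities

    def cov(i, j):
        m = sum(row[i] * row[j] for row in X) / n  # joint "both correct" rate
        return m - p[i] * p[j]

    H = []
    for i in range(k):
        num = sum(cov(i, j) for j in range(k) if j != i)
        den = sum(min(p[i], p[j]) - p[i] * p[j] for j in range(k) if j != i)
        H.append(num / den if den else float("nan"))
    return H
```

On perfect Guttman data every H_i equals 1, while a miscoded (reverse-keyed) item drops below zero. The check needs only item frequencies and pairwise co-occurrence counts, which is why it is essentially instantaneous compared with fitting a full IRT model.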
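For contrast, the 2PL model behind the highest-accuracy detector can be sketched in its standard formulation (the paper's exact fitting pipeline is not reproduced here). Each item j has a discrimination a_j and a difficulty b_j; once fitted, an item whose discrimination is near zero carries almost no Fisher information at any ability level, which is one way a 2PL fit exposes redundant or misfitting items.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability that a respondent with ability theta answers
    an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p).
    Low-discrimination items are uninformative across the ability scale."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)
```

For example, at theta = b the response probability is 0.5 for any discrimination, but an item with a = 2 yields far more information there than one with a = 0.2.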
Limitations & open questions
Experiments use synthetic data; real LLM benchmarks may exhibit more complex patterns.
Neural IRT hyperparameters were not extensively tuned.