This paper applies neural Item Response Theory (IRT) models to the automatic detection of biased, misfitting, and redundant items in Large Language Model (LLM) benchmarks, comparing them against classical psychometric methods on synthetic datasets. The results show that classical methods remain highly effective for automated benchmark curation.
Key findings
Classical psychometric methods achieve high AUC scores for detecting biased items.
Mokken coefficients offer near-instantaneous detection, while 2PL IRT achieves the highest accuracy.
Neural IRT shows promise but, without extensive hyperparameter tuning, underperforms the classical methods.
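The near-instantaneous Mokken check mentioned above can be sketched as per-item Loevinger H coefficients computed directly from a binary response matrix. The function name and the toy data below are illustrative, not from the paper; an item with a low or negative H_i conflicts with the item ordering implied by the rest of the scale, which is the signal used for flagging.

```python
def item_loevinger_h(X):
    """Per-item Loevinger H (Mokken scalability) for a binary response
    matrix X (rows = respondents, columns = items).

    H_i = sum_j cov(X_i, X_j) / sum_j covmax(X_i, X_j) over j != i,
    where covmax = min(p_i, p_j) - p_i * p_j is the largest covariance
    two Bernoulli variables with those marginals can attain.
    """
    n, k = len(X), len(X[0])
    p = [sum(row[j] for row in X) / n for j in range(k)]  # item popularities

    def cov(i, j):
        m = sum(row[i] * row[j] for row in X) / n  # joint "both correct" rate
        return m - p[i] * p[j]

    H = []
    for i in range(k):
        num = sum(cov(i, j) for j in range(k) if j != i)
        den = sum(min(p[i], p[j]) - p[i] * p[j] for j in range(k) if j != i)
        H.append(num / den if den else float("nan"))
    return H
```

On perfect Guttman data every H_i equals 1, while a miscoded (reverse-keyed) item drops below zero. The check needs only item frequencies and pairwise co-occurrence counts, which is why it is essentially instantaneous compared with fitting a full IRT model.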
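For contrast, the 2PL model behind the highest-accuracy detector can be sketched in its standard formulation (the paper's exact fitting pipeline is not reproduced here). Each item j has a discrimination a_j and a difficulty b_j; once fitted, an item whose discrimination is near zero carries almost no Fisher information at any ability level, which is one way a 2PL fit exposes redundant or misfitting items.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability that a respondent with ability theta answers
    an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p).
    Low-discrimination items are uninformative across the ability scale."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)
```

For example, at theta = b the response probability is 0.5 for any discrimination, but an item with a = 2 yields far more information there than one with a = 0.2.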
Limitations & open questions
Experiments use synthetic data; real LLM benchmarks may exhibit more complex patterns.
Neural IRT hyperparameters were not extensively tuned.