This paper benchmarks Large Language Models on zero-shot citation extraction across multiple scholarly platforms, languages, and citation conventions, identifying systematic failure modes and proposing prompting strategies for robust multi-style parsing.
Key findings
LLMs like GPT-4 and Claude show potential for parsing non-English citation styles.
Baseline parsers struggle with compound surnames, German compound nouns, and archival citations.
Systematic failure modes include mis-segmenting compound surnames and mishandling multilingual content.
Prompting strategies can improve robustness across citation conventions (a minimal prompt sketch follows this list).
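To make the prompting idea concrete, here is a minimal sketch of a zero-shot extraction prompt that asks a model to return citation fields as JSON and explicitly warns it about compound surnames. The template wording, the field schema, and the `call`-free design (the model reply is mocked) are illustrative assumptions, not the paper's actual prompts.

```python
import json

# Hypothetical zero-shot prompt template; the paper's actual prompts may differ.
PROMPT_TEMPLATE = """Extract the bibliographic fields from the reference below.
Return ONLY a JSON object with the keys: authors (list of "Family, Given"
strings), year, title, container, pages. Use null for missing fields.
Do not split compound surnames (e.g. "García Márquez" is one family name).

Reference: {reference}"""

def build_prompt(reference: str) -> str:
    """Fill the zero-shot template with a single raw reference string."""
    return PROMPT_TEMPLATE.format(reference=reference)

def parse_response(raw: str) -> dict:
    """Parse the model's reply, tolerating stray text around the JSON object."""
    start, end = raw.find("{"), raw.rfind("}") + 1
    return json.loads(raw[start:end])

if __name__ == "__main__":
    ref = ("García Márquez, G. (1985). El amor en los tiempos del cólera. "
           "Editorial Oveja Negra.")
    print(build_prompt(ref))
    # A well-behaved model reply might look like this (mocked here):
    reply = ('{"authors": ["García Márquez, Gabriel"], "year": 1985, '
             '"title": "El amor en los tiempos del cólera", '
             '"container": "Editorial Oveja Negra", "pages": null}')
    print(parse_response(reply)["authors"])
```

Pinning the output to a fixed JSON schema is one simple way such a strategy can normalize behavior across citation conventions, since the model is steered toward field names rather than style-specific punctuation.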
Limitations & open questions
Evaluation is limited to three datasets, which may not capture the full range of citation style variation.
LLMs may struggle with precise field boundaries and formatting consistency (a boundary-checking sketch follows).
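One cheap way to surface the field-boundary issue is to check extracted values back against the source string. The sketch below assumes fields should appear verbatim in the reference, which real citation styles only approximately satisfy; the helper name and example data are hypothetical.

```python
def check_field_boundaries(reference: str, fields: dict) -> dict:
    """Flag extracted field values that do not occur verbatim in the source
    reference, a rough proxy for boundary or formatting errors."""
    problems = {}
    for name, value in fields.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            if v is not None and str(v) not in reference:
                problems.setdefault(name, []).append(v)
    return problems

if __name__ == "__main__":
    ref = ("Müller-Lüdenscheidt, H. (2019). Archivgut und "
           "Verwaltungsgeschichte. Zeitschrift für Archivwesen, 12(3), 45-67.")
    extracted = {"authors": ["Müller, H."],  # surname truncated by the model
                 "year": "2019",
                 "title": "Archivgut und Verwaltungsgeschichte"}
    print(check_field_boundaries(ref, extracted))
    # -> {'authors': ['Müller, H.']}: the compound surname was split
```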