This paper proposes a methodological framework for evaluating LLM agent reasoning on multi-hop biomedical knowledge graph (KG) traversals, focusing on reasoning fidelity, path faithfulness, and clinical safety. It introduces a taxonomy of reasoning patterns, a multi-dimensional evaluation protocol, and a benchmark suite for systematic assessment.
Key findings
Proposes a comprehensive framework for evaluating LLM agents on biomedical KG traversal tasks.
Introduces a taxonomy of multi-hop reasoning patterns specific to biomedical KGs.
Develops a benchmark suite with clinically validated questions requiring 2-5 hop reasoning.
Presents novel metrics for hallucination detection and an analysis correlating graph topological features (e.g., node degree, path length) with reasoning difficulty.
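The core idea behind path faithfulness and hallucination detection can be illustrated with a minimal sketch: check each hop an agent cites against the KG's triples. The entity and relation names below are hypothetical examples, not drawn from the paper, and the paper's actual metrics are more elaborate than this binary check.

```python
# Minimal sketch of path-faithfulness checking against a biomedical KG.
# The KG is represented as a set of (head, relation, tail) triples;
# all names here are illustrative placeholders.
kg = {
    ("metformin", "targets", "AMPK"),
    ("AMPK", "regulates", "glucose_metabolism"),
    ("metformin", "treats", "type_2_diabetes"),
}

def path_faithful(chain, kg):
    """True iff every hop the agent asserts exists in the KG."""
    return all(triple in kg for triple in chain)

def hallucinated_hops(chain, kg):
    """Return the asserted hops that are absent from the KG."""
    return [t for t in chain if t not in kg]

# A 2-hop chain linking a drug to a downstream process.
good = [("metformin", "targets", "AMPK"),
        ("AMPK", "regulates", "glucose_metabolism")]
bad = good + [("AMPK", "treats", "obesity")]  # fabricated edge

print(path_faithful(good, kg))     # True
print(hallucinated_hops(bad, kg))  # [('AMPK', 'treats', 'obesity')]
```

A per-hop check like this also yields a graded score (fraction of faithful hops), which is closer in spirit to a continuous hallucination metric than the all-or-nothing `path_faithful` test.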
Limitations & open questions
The framework's effectiveness is contingent on the quality and coverage of the biomedical KGs used.
The evaluation metrics may need further refinement as more complex reasoning tasks are identified.