This paper introduces a novel method for detecting persona drift in Large Language Models (LLMs) during multi-turn conversations. The method applies embedding trajectory analysis to the model's hidden states, tracking how the persona representation shifts from turn to turn and giving early warning of consistency violations before they appear in the output text.
Key findings
Proposes embedding trajectory analysis for real-time persona drift detection.
Develops three drift detection mechanisms: cosine similarity, trajectory curvature, and embedding velocity.
Demonstrates early warning capabilities, detecting drift multiple turns before it appears in output text.
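To make the three signals named above concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes each conversation turn has already been reduced to one persona embedding (for example a pooled hidden state), and it uses common textbook definitions: cosine distance to a reference persona embedding, step length between consecutive turns for velocity, and the turning angle between consecutive steps for curvature. The function names, the thresholding helper, and the example values are all hypothetical.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def _norm(a: Vector) -> float:
    return math.sqrt(_dot(a, a))


def _sub(a: Vector, b: Vector) -> List[float]:
    return [x - y for x, y in zip(a, b)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b (0.0 if either vector is zero)."""
    denom = _norm(a) * _norm(b)
    return _dot(a, b) / denom if denom > 0 else 0.0


def cosine_drift(trajectory: Sequence[Vector], reference: Vector) -> List[float]:
    """Per-turn cosine distance (1 - similarity) from the reference persona."""
    return [1.0 - cosine_similarity(e, reference) for e in trajectory]


def velocity(trajectory: Sequence[Vector]) -> List[float]:
    """Euclidean length of the step between consecutive turn embeddings."""
    return [
        _norm(_sub(trajectory[i], trajectory[i - 1]))
        for i in range(1, len(trajectory))
    ]


def curvature(trajectory: Sequence[Vector]) -> List[float]:
    """Turning angle in radians between consecutive steps of the trajectory."""
    steps = [_sub(trajectory[i], trajectory[i - 1]) for i in range(1, len(trajectory))]
    angles = []
    for prev, cur in zip(steps, steps[1:]):
        if _norm(prev) == 0 or _norm(cur) == 0:
            angles.append(0.0)  # no movement, so no measurable turn
            continue
        # Clamp to [-1, 1] to guard acos against floating-point overshoot.
        c = max(-1.0, min(1.0, cosine_similarity(prev, cur)))
        angles.append(math.acos(c))
    return angles


def first_alert(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first score above the threshold, or None if none exceeds it."""
    for i, score in enumerate(scores):
        if score > threshold:
            return i
    return None


if __name__ == "__main__":
    # Toy 2-D trajectory: two turns along the persona direction, then a sharp turn.
    persona = [1.0, 0.0]
    turns = [[1.0, 0.0], [2.0, 0.0], [2.0, 1.0]]
    print(cosine_drift(turns, persona))
    print(velocity(turns))
    print(curvature(turns))
    print(first_alert(cosine_drift(turns, persona), threshold=0.1))
```

In this sketch an early warning would come from thresholding any of the three score lists per turn. How the thresholds are chosen, and how the signals are combined, is left open here.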
Limitations & open questions
The method's performance may vary across different types of persona drift patterns.
Further research is needed to improve detection accuracy in real-world applications.