This paper presents a theoretical analysis of token-frequency bias in policy gradient methods used for clinical text generation. It formalizes the bias mechanism, proves that these methods inherently amplify high-frequency tokens, and proposes Frequency-Aware Policy Optimization (FAPO) to reduce the bias and improve medical terminology recall.
Key findings
Policy gradient methods exhibit a systematic bias towards high-frequency tokens.
Standard REINFORCE and PPO objectives suppress rare but clinically relevant terms.
Frequency-Aware Policy Optimization (FAPO) reduces frequency bias by 34% and improves medical terminology recall by 28%.
The theoretical framework reveals that bias magnitude scales with the inverse of token frequency and advantage function variance.
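To make the idea of frequency awareness concrete, the sketch below applies a generic inverse-frequency weight to a per-token REINFORCE surrogate loss. It is not the FAPO objective, which this summary does not specify. The function names (inverse_frequency_weights, weighted_reinforce_loss), the smoothing parameter, and the toy corpus are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def inverse_frequency_weights(corpus_tokens, smoothing=1.0):
    """Hypothetical helper: weight each token by 1 / (count + smoothing),
    rescaled so the weights average to 1 over the observed vocabulary."""
    counts = Counter(corpus_tokens)
    raw = {tok: 1.0 / (n + smoothing) for tok, n in counts.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {tok: w / mean_raw for tok, w in raw.items()}


def weighted_reinforce_loss(tokens, log_probs, advantages, weights,
                            default_weight=1.0):
    """Sequence-averaged surrogate loss: -w(token) * advantage * log_prob.

    With an empty `weights` dict every token gets `default_weight`, which
    recovers the ordinary (unweighted) REINFORCE surrogate.
    This is an illustrative reweighting, NOT the paper's FAPO objective.
    """
    if not (len(tokens) == len(log_probs) == len(advantages)):
        raise ValueError("tokens, log_probs and advantages must have equal length")
    total = 0.0
    for tok, lp, adv in zip(tokens, log_probs, advantages):
        total += -weights.get(tok, default_weight) * adv * lp
    return total / len(tokens)


# Toy corpus (assumed data): one frequent word, one mid-frequency word,
# and one rare clinical term.
corpus = ["the"] * 8 + ["patient"] * 3 + ["tachycardia"]
weights = inverse_frequency_weights(corpus, smoothing=0.0)

# One sampled sequence with per-token log-probabilities and advantages.
tokens = ["the", "tachycardia"]
log_probs = [-0.1, -2.0]
advantages = [1.0, 1.0]

unweighted = weighted_reinforce_loss(tokens, log_probs, advantages, {})
weighted = weighted_reinforce_loss(tokens, log_probs, advantages, weights)
```

In this toy setup the rare term receives a larger weight than the frequent word, so its contribution to the loss (and hence to the gradient) grows relative to the unweighted case. Whether FAPO uses a weighting of this form is an assumption of the sketch, not something stated above.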
Limitations & open questions
The study focuses on clinical text generation and may not generalize to other domains.
Further research is needed to evaluate FAPO's effectiveness across different language models and datasets.