This paper introduces HIPO, a constrained reinforcement learning framework that models instruction hierarchies so that high-level instructions remain in force across extended conversations. HIPO formalizes multi-turn dialogue as a Constrained Markov Decision Process with hierarchical costs and uses a two-level policy structure, substantially reducing instruction violation rates.
Key findings
HIPO reduces instruction hierarchy violations by 67% while maintaining competitive task success rates.
The framework formalizes multi-turn dialogue as a Hierarchically-Constrained Markov Decision Process (HCMDP); a sketch of the constrained objective follows this list.
It introduces a two-level policy architecture for explicit constraint management without compromising fluency (illustrated below).
It extends DPO to optimize over complete dialogue trajectories, capturing long-term instruction adherence (see the loss sketch below).
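As a rough guide to the HCMDP formalization, here is a minimal sketch of the constrained objective, assuming a standard multi-cost CMDP setup; the notation (one cost function c_k and budget d_k per hierarchy level k) is ours, not necessarily the paper's:

```latex
\max_{\pi}\ \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, c_k(s_t, a_t)\right] \le d_k,
\qquad k = 1, \dots, K
```

Under this reading, lower indices k would correspond to higher-priority tiers (e.g., system-level over user-level instructions), so a tighter budget d_k encodes stricter adherence at that level.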
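The summary does not detail the two-level architecture, so the following Python sketch is purely illustrative: a hypothetical high-level controller scores which hierarchy constraints are active, and its signal conditions a low-level response generator. All class and parameter names here are assumptions.

```python
import torch
import torch.nn as nn

class TwoLevelPolicy(nn.Module):
    """Illustrative two-level policy: a high-level controller estimates
    per-constraint activations, and its output conditions a low-level
    response generator. Hypothetical, not the paper's implementation."""

    def __init__(self, hidden_dim: int, num_constraints: int, vocab_size: int):
        super().__init__()
        # High-level: maps a dialogue-state summary to constraint activations.
        self.controller = nn.Linear(hidden_dim, num_constraints)
        # Low-level: produces next-token logits from the state concatenated
        # with the controller's constraint signal.
        self.generator = nn.Linear(hidden_dim + num_constraints, vocab_size)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, hidden_dim) encoding of the dialogue so far.
        constraint_signal = torch.sigmoid(self.controller(state))
        conditioned = torch.cat([state, constraint_signal], dim=-1)
        return self.generator(conditioned)  # (batch, vocab_size) logits
```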
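For the trajectory-level DPO extension, a minimal PyTorch sketch follows, assuming per-turn log-probabilities are summed over the whole dialogue before applying the standard DPO logistic loss; the function name and tensor layout are hypothetical:

```python
import torch.nn.functional as F

def trajectory_dpo_loss(logp_chosen, logp_rejected,
                        ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Trajectory-level DPO sketch: each argument is a (batch, num_turns)
    tensor of summed token log-probabilities per turn, under either the
    policy or the frozen reference model."""
    # Aggregate over turns so the preference applies to full trajectories.
    pi_margin = logp_chosen.sum(-1) - logp_rejected.sum(-1)
    ref_margin = ref_logp_chosen.sum(-1) - ref_logp_rejected.sum(-1)
    # Standard DPO logistic loss on the trajectory-level log-ratio margin.
    return -F.logsigmoid(beta * (pi_margin - ref_margin)).mean()
```

Summing over turns is what lets the objective reward adherence that only pays off late in a conversation, rather than scoring each turn in isolation.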
Limitations & open questions
The paper does not discuss the scalability of HIPO to larger or more complex instruction sets.
Further research is needed to evaluate HIPO's performance in real-world multi-turn dialogue scenarios.