This paper proposes a separation-of-power architecture for LLM multi-agent systems to mitigate self-approval bias and improve safety. It introduces an adversarial verifier agent that actively probes executors for unsafe behavior. Benchmarking on simulated tasks shows a significant reduction in self-approval bypass rates.
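To make the architecture concrete, here is a minimal sketch of the separation-of-power pattern: the executor can only propose actions, while an independent verifier holds sole approval authority and additionally probes the executor with unsafe bait tasks. All class names, the keyword-based safety check, and the probe task are illustrative assumptions, not details from the paper.

```python
# Illustrative keyword blocklist standing in for a real safety policy.
UNSAFE_KEYWORDS = {"rm -rf", "disable_safety", "exfiltrate"}

class Executor:
    """Proposes actions for a task; has no approval authority."""
    def propose(self, task: str) -> str:
        return f"run: {task}"

class AdversarialVerifier:
    """Separate agent with sole approval power; also actively
    probes the executor with known-unsafe bait tasks."""
    def approve(self, proposal: str) -> bool:
        return not any(k in proposal for k in UNSAFE_KEYWORDS)

    def probe(self, executor: Executor) -> bool:
        # Send a known-unsafe bait task; because approval is held
        # by this agent, the executor cannot self-approve the result.
        bait = "disable_safety checks"
        proposal = executor.propose(bait)
        return self.approve(proposal)  # False means the probe was caught

def run_task(executor: Executor, verifier: AdversarialVerifier, task: str) -> str:
    """Executor proposes; verifier decides. No self-approval path exists."""
    proposal = executor.propose(task)
    if verifier.approve(proposal):
        return f"executed: {proposal}"
    return "rejected"
```

The key design point is that `Executor` exposes no approval method at all, so a bypass would require compromising the verifier, not merely biasing the executor's self-assessment.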
Key findings
Separation-of-power architectures reduce self-approval bypass rates by up to 83.1 percentage points.
Adversarial verification maintains task completion rates comparable to baselines without verification.
Architectures with adversarial verification achieve 0% bypass rates.
Limitations & open questions
Synthetic data used for evaluation may not reflect real-world LLM behavior.
Limited task scope; additional domains may show different dynamics.
Adversarial verifier uses static probes; adaptive adversaries could evade detection.