Adaptive Token Budgeting for Redundancy-Aware RAG

ABSTRACT

This paper presents ATRCS (Adaptive Token Budgeting with Redundancy-aware Context Selection), a framework that addresses context redundancy in retrieval-augmented generation systems. The method introduces a set-level objective function that explicitly trades off query-chunk relevance against intra-set redundancy under strict token budget constraints. A key innovation is the instance-adaptive calibration mechanism that dynamically adjusts the relevance-redundancy trade-off parameter based on candidate pool statistics, eliminating manual tuning. Theoretical analysis proves the objective exhibits ϵ-approximate submodularity, yielding a (1−1/e−ϵ) approximation guarantee for the greedy selection algorithm. Experiments on HotpotQA, Natural Questions, and TriviaQA demonstrate 47% redundancy reduction, 3.2-point accuracy improvement, and 2.3× faster inference compared to uncompressed baselines.

PAPER · PDF

v1_draft.pdf ↓ Download PDF

Loading PDF...

↓ View full paper PDF →

Key findings

ATRCS reduces context redundancy by 47% compared to strong baselines while simultaneously improving answer accuracy by 3.2 points on question answering benchmarks.

The instance-adaptive calibration mechanism automatically adjusts the relevance-redundancy trade-off based on candidate pool statistics and available token budget, eliminating the need for manual parameter tuning required by MMR.

The objective function exhibits ϵ-approximate submodularity under realistic embedding similarity conditions, providing a (1−1/e−ϵ) theoretical approximation guarantee for greedy selection under knapsack constraints.

The method achieves 2.3× faster end-to-end inference than uncompressed RAG while maintaining or improving generation quality through efficient token utilization.

Standard top-k retrieval wastes precious token budget on redundant or near-duplicate chunks, which degrades multi-hop reasoning capabilities and can lead to hallucinations.

Limitations & open questions

The theoretical guarantees assume specific conditions on embedding similarity distributions that may not hold uniformly across all embedding models or domains.

The greedy selection algorithm, while computationally efficient, provides only an approximation guarantee rather than the global optimum achievable by more expensive methods like DPP.

Evaluation is limited to question answering datasets (HotpotQA, Natural Questions, TriviaQA); performance on other RAG applications such as long-form summarization or code generation remains unverified.

Adaptive Token Budgeting for Redundancy-Aware RAG

Key findings

Limitations & open questions

Related Papers