This paper introduces Conservative Q-Learning with Adaptive Uncertainty Quantification (CQL-AUQ) to address the non-stationarity that fading channels cause in wireless communication systems. CQL-AUQ disentangles aleatoric and epistemic uncertainties using deep ensembles and introduces an adaptive conservative penalty that scales with the estimated epistemic uncertainty. The paper's theoretical analysis shows that this yields safe policy improvement with bounded regret.
Key findings
CQL-AUQ addresses non-stationarity in wireless channels through principled uncertainty estimation and adaptive conservative value learning.
The approach disentangles aleatoric and epistemic uncertainties, enabling the agent to distinguish between inherent environmental stochasticity and knowledge gaps.
An adaptive conservative penalty scales with estimated epistemic uncertainty, so the agent is more conservative in channel states where its knowledge is limited and less so where it is well informed.
Theoretical analysis shows CQL-AUQ achieves safe policy improvement with bounded regret under non-stationary fading dynamics.
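The findings above do not give the paper's exact formulas, so the following is only a minimal sketch of how the two mechanisms could fit together. It assumes each ensemble member predicts a mean and a variance for a Q-value, uses the common deep-ensemble decomposition (aleatoric = average predicted variance, epistemic = variance of the predicted means), and applies the standard CQL regularizer. The function names, the linear scaling rule in `adaptive_alpha`, and the parameters `alpha_base` and `kappa` are hypothetical and are not taken from the paper.

```python
import math
from statistics import fmean, pvariance


def decompose_uncertainty(means, variances):
    """Split predictive uncertainty for one (state, action) pair.

    means[i] and variances[i] are ensemble member i's predicted
    mean and variance of the Q-value.
    """
    aleatoric = fmean(variances)  # average noise the members predict
    epistemic = pvariance(means)  # disagreement between the members
    return aleatoric, epistemic


def adaptive_alpha(epistemic, alpha_base=1.0, kappa=5.0):
    """Hypothetical rule: penalty weight grows linearly with epistemic uncertainty."""
    return alpha_base * (1.0 + kappa * epistemic)


def cql_penalty(q_values, data_action):
    """Standard CQL regularizer for one state: logsumexp_a Q(s, a) - Q(s, a_data)."""
    m = max(q_values)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(q - m) for q in q_values))
    return lse - q_values[data_action]


# Two channel states with the same predicted noise but different ensemble agreement.
familiar = decompose_uncertainty([1.0, 1.0, 1.0, 1.0], [0.2, 0.2, 0.2, 0.2])
novel = decompose_uncertainty([0.0, 2.0, 0.0, 2.0], [0.2, 0.2, 0.2, 0.2])

# The conservative term is weighted more heavily where the ensemble disagrees.
penalty = cql_penalty([0.5, 1.5, 1.0], data_action=1)
weighted_familiar = adaptive_alpha(familiar[1]) * penalty
weighted_novel = adaptive_alpha(novel[1]) * penalty
```

In this toy case both states have the same aleatoric estimate, but only the state where the members disagree receives a larger penalty weight, which is the behavior the adaptive penalty is described as providing.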
Limitations & open questions
The paper does not discuss the computational complexity of the proposed CQL-AUQ framework.
The effectiveness of CQL-AUQ has yet to be empirically validated on real-world wireless systems.