This paper analyzes SageBwd, a trainable INT8 attention mechanism, in the setting of long-context video pretraining. It identifies QK-Normalization as crucial for training stability and characterizes how sensitive the attention gradients are to quantization. The study proposes a method design for video transformers and validates it through extensive experiments, showing that low-bit attention can match full-precision performance with significant efficiency gains.
Key findings
QK-Norm is necessary for stable training at large tokens-per-step.
The softmax gradient is the primary quantization bottleneck.
Gradient noise from smaller batch sizes can mask quantization error.
K-smoothing remains essential, while Q-smoothing provides limited benefit during pretraining.
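The intuition behind K-smoothing can be sketched in a few lines of NumPy: when the key matrix has a channel-wide outlier, a symmetric INT8 scale must cover that outlier, wasting precision on every other channel. Subtracting the per-channel mean of K before quantizing removes the shared outlier, and because the resulting shift of QKᵀ is constant along each row, the softmax output is unchanged. This is a simplified per-tensor sketch, not SageBwd's actual kernel (which quantizes at finer, per-block granularity); the toy shapes, seed, and outlier magnitude are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)  # row-shift invariant
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def int8_roundtrip(x):
    # Symmetric per-tensor INT8 quantize -> dequantize (a simplification;
    # SageBwd uses finer per-block scales).
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

# Toy Q, K with a shared outlier channel in K -- the case K-smoothing targets.
q = rng.standard_normal((8, 16)).astype(np.float32)
k = rng.standard_normal((8, 16)).astype(np.float32)
k[:, 0] += 10.0  # one outlier channel inflates the whole INT8 scale

p_exact = softmax(q @ k.T)

# Naive: quantize K directly; the outlier dominates the scale.
p_naive = softmax(q @ int8_roundtrip(k).T)

# K-smoothing: subtract the per-channel mean over tokens before quantizing.
# The removed component shifts each row of QK^T by a constant, so softmax
# is unaffected and the mean never needs to be added back for P.
k_mean = k.mean(axis=0, keepdims=True)
p_smooth = softmax(q @ int8_roundtrip(k - k_mean).T)

err_naive = np.abs(p_naive - p_exact).max()
err_smooth = np.abs(p_smooth - p_exact).max()
```

Under this setup the smoothed variant lands much closer to the full-precision attention weights, which mirrors the finding above that K-smoothing remains essential.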
Limitations & open questions
The study focuses on video pretraining and may not generalize to other domains.
The performance gap between SageBwd and full-precision attention when pretraining on language data remains an open challenge.