This paper analyzes SageBwd, a trainable INT8 attention mechanism, in the setting of long-context video pretraining. It identifies QK-Normalization as crucial for training stability and characterizes how sensitive the attention gradients are to quantization. The study proposes a method design for video transformers and validates it through extensive experiments, showing that low-bit attention can match full-precision performance with significant efficiency gains.
Key findings
QK-Norm is necessary for stable training at large tokens-per-step.
The softmax gradient is the primary quantization bottleneck.
Gradient noise from smaller batch sizes can mask quantization error.
K-smoothing remains essential, while Q-smoothing provides limited benefit during pretraining.
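The intuition behind K-smoothing can be sketched in a few lines of NumPy: when the key matrix has a channel-wide outlier, a symmetric INT8 scale must cover that outlier, wasting precision on every other channel. Subtracting the per-channel mean of K before quantizing removes the shared outlier, and because the resulting shift of QKᵀ is constant along each row, the softmax output is unchanged. This is a simplified per-tensor sketch, not SageBwd's actual kernel (which quantizes at finer, per-block granularity); the toy shapes, seed, and outlier magnitude are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)  # row-shift invariant
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def int8_roundtrip(x):
    # Symmetric per-tensor INT8 quantize -> dequantize (a simplification;
    # SageBwd uses finer per-block scales).
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

# Toy Q, K with a shared outlier channel in K -- the case K-smoothing targets.
q = rng.standard_normal((8, 16)).astype(np.float32)
k = rng.standard_normal((8, 16)).astype(np.float32)
k[:, 0] += 10.0  # one outlier channel inflates the whole INT8 scale

p_exact = softmax(q @ k.T)

# Naive: quantize K directly; the outlier dominates the scale.
p_naive = softmax(q @ int8_roundtrip(k).T)

# K-smoothing: subtract the per-channel mean over tokens before quantizing.
# The removed component shifts each row of QK^T by a constant, so softmax
# is unaffected and the mean never needs to be added back for P.
k_mean = k.mean(axis=0, keepdims=True)
p_smooth = softmax(q @ int8_roundtrip(k - k_mean).T)

err_naive = np.abs(p_naive - p_exact).max()
err_smooth = np.abs(p_smooth - p_exact).max()
```

Under this setup the smoothed variant lands much closer to the full-precision attention weights, which mirrors the finding above that K-smoothing remains essential.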
Limitations & open questions
The study focuses on video pretraining and may not generalize to other domains.
The performance gap between SageBwd and full-precision attention when pretraining on language data remains an open challenge.