This research addresses the challenge of distributing importance scores across both spatial and temporal dimensions in video understanding while preserving the model's reasoning structure. The paper introduces Temporal-USU (T-USU), a method that formulates temporal attribution as a mass redistribution problem over time segments, governed by the model's temporal reasoning patterns. T-USU reports a 2–3x improvement in temporal faithfulness metrics over baseline interpolation methods.
Key findings
T-USU redistributes attribution mass across temporal segments proportionally to their relevance scores.
T-USU preserves total importance while respecting temporal structure.
T-USU achieves a 2–3x improvement in temporal faithfulness metrics over baseline interpolation methods.
T-USU enables temporally coherent explanations that reveal when models attend to critical action phases.
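The first two findings (proportional redistribution and conservation of total importance) can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `redistribute_mass`, the uniform fallback for all-zero scores, and the example numbers are assumptions made for illustration, and the sketch assumes non-negative relevance scores.

```python
from typing import List, Sequence


def redistribute_mass(total_mass: float, relevance: Sequence[float]) -> List[float]:
    """Split total_mass across time segments in proportion to relevance.

    Hypothetical helper, not taken from the paper. Assumes every relevance
    score is non-negative.
    """
    if any(r < 0 for r in relevance):
        raise ValueError("relevance scores must be non-negative")
    if not relevance:
        return []
    score_sum = sum(relevance)
    if score_sum == 0:
        # Assumption: with no preferred segment, spread the mass uniformly.
        return [total_mass / len(relevance)] * len(relevance)
    # Each segment receives its share of the mass: total * r_t / sum(r).
    return [total_mass * r / score_sum for r in relevance]


# Hypothetical example: a coarse attribution of 2.0 spread over three segments.
segment_relevance = [1.0, 3.0, 4.0]
fine_attribution = redistribute_mass(2.0, segment_relevance)

# The total importance is preserved after redistribution.
assert abs(sum(fine_attribution) - 2.0) < 1e-9
```

Normalising by the sum of the scores is what keeps the total unchanged: the per-segment values differ, but they always add up to the mass that was redistributed.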
Limitations & open questions
The evaluation is limited to controlled synthetic tasks and standard video benchmarks, so the results may not carry over to the full complexity of real-world video.