This paper introduces GIRT-X, a framework that extends GRIT to evaluate multimodal responses combining natural language, visual grounding, and executable code with execution traces. Its contributions are a unified representation for hybrid responses, a multimodal execution environment, a novel evaluation protocol, and GIRT-X-1M, a large-scale, hierarchically structured dataset.
Key findings
GIRT-X integrates code execution traces as a core modality in multimodal responses.
A hybrid representation is introduced for seamless integration of text, region coordinates, and executable code.
A sandboxed execution environment captures detailed execution traces for analysis.
A multi-dimensional evaluation framework assesses correctness across text, visual grounding, code, and execution traces.
The GIRT-X-1M dataset extends GRIT's hierarchical structure to include code execution tasks with rich annotations.
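The summary does not give the paper's actual schema or sandbox design, so the following is only a minimal sketch of what a hybrid response and a line-level execution trace could look like. The names `Region`, `HybridResponse`, and `run_with_trace` are hypothetical, as are the normalized box coordinates. The code uses Python's standard `sys.settrace` hook to record variable snapshots. It is not a real sandbox: `exec` provides no isolation.

```python
import sys
from dataclasses import dataclass, field


@dataclass
class Region:
    # Hypothetical bounding box; normalized [0, 1] coordinates are an assumption.
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class HybridResponse:
    # One response carrying all three modalities named in the summary.
    text: str
    regions: list[Region] = field(default_factory=list)
    code: str = ""


def run_with_trace(code: str) -> tuple[dict, list[tuple[int, dict]]]:
    """Run `code` and record (line number, variables visible before that line).

    Illustration only: exec() gives no isolation, so this is not a sandbox.
    """
    trace = []

    def tracer(frame, event, arg):
        # Only follow frames that belong to the response code itself.
        if frame.f_code.co_filename != "<response>":
            return None
        if event == "line":
            snapshot = {
                name: value
                for name, value in frame.f_locals.items()
                if not name.startswith("__")
            }
            trace.append((frame.f_lineno, snapshot))
        return tracer

    namespace = {}
    compiled = compile(code, "<response>", "exec")
    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        exec(compiled, namespace)
    finally:
        # Always restore whatever tracer was active before.
        sys.settrace(previous)
    namespace.pop("__builtins__", None)
    return namespace, trace


response = HybridResponse(
    text="The object in the marked region is counted and the count is tripled.",
    regions=[Region(0.1, 0.2, 0.5, 0.6)],
    code="a = 2\nb = a * 3\n",
)
final_state, steps = run_with_trace(response.code)
```

Here `final_state` holds the variables left behind by the code, and `steps` holds one entry per executed line, which is the kind of record an evaluator could compare against a reference trace.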
Limitations & open questions
The paper does not discuss potential limitations or failure modes in detail.