GIRT-X: Extending Grounded Instruction Tuning to Multi-Modal Responses with Code Execution Traces

ABSTRACT

This paper introduces GIRT-X, a framework that extends GRIT to evaluate multimodal responses combining natural language, visual grounding, and executable code with execution traces. Its contributions are a unified representation for hybrid responses, a multimodal execution environment, a novel evaluation protocol, and GIRT-X-1M, a large-scale dataset with rich hierarchical knowledge.

Key findings

GIRT-X integrates code execution traces as a core modality in multimodal responses.

A hybrid representation is introduced for seamless integration of text, region coordinates, and executable code.

A sandboxed execution environment captures detailed execution traces for analysis.

A multi-dimensional evaluation framework assesses correctness across text, visual grounding, code, and execution traces.

The GIRT-X-1M dataset extends GRIT's hierarchical structure to include code execution tasks with rich annotations.
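
To make the findings above concrete, the following is a minimal Python sketch of how a hybrid response, trace capture, and per-dimension scoring could fit together. It is not the paper's implementation. Every name here (TextSegment, RegionSegment, CodeSegment, run_with_trace, score_pair), the normalized box convention, and the exact-match and IoU metrics are assumptions made for illustration. In particular, exec() runs in-process and is not a sandbox; a real environment would need process or container isolation.

```python
import contextlib
import io
import sys
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

Box = Tuple[float, float, float, float]  # assumed: (x_min, y_min, x_max, y_max) in [0, 1]


@dataclass
class TextSegment:
    text: str


@dataclass
class RegionSegment:
    box: Box
    label: str


@dataclass
class CodeSegment:
    source: str
    stdout: str = ""
    error: str = ""
    executed_lines: List[int] = field(default_factory=list)


Segment = Union[TextSegment, RegionSegment, CodeSegment]


def run_with_trace(segment: CodeSegment) -> CodeSegment:
    """Run segment.source, recording stdout, any error, and executed line numbers.

    Illustration only: exec() offers no isolation and is NOT a sandbox.
    """
    filename = "<segment>"
    executed: List[int] = []

    def tracer(frame, event, arg):
        # Record line events only for frames that come from the segment's code.
        if event == "line" and frame.f_code.co_filename == filename:
            executed.append(frame.f_lineno)
        return tracer

    try:
        compiled = compile(segment.source, filename, "exec")
    except SyntaxError as exc:
        segment.error = f"SyntaxError: {exc}"
        return segment

    buffer = io.StringIO()
    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        with contextlib.redirect_stdout(buffer):
            exec(compiled, {"__name__": "__segment__"})
    except Exception as exc:
        segment.error = f"{type(exc).__name__}: {exc}"
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active before

    segment.stdout = buffer.getvalue()
    segment.executed_lines = executed
    return segment


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def score_pair(pred: Segment, ref: Segment) -> Dict[str, float]:
    """Score one aligned (prediction, reference) pair per dimension (placeholder metrics)."""
    if isinstance(pred, TextSegment) and isinstance(ref, TextSegment):
        return {"text": float(pred.text.strip() == ref.text.strip())}
    if isinstance(pred, RegionSegment) and isinstance(ref, RegionSegment):
        return {"grounding": iou(pred.box, ref.box)}
    if isinstance(pred, CodeSegment) and isinstance(ref, CodeSegment):
        return {
            "code": float(not pred.error and pred.stdout == ref.stdout),
            "trace": float(pred.executed_lines == ref.executed_lines),
        }
    return {}  # mismatched segment types contribute no score
```

A response would then be a list of Segment values in reading order, with each CodeSegment passed through run_with_trace before being compared against a reference segment with score_pair.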

Limitations & open questions

The paper does not discuss potential limitations or failure modes in detail.
