This paper proposes HybridToken, a framework that uses hybrid discrete-continuous token representations for seamless joint text and image generation. The approach decouples semantic content modeling from perceptual detail synthesis by maintaining parallel discrete and continuous token streams, coupled through cross-modal attention. The framework comprises a Dual-Stream Tokenizer and a Hybrid Transformer architecture that employs modality-aware attention to dynamically balance discrete and continuous processing based on token type. Experiments demonstrate state-of-the-art performance on multimodal benchmarks, with significant improvements in text-image alignment, image quality, and cross-modal consistency over discrete-only and continuous-only baselines.
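To make the dual-stream idea concrete, here is a minimal PyTorch sketch of how parallel discrete and continuous token streams could be produced from shared encoder features. This is an illustration under assumptions, not the authors' implementation: the VQ-style codebook lookup, the linear projection, and every name and dimension (DualStreamTokenizer, codebook_size, cont_dim) are hypothetical.

```python
# Minimal sketch of a dual-stream tokenizer (illustrative only; all names
# and design choices here are assumptions, not the paper's implementation).
import torch
import torch.nn as nn


class DualStreamTokenizer(nn.Module):
    """Maps encoder features to parallel discrete and continuous token streams."""

    def __init__(self, feat_dim=512, codebook_size=8192, cont_dim=256):
        super().__init__()
        # Discrete stream: VQ-style codebook for semantic content (assumed).
        self.codebook = nn.Embedding(codebook_size, feat_dim)
        # Continuous stream: projection preserving perceptual detail (assumed).
        self.to_continuous = nn.Linear(feat_dim, cont_dim)

    def forward(self, feats):  # feats: (batch, num_tokens, feat_dim)
        # Nearest-codebook-entry lookup yields discrete token ids.
        dists = (feats.unsqueeze(-2) - self.codebook.weight).pow(2).sum(dim=-1)
        ids = dists.argmin(dim=-1)                  # (B, N) discrete ids
        discrete = self.codebook(ids)               # quantized embeddings
        # Straight-through estimator so gradients still reach the encoder.
        discrete = feats + (discrete - feats).detach()
        continuous = self.to_continuous(feats)      # (B, N, cont_dim)
        return ids, discrete, continuous
```

In this reading, the discrete ids carry semantic content for structural modeling while the continuous projection retains fine perceptual detail, matching the decoupling described above.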
Key findings
HybridToken achieves state-of-the-art performance on multimodal benchmarks.
Significant improvements in text-image alignment, image quality, and cross-modal consistency over discrete-only and continuous-only baselines.
Dual-Stream Tokenization enables simultaneous structural coherence (via the discrete stream) and perceptual detail preservation (via the continuous stream).
Hybrid Transformer architecture processes discrete and continuous tokens through modality-aware attention, dynamically balancing the two streams based on token type (see the sketch after this list).
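To illustrate what modality-aware attention could look like, the sketch below adds a learned token-type embedding before standard self-attention and gates two parallel feed-forward paths per token. This is a hedged reconstruction under assumptions, not the paper's architecture: the gating scheme and all names (HybridBlock, type_gate, ff_discrete, ff_continuous) are hypothetical.

```python
# Minimal sketch of a modality-aware transformer block (illustrative only;
# the gating scheme and all names are assumptions, not the paper's design).
import torch
import torch.nn as nn


class HybridBlock(nn.Module):
    """Self-attention over mixed discrete/continuous tokens, gated by token type."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Token-type embedding: 0 = discrete token, 1 = continuous token (assumed).
        self.type_embed = nn.Embedding(2, dim)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Two parallel feed-forward paths, one per modality (assumed).
        self.ff_discrete = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                         nn.Linear(4 * dim, dim))
        self.ff_continuous = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                           nn.Linear(4 * dim, dim))
        # Gate that dynamically balances the two paths per token (assumed).
        self.type_gate = nn.Linear(dim, 1)

    def forward(self, x, token_type):  # x: (B, N, dim); token_type: (B, N) in {0, 1}
        # Make attention modality-aware by injecting token-type information.
        h = x + self.type_embed(token_type)
        q = self.norm1(h)
        a, _ = self.attn(q, q, q)
        h = h + a
        # Per-token gate in [0, 1] blends discrete- and continuous-path outputs.
        g = torch.sigmoid(self.type_gate(self.type_embed(token_type)))
        h2 = self.norm2(h)
        return h + g * self.ff_discrete(h2) + (1 - g) * self.ff_continuous(h2)
```

The per-token gate is one plausible way to realize "dynamically balancing discrete and continuous processing based on token type": discrete tokens can lean on the discrete path and continuous tokens on the continuous one, while self-attention still mixes information across both streams.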
Limitations & open questions
Further research is needed to scale the model to handle more complex multimodal interactions.