This paper proposes HybridToken, a framework that uses hybrid discrete-continuous token representations for seamless joint text and image generation. The approach decouples semantic content modeling from perceptual detail synthesis by maintaining parallel discrete and continuous token streams, coupled through cross-modal attention. The framework comprises a Dual-Stream Tokenizer and a Hybrid Transformer architecture that employs modality-aware attention to dynamically balance discrete and continuous processing based on token type. Experiments demonstrate state-of-the-art performance on multimodal benchmarks, with significant improvements in text-image alignment, image quality, and cross-modal consistency over discrete-only and continuous-only baselines.
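To make the dual-stream idea concrete, here is a minimal PyTorch sketch of how parallel discrete and continuous token streams could be produced from shared encoder features. This is an illustration under assumptions, not the authors' implementation: the VQ-style codebook lookup, the linear projection, and every name and dimension (DualStreamTokenizer, codebook_size, cont_dim) are hypothetical.

```python
# Minimal sketch of a dual-stream tokenizer (illustrative only; all names
# and design choices here are assumptions, not the paper's implementation).
import torch
import torch.nn as nn


class DualStreamTokenizer(nn.Module):
    """Maps encoder features to parallel discrete and continuous token streams."""

    def __init__(self, feat_dim=512, codebook_size=8192, cont_dim=256):
        super().__init__()
        # Discrete stream: VQ-style codebook for semantic content (assumed).
        self.codebook = nn.Embedding(codebook_size, feat_dim)
        # Continuous stream: projection preserving perceptual detail (assumed).
        self.to_continuous = nn.Linear(feat_dim, cont_dim)

    def forward(self, feats):  # feats: (batch, num_tokens, feat_dim)
        # Nearest-codebook-entry lookup yields discrete token ids.
        dists = (feats.unsqueeze(-2) - self.codebook.weight).pow(2).sum(dim=-1)
        ids = dists.argmin(dim=-1)                  # (B, N) discrete ids
        discrete = self.codebook(ids)               # quantized embeddings
        # Straight-through estimator so gradients still reach the encoder.
        discrete = feats + (discrete - feats).detach()
        continuous = self.to_continuous(feats)      # (B, N, cont_dim)
        return ids, discrete, continuous
```

In this reading, the discrete ids carry semantic content for structural modeling while the continuous projection retains fine perceptual detail, matching the decoupling described above.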
Key findings
HybridToken achieves state-of-the-art performance on multimodal benchmarks.
Significant improvements in text-image alignment, image quality, and cross-modal consistency over discrete-only and continuous-only baselines.
Dual-Stream Tokenization enables simultaneous structural coherence (via the discrete stream) and perceptual detail preservation (via the continuous stream).
Hybrid Transformer architecture processes discrete and continuous tokens through modality-aware attention, dynamically balancing the two streams based on token type (see the sketch after this list).
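To illustrate what modality-aware attention could look like, the sketch below adds a learned token-type embedding before standard self-attention and gates two parallel feed-forward paths per token. This is a hedged reconstruction under assumptions, not the paper's architecture: the gating scheme and all names (HybridBlock, type_gate, ff_discrete, ff_continuous) are hypothetical.

```python
# Minimal sketch of a modality-aware transformer block (illustrative only;
# the gating scheme and all names are assumptions, not the paper's design).
import torch
import torch.nn as nn


class HybridBlock(nn.Module):
    """Self-attention over mixed discrete/continuous tokens, gated by token type."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Token-type embedding: 0 = discrete token, 1 = continuous token (assumed).
        self.type_embed = nn.Embedding(2, dim)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Two parallel feed-forward paths, one per modality (assumed).
        self.ff_discrete = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                         nn.Linear(4 * dim, dim))
        self.ff_continuous = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                           nn.Linear(4 * dim, dim))
        # Gate that dynamically balances the two paths per token (assumed).
        self.type_gate = nn.Linear(dim, 1)

    def forward(self, x, token_type):  # x: (B, N, dim); token_type: (B, N) in {0, 1}
        # Make attention modality-aware by injecting token-type information.
        h = x + self.type_embed(token_type)
        q = self.norm1(h)
        a, _ = self.attn(q, q, q)
        h = h + a
        # Per-token gate in [0, 1] blends discrete- and continuous-path outputs.
        g = torch.sigmoid(self.type_gate(self.type_embed(token_type)))
        h2 = self.norm2(h)
        return h + g * self.ff_discrete(h2) + (1 - g) * self.ff_continuous(h2)
```

The per-token gate is one plausible way to realize "dynamically balancing discrete and continuous processing based on token type": discrete tokens can lean on the discrete path and continuous tokens on the continuous one, while self-attention still mixes information across both streams.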
Limitations & open questions
Further research is needed to scale the model to handle more complex multimodal interactions.