This paper presents Dyna3D-GAN, a generative adversarial framework for real-time text-to-3D synthesis using hierarchical prompt encoding and dynamic tri-plane generation. The approach combines 3D Gaussian Splatting with a multi-view discriminator to achieve high-fidelity rendering without per-prompt optimization. Extensive experiments demonstrate 100× faster inference compared to Score Distillation Sampling methods while maintaining geometric consistency and capturing complex textual descriptions including spatial relationships.
Key findings
Dyna3D-GAN achieves 100× faster inference compared to optimization-based SDS methods while maintaining comparable or superior generation quality.
The hierarchical prompt encoder processes text at sentence, phrase, and word levels to capture fine-grained attributes and spatial relationships.
The dynamic tri-plane generator uses cross-attention to allocate computational capacity adaptively according to prompt complexity.
The multi-scale Gaussian decoder leverages 3D Gaussian Splatting for real-time, high-fidelity rendering.
The multi-view discriminator enforces geometric consistency across viewpoints without requiring explicit 3D supervision during training.
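The cross-attention step mentioned for the tri-plane generator can be illustrated with a minimal sketch of scaled dot-product cross-attention, in which plane tokens act as queries and prompt tokens act as keys and values. This is not the paper's implementation: the function names, the 4-dimensional toy vectors, and the use of one token per prompt level (sentence, phrase, word) are all assumptions made for illustration, and a real model would use learned projections and many more tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (no learned projections).

    queries: list of d-dim vectors (hypothetical tri-plane tokens)
    keys, values: lists of d-dim vectors (hypothetical prompt tokens)
    Returns one attended vector per query, plus the attention weights.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(w[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy prompt tokens, one per level; the numbers are made up (assumption).
prompt_tokens = [
    [1.0, 0.0, 0.0, 0.0],  # sentence-level token
    [0.0, 1.0, 0.0, 0.0],  # phrase-level token
    [0.0, 0.0, 1.0, 0.0],  # word-level token
]

# Toy plane tokens acting as queries (assumption).
plane_tokens = [
    [4.0, 0.0, 0.0, 0.0],  # aligned with the sentence-level token
    [0.0, 0.0, 4.0, 0.0],  # aligned with the word-level token
]

attended, attn = cross_attention(plane_tokens, prompt_tokens, prompt_tokens)
```

In this toy setup each plane token places most of its attention on the prompt token it is aligned with, which is the sense in which a generator could direct capacity toward the parts of a prompt that matter for a given region.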
Limitations & open questions
Potential challenges with extremely complex multi-object scenes containing intricate spatial interactions.
Inherent limitations of GAN architectures regarding training stability and mode diversity.
Dependency on the availability of high-quality text-3D paired training data.