Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models

CVPR 2023

Jiale Xu1,3, Xintao Wang1, Weihao Cheng1, Yan-Pei Cao1, Ying Shan1, Xiaohu Qie2, Shenghua Gao3,4,5
1ARC Lab, 2Tencent PCG, 3ShanghaiTech University
4Shanghai Engineering Research Center of Intelligent Vision and Imaging
5Shanghai Engineering Research Center of Energy Efficient and Custom AI IC


Recent CLIP-guided 3D optimization methods, such as DreamFields and PureCLIPNeRF, have achieved impressive results in zero-shot text-guided 3D synthesis. However, because they are trained from scratch with random initialization and no prior knowledge, these methods often fail to generate accurate and faithful 3D structures that conform to the input text. In this paper, we make the first attempt to introduce an explicit 3D shape prior into CLIP-guided 3D optimization. Specifically, in the text-to-shape stage we first generate a high-quality 3D shape from the input text to serve as the 3D shape prior. We then use it to initialize a neural radiance field, which we optimize with the full prompt. For text-to-shape generation, we present a simple yet effective approach that directly bridges the text and image modalities with a powerful text-to-image diffusion model. To narrow the style domain gap between images synthesized by the text-to-image model and the shape renderings used to train the image-to-shape generator, we further propose to jointly optimize a learnable text prompt and fine-tune the text-to-image diffusion model for rendering-style image generation. Our method, Dream3D, is capable of generating imaginative 3D content with better visual quality and shape accuracy than state-of-the-art methods.
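To make the idea of initializing a radiance field from a shape concrete, the following minimal sketch converts a signed distance grid into an initial density grid. The sigmoid-based mapping, the parameter names `alpha` and `beta`, their default values, and the toy sphere standing in for a generated shape are all assumptions made for illustration; the text above does not specify how the shape prior is converted.

```python
import numpy as np


def sphere_sdf(resolution: int = 32, radius: float = 0.5) -> np.ndarray:
    """Toy stand-in for a generated shape: signed distance to a sphere,
    sampled on a regular grid over [-1, 1]^3 (negative inside)."""
    axis = np.linspace(-1.0, 1.0, resolution)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.sqrt(x**2 + y**2 + z**2) - radius


def sdf_to_density(sdf: np.ndarray, alpha: float = 0.05, beta: float = 0.05) -> np.ndarray:
    """Assumed mapping (illustrative only): density = sigmoid(-sdf / beta) / alpha.

    Points well inside the shape get a density close to 1 / alpha,
    points well outside get a density close to 0, and beta controls
    how sharp the transition at the surface is.
    """
    return (1.0 / alpha) / (1.0 + np.exp(sdf / beta))


# Initial density grid that a radiance field could start from
# instead of a random initialization.
initial_density = sdf_to_density(sphere_sdf())
```

With this kind of initialization, the subsequent optimization starts from a field that already occupies the right region of space, so the text guidance mainly has to refine geometry and appearance.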


Overview of our text-guided 3D synthesis framework. (a) In the first stage, text-to-shape, we use a fine-tuned off-the-shelf text-to-image diffusion model $G_I$ to synthesize a rendering-style image $\boldsymbol{I}_r$ from the input text prompt $y$. We then generate a latent shape embedding $\boldsymbol{e}_S$ from $\boldsymbol{I}_r$ using a diffusion-model-based shape embedding generation network $G_M$. Finally, we feed $\boldsymbol{e}_S$ into a high-quality 3D shape generator $G_S$ to synthesize a 3D shape $S$, which we use as an explicit 3D shape prior. (b) In the second stage, optimization, we use the 3D shape prior $S$ to initialize a neural radiance field and further optimize it with CLIP guidance to synthesize 3D content consistent with the input text prompt $y$.
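The data flow in the overview can be sketched as a composition of three callables followed by a text-image similarity objective. This is only an illustrative skeleton: the class `TextToShape`, the stub functions `fake_g_i`, `fake_g_m`, and `fake_g_s`, and the specific loss form (one minus cosine similarity, a common choice for CLIP guidance) are hypothetical and not taken from the description above. In a real system the stubs would be replaced by the trained models $G_I$, $G_M$, and $G_S$.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class TextToShape:
    """Stage (a): prompt y -> image I_r -> shape embedding e_S -> shape S."""
    g_i: Callable[[str], np.ndarray]         # stands in for G_I
    g_m: Callable[[np.ndarray], np.ndarray]  # stands in for G_M
    g_s: Callable[[np.ndarray], np.ndarray]  # stands in for G_S

    def __call__(self, prompt: str) -> np.ndarray:
        image_r = self.g_i(prompt)  # rendering-style image I_r
        e_s = self.g_m(image_r)     # latent shape embedding e_S
        return self.g_s(e_s)        # 3D shape prior S


def clip_guidance_loss(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Assumed stage (b) objective: 1 - cosine similarity between the
    embedding of a rendered view and the embedding of the prompt."""
    a = image_emb / np.linalg.norm(image_emb)
    b = text_emb / np.linalg.norm(text_emb)
    return float(1.0 - a @ b)


# Hypothetical stubs so the skeleton runs without any trained model.
def fake_g_i(prompt: str) -> np.ndarray:
    return np.full((64, 64, 3), len(prompt) / 255.0)


def fake_g_m(image: np.ndarray) -> np.ndarray:
    return image.mean(axis=(0, 1))


def fake_g_s(e_s: np.ndarray) -> np.ndarray:
    return np.ones((16, 16, 16)) * e_s.sum()


stage_a = TextToShape(fake_g_i, fake_g_m, fake_g_s)
shape_prior = stage_a("A minecraft car.")  # would initialize the radiance field
```

Keeping the three generators behind plain callables makes it easy to swap any one of them without touching the rest of the pipeline; the loss would be evaluated on embeddings of views rendered from the radiance field during stage (b).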


"A minecraft car."

"A car from the Cars movie."

"The Iron Throne in Game of Thrones."

"A cabinet designed by van Gogh."

"A lamp imitating sunflower."

"A car is burning."

"A fishing boat floating on the water."

"A chair imitating cactus."

"An aircraft carrier."

Methods compared: CLIP-Mesh, Dream Fields, PureCLIPNeRF, Ours.


    title     = {Dream3d: Zero-shot text-to-3d synthesis using 3d shape prior and text-to-image diffusion models},
    author    = {Xu, Jiale and Wang, Xintao and Cheng, Weihao and Cao, Yan-Pei and Shan, Ying and Qie, Xiaohu and Gao, Shenghua},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages     = {20908--20918},
    year      = {2023}