Table of Links
Abstract and 1 Introduction
2 Background & Problem Statement
2.1 How can we use MLLMs for Diffusion Synthesis that Synergizes both sides?
3 DreamLLM
3.1 End-to-End Interleaved generative Pretraining (I-GPT)
3.2 Model Training
4 Experiments and 4.1 Multimodal Comprehension
4.2 Text-Conditional Image Synthesis
4.3 Multimodal Joint Creation & Comprehension
5 Discussions
5.1 Synergy between creation & Comprehension?
5.2 What is learned by DreamLLM?
6 Related Works
7 Conclusions and References
A Additional Experiments
B Additional Qualitative Examples
C Implementation Details
D Additional Related Works
E Limitations, Failure Cases & Future Works

D ADDITIONAL RELATED WORKS

D.1 LARGE LANGUAGE MODELS

Natural Language Processing (NLP) is experiencing a flourishing era driven by LLMs, with parameter counts growing beyond 100B in line with the scaling law (Kaplan et al., 2020). The GPT series of models, starting with GPT-1 (Radford et al., 2018) and followed by GPT-2 (Radford et al., 2019), made significant advancements in few-shot learning by scaling up the number of parameters to 175 billion in GPT-3 (Brown et al., 2020). This breakthrough attracted considerable attention and paved the way for further research and development in the field.

Since then, researchers have focused on developing LLMs by improving the scaling strategy. Notable efforts include Gopher (Rae et al., 2021), GLaM (Du et al., 2022), FLAN (Wei et al., 2022a), Switch Transformer (Fedus et al., 2022), Chinchilla (Hoffmann et al., 2022), and PaLM (Chowdhery et al., 2022). In addition, instruction-based tuning techniques have been explored to align models with human preferences (Christiano et al., 2017; Ouyang et al., 2022). The success of LLMs has been further solidified by the production release of ChatGPT (OpenAI, 2022) and the highly anticipated GPT-4 (OpenAI, 2023).
Meanwhile, open-source LLMs from the community are making remarkable progress in language capabilities relative to their closed-source counterparts. For example, OPT (Zhang et al., 2022), BLOOM (Scao et al., 2022), GLM (Zeng et al., 2023), LLaMA (Touvron et al., 2023a;b), and Falcon (Penedo et al., 2023) have all attracted great attention and been widely deployed. Other models, such as Alpaca (Taori et al., 2023) and Vicuna (Chiang et al., 2023), attempt to learn from distillation.

D.2 TEXT-CONDITIONAL CONTENT CREATION WITH DIFFUSION MODELS

The recent surge in AI-generated content (AIGC) has been driven primarily by diffusion-based methods, particularly in text-conditional content creation. Saharia et al. (2022) achieved astonishing advances in high-resolution image synthesis through large-scale pretrained language models and cascaded diffusion models (DMs). Another paradigm, exemplified by Stable Diffusion (SD), operates in latent space and demonstrates superior efficiency and performance (Rombach et al., 2022; Ramesh et al., 2022; Peebles & Xie, 2022). Recently, Lian et al. (2023) proposed enhancing reasoning capability by constructing layouts with LLMs.

Motivated by the great success in 2D, a series of works has significantly advanced 3D synthesis (Lin et al., 2023; Wang et al., 2023c) based on Score Distillation Sampling (SDS) (Poole et al., 2023; Wang et al., 2023a), which utilizes pretrained 2D DMs. For text-to-video synthesis, expanding a pretrained spatial U-Net to a spatial-temporal factorized U-Net, trained jointly on image and video data, has yielded significant success (Ho et al., 2022a;b).

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

Authors:
(1) Runpei Dong, Xi’an Jiaotong University and Internship at MEGVII;
(2) Chunrui Han, MEGVII Technology;
(3) Yuang Peng, Tsinghua University and Internship at MEGVII;
(4) Zekun Qi, Xi’an Jiaotong University and Internship at MEGVII;
(5) Zheng Ge, MEGVII Technology;
(6) Jinrong Yang, HUST and Internship at MEGVII;
(7) Liang Zhao, MEGVII Technology;
(8) Jianjian Sun, MEGVII Technology;
(9) Hongyu Zhou, MEGVII Technology;
(10) Haoran Wei, MEGVII Technology;
(11) Xiangwen Kong, MEGVII Technology;
(12) Xiangyu Zhang, MEGVII Technology and a Project leader;
(13) Kaisheng Ma, Tsinghua University and a Corresponding author;
(14) Li Yi, Tsinghua University, a Corresponding author and Project leader.

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

DreamLLM: Additional Related Works to Look Out For
