Authors:
(1) Runpei Dong, Xi’an Jiaotong University and Internship at MEGVII;
(2) Chunrui Han, MEGVII Technology;
(3) Yuang Peng, Tsinghua University and Internship at MEGVII;
(4) Zekun Qi, Xi’an Jiaotong University and Internship at MEGVII;
(5) Zheng Ge, MEGVII Technology;
(6) Jinrong Yang, HUST and Internship at MEGVII;
(7) Liang Zhao, MEGVII Technology;
(8) Jianjian Sun, MEGVII Technology;
(9) Hongyu Zhou, MEGVII Technology;
(10) Haoran Wei, MEGVII Technology;
(11) Xiangwen Kong, MEGVII Technology;
(12) Xiangyu Zhang, MEGVII Technology and Project Leader;
(13) Kaisheng Ma, Tsinghua University and Corresponding Author;
(14) Li Yi, Tsinghua University, Corresponding Author and Project Leader.
2 Background & Problem Statement
2.1 How can we use MLLMs for Diffusion Synthesis that Synergizes both sides?
3.1 End-to-End Interleaved Generative Pretraining (I-GPT)
4 Experiments and 4.1 Multimodal Comprehension
4.2 Text-Conditional Image Synthesis
4.3 Multimodal Joint Creation & Comprehension
5 Discussions
5.1 Synergy between Creation & Comprehension?
5.2 What is learned by DreamLLM?
B Additional Qualitative Examples
E Limitations, Failure Cases & Future Works
Rapid developments have been witnessed in extending LLMs such as LLaMA (Touvron et al., 2023a) to multimodal comprehension, enabling humans to interact with both words and visual content. One line of work builds systems that integrate LLMs with various functioning agents, where language acts as a general interface (Wu et al., 2023; Gupta & Kembhavi, 2023; Yang et al., 2023b; Liang et al., 2023; Shen et al., 2023; Yang et al., 2023a; Surís et al., 2023), and remarkable success has been demonstrated in such plugin-style frameworks. Another line of work instead trains LLMs to consume and understand multimodal inputs (Hao et al., 2022; Huang et al., 2023; Chen et al., 2023b) with parameter-efficient tuning (Hu et al., 2022; Alayrac et al., 2022; Li et al., 2023b; Zhang et al., 2023d; Zhu et al., 2023a; Ye et al., 2023) and instruction tuning (Xu et al., 2023; Liu et al., 2023a; Dai et al., 2023). More recently, some approaches have moved toward visually interactive multimodal comprehension through precise referring instruction tuning (Zhao et al., 2023a; Peng et al., 2023; Chen et al., 2023a; Zhang et al., 2023f). For cross-modal creation, early works generally tokenize visual content into discrete VQ codebooks (van den Oord et al., 2017; Wang et al., 2022; Lu et al., 2023; Diao et al., 2023; Yu et al., 2023a). Recent works instead incorporate MLLMs into image synthesis with text-to-image models such as Stable Diffusion, where the objective is to generate conditional embeddings that align with pretrained CLIP text embeddings (i.e., CLIP) or CLIP-variant embeddings (Koh et al., 2023; Ge et al., 2023; Sun et al., 2023a;b).
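To make the last point concrete, the sketch below illustrates the CLIP-alignment objective used by such MLLM-to-diffusion pipelines: a learned projection maps MLLM hidden states into the space of a frozen CLIP text encoder, and the projected embeddings later serve as the diffusion model's conditioning. This is a minimal illustration under assumed module names and dimensions, not the implementation of any cited work.

```python
# Minimal sketch (illustrative, not any cited work's code) of aligning MLLM
# output embeddings with frozen CLIP text embeddings for diffusion conditioning.
import torch
import torch.nn as nn

MLLM_DIM, CLIP_DIM, SEQ_LEN, BATCH = 4096, 768, 77, 2  # assumed sizes

# Stand-ins for frozen CLIP text-encoder outputs and MLLM hidden states.
clip_text_embeds = torch.randn(BATCH, SEQ_LEN, CLIP_DIM)    # alignment target (frozen)
mllm_hidden_states = torch.randn(BATCH, SEQ_LEN, MLLM_DIM)  # produced by the MLLM

# Trainable projection from the MLLM space into the CLIP conditioning space.
projector = nn.Sequential(
    nn.Linear(MLLM_DIM, CLIP_DIM),
    nn.GELU(),
    nn.Linear(CLIP_DIM, CLIP_DIM),
)

pred_condition = projector(mllm_hidden_states)

# Alignment objective: regress projected MLLM embeddings onto the frozen CLIP
# text embeddings so a pretrained diffusion U-Net can consume them unchanged.
align_loss = nn.functional.mse_loss(pred_condition, clip_text_embeds)
align_loss.backward()

# At inference, `pred_condition` would replace the CLIP text embeddings as the
# cross-attention conditioning of a text-to-image model such as Stable Diffusion.
```

As discussed in the following sections, conditioning on such intermediate CLIP-aligned embeddings is precisely the design choice this paper revisits.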
This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.