Table of Links

Abstract and 1 Introduction
2 Background & Problem Statement
2.1 How can we use MLLMs for Diffusion Synthesis that Synergizes both sides?
3 DreamLLM
3.1 End-to-End Interleaved generative Pretraining (I-GPT)
3.2 Model Training
4 Experiments and 4.1 Multimodal Comprehension
4.2 Text-Conditional Image Synthesis
4.3 Multimodal Joint Creation & Comprehension
5 Discussions
5.1 Synergy between creation & Comprehension?
5.2 What is learned by DreamLLM?
6 Related Works
7 Conclusions and References
A Additional Experiments
B Additional Qualitative Examples
C Implementation Details
D Additional Related Works
E Limitations, Failure Cases & Future Works

B ADDITIONAL QUALITATIVE EXAMPLES

Text-Conditional Image Synthesis

In Fig. 10 and Fig. 11, we show images generated by DREAMLLM using the same prompts as previous works for cross-reference and comparison, including DALL-E (Ramesh et al., 2021), DALL-E 2 (i.e., unCLIP) (Ramesh et al., 2022), GLIDE (Nichol et al., 2022), Imagen (Saharia et al., 2022), and Parti (Yu et al., 2022b). Similar to Parti, we have extended some prompts with new sub-prompts to construct additional examples.

Multimodal Dialogue

In Tables 9 and 10, we present a comparative analysis of visual question answering results between our model, DREAMLLM, and other state-of-the-art models: GPT-4 (OpenAI, 2023), LLaVA (Liu et al., 2023a), BLIP-2 (Li et al., 2022), and OpenFlamingo (Awadalla et al., 2023b). The key findings are as follows:

i) DREAMLLM surpasses GPT-4 in providing more detailed and precise responses to the given questions.
ii) While LLaVA (Liu et al., 2023a) also offers detailed responses, it frequently introduces imaginary elements not present in the image. In contrast, DREAMLLM delivers more accurate answers, effectively avoiding this visual hallucination issue. This observation aligns with our earlier findings in Table 7, which underscore the robustness of DREAMLLM against visual hallucination.

Furthermore, we showcase additional qualitative results of multimodal dialogue in Fig. 7, Fig. 8, and Fig. 9. These figures illustrate DREAMLLM’s proficiency in comprehending and generating long-context multimodal information in various input and output formats.

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

Authors:
(1) Runpei Dong, Xi’an Jiaotong University and Internship at MEGVII;
(2) Chunrui Han, MEGVII Technology;
(3) Yuang Peng, Tsinghua University and Internship at MEGVII;
(4) Zekun Qi, Xi’an Jiaotong University and Internship at MEGVII;
(5) Zheng Ge, MEGVII Technology;
(6) Jinrong Yang, HUST and Internship at MEGVII;
(7) Liang Zhao, MEGVII Technology;
(8) Jianjian Sun, MEGVII Technology;
(9) Hongyu Zhou, MEGVII Technology;
(10) Haoran Wei, MEGVII Technology;
(11) Xiangwen Kong, MEGVII Technology;
(12) Xiangyu Zhang, MEGVII Technology and a Project leader;
(13) Kaisheng Ma, Tsinghua University and a Corresponding author;
(14) Li Yi, Tsinghua University, a Corresponding author and Project leader.

DreamLLM: Additional Qualitative Examples That Show Off Its Power
