Authors:
(1) Runpei Dong, Xi’an Jiaotong University and Internship at MEGVII;
(2) Chunrui Han, MEGVII Technology;
(3) Yuang Peng, Tsinghua University and Internship at MEGVII;
(4) Zekun Qi, Xi’an Jiaotong University and Internship at MEGVII;
(5) Zheng Ge, MEGVII Technology;
(6) Jinrong Yang, HUST and Internship at MEGVII;
(7) Liang Zhao, MEGVII Technology;
(8) Jianjian Sun, MEGVII Technology;
(9) Hongyu Zhou, MEGVII Technology;
(10) Haoran Wei, MEGVII Technology;
(11) Xiangwen Kong, MEGVII Technology;
(12) Xiangyu Zhang, MEGVII Technology and Project Leader;
(13) Kaisheng Ma, Tsinghua University and Corresponding Author;
(14) Li Yi, Tsinghua University, Corresponding Author and Project Leader.
2 Background & Problem Statement
2.1 How can we use MLLMs for Diffusion Synthesis that Synergizes both sides?
3.1 End-to-End Interleaved Generative Pretraining (I-GPT)
4 Experiments and 4.1 Multimodal Comprehension
4.2 Text-Conditional Image Synthesis
4.3 Multimodal Joint Creation & Comprehension
5 Discussions
5.1 Synergy between Creation & Comprehension?
5.2 What is Learned by DreamLLM?
B Additional Qualitative Examples
E Limitations, Failure Cases & Future Works
DREAMLLM is a versatile multimodal generalist that excels at zero-shot and in-context vision-language comprehension and synthesis tasks. In this section, we conduct systematic evaluations for demonstration. See qualitative results in Appendix B and implementation details in Appendix C.
Multimodal comprehension enables humans to interact with agents conditioned on both words and visual content. We evaluate the multimodal vision and language capabilities of DREAMLLM across several benchmarks: image-to-text captioning on COCO (Karpathy & Fei-Fei, 2017) and Image2Paragraph (Krause et al., 2017); general visual question answering (VQA) on VQAv2 (Goyal et al., 2019), OKVQA (Marino et al., 2019), and VizWiz (Gurari et al., 2018); and text-related VQA on TextVQA (Singh et al., 2019). Additionally, we conduct a zero-shot evaluation on the recently developed MMBench and MM-Vet benchmarks to assess the model’s performance on complex multimodal tasks. The results are presented in Table 1 (see also Table 5 and Table 6 in Appendix A); all metrics and data splits are listed in Table 12 in Appendix C. We find that i) DREAMLLM outperforms other MLLMs across all benchmarks. Notably, DREAMLLM-7B surpasses concurrent MLLMs with image synthesis capabilities by a significant margin, achieving +16.6 higher accuracy on VQAv2 compared to Emu-13B. ii) On comprehensive benchmarks such as MMBench and MM-Vet, DREAMLLM achieves state-of-the-art performance against all 7B counterparts. Detailed analysis reveals superior spatial/relational reasoning in DREAMLLM compared to other MLLMs, likely a result of its image synthesis learning. See qualitative results and comparisons on multimodal dialogue in Table 9, Table 10, Fig. 7, Fig. 8, and Fig. 9 in Appendix B.
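For readers unfamiliar with how VQA accuracy figures such as the "+16.6 on VQAv2" above are computed, the sketch below illustrates the standard soft-accuracy metric used by VQAv2-style benchmarks: each question has multiple human answers, and a prediction scores min(#matching humans / 3, 1). This is a simplified illustration, not the paper's evaluation code (the official evaluator also averages over leave-one-annotator-out subsets and applies extra answer normalization); function names here are our own.

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Soft accuracy for one question: a prediction is fully correct
    if at least 3 of the human annotators gave the same answer."""
    pred = prediction.strip().lower()
    matches = sum(1 for ans in human_answers if ans.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


def benchmark_accuracy(predictions: list[str],
                       references: list[list[str]]) -> float:
    """Mean per-question soft accuracy, reported as a percentage."""
    scores = [vqa_accuracy(p, refs) for p, refs in zip(predictions, references)]
    return 100.0 * sum(scores) / len(scores)
```

Under this metric, a model answering "cat" when 5 of 10 annotators said "cat" scores 1.0 for that question, while an answer matching only 2 annotators scores 2/3; benchmark numbers are the mean over all questions.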
This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.