
The Model Training DreamLLM Underwent: Its Origin Story


Authors:

(1) Runpei Dong, Xi’an Jiaotong University and Internship at MEGVII;

(2) Chunrui Han, MEGVII Technology;

(3) Yuang Peng, Tsinghua University and Internship at MEGVII;

(4) Zekun Qi, Xi’an Jiaotong University and Internship at MEGVII;

(5) Zheng Ge, MEGVII Technology;

(6) Jinrong Yang, HUST and Internship at MEGVII;

(7) Liang Zhao, MEGVII Technology;

(8) Jianjian Sun, MEGVII Technology;

(9) Hongyu Zhou, MEGVII Technology;

(10) Haoran Wei, MEGVII Technology;

(11) Xiangwen Kong, MEGVII Technology;

(12) Xiangyu Zhang, MEGVII Technology (project leader);

(13) Kaisheng Ma, Tsinghua University (corresponding author);

(14) Li Yi, Tsinghua University (corresponding author and project leader).

Table of Links

Abstract and 1 Introduction

2 Background & Problem Statement

2.1 How Can We Use MLLMs for Diffusion Synthesis That Synergizes Both Sides?

3 DreamLLM

3.1 End-to-End Interleaved Generative Pretraining (I-GPT)

3.2 Model Training

4 Experiments and 4.1 Multimodal Comprehension

4.2 Text-Conditional Image Synthesis

4.3 Multimodal Joint Creation & Comprehension

5 Discussions

5.1 Synergy Between Creation & Comprehension?

5.2 What Is Learned by DreamLLM?

6 Related Works

7 Conclusions and References


A Additional Experiments

B Additional Qualitative Examples

C Implementation Details

D Additional Related Works

E Limitations, Failure Cases & Future Works

3.2 MODEL TRAINING

In this work, we adopt a three-stage training procedure, summarized as follows; implementation details, such as the training data, can be found in Table 11 in Appendix C.


I Alignment Training This stage alleviates the modality gap and facilitates the adaptation of multimodal inputs to LLMs. The linear visual projector, linear condition projector, and learnable dream embeddings are pretrained for cross-modal manifold alignment among the frozen LLM, visual encoder, and SD. We use approximately 30M image-text pairs, training both image-to-text comprehension and text-to-image synthesis.
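
As a rough sketch of the trainable/frozen split in this stage (not the released implementation; module names such as visual_projector, condition_projector, and dream_embeddings, and all dimensions, are assumptions):

```python
import torch
import torch.nn as nn

# Illustrative Stage-I alignment setup (assumed module names and sizes).
# Only the two linear projectors and the learnable dream embeddings receive
# gradients; the LLM, visual encoder, and SD U-Net stay frozen.
class AlignmentStage(nn.Module):
    def __init__(self, llm, visual_encoder, sd_unet,
                 vis_dim=1024, llm_dim=4096, cond_dim=768, num_dream_tokens=64):
        super().__init__()
        self.llm = llm                        # frozen language model
        self.visual_encoder = visual_encoder  # frozen image encoder
        self.sd_unet = sd_unet                # frozen Stable Diffusion U-Net

        # Trainable components for cross-modal manifold alignment.
        self.visual_projector = nn.Linear(vis_dim, llm_dim)      # image features -> LLM space
        self.condition_projector = nn.Linear(llm_dim, cond_dim)  # LLM states -> SD condition space
        self.dream_embeddings = nn.Parameter(torch.randn(num_dream_tokens, llm_dim))

        for module in (self.llm, self.visual_encoder, self.sd_unet):
            for p in module.parameters():
                p.requires_grad = False

    def trainable_parameters(self):
        yield from self.visual_projector.parameters()
        yield from self.condition_projector.parameters()
        yield self.dream_embeddings
```

An optimizer would then be built only over trainable_parameters(), so the ~30M image-text pairs serve purely to align the modalities rather than to update the backbones.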


II I-GPT Pretraining Following alignment, the LLM is unfrozen for I-GPT pretraining (detailed in Sec. 3.1). This critical stage facilitates the learning of joint vision-language distributions via generative modeling. Training incorporates approximately 2M documents from MMC4-Core (Zhu et al., 2023b), selectively filtered with a CLIP score threshold of 0.25. Furthermore, we use 2M paired samples from LAION400M (Schuhmann et al., 2021) captioned by BLIP (Li et al., 2022) (i.e., BLIP-LAION) to enhance text-to-image training and to potentially mitigate the impact of low-quality, noisy images and texts in sMMC4.
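
This kind of CLIP-score filtering can be sketched roughly as follows; the specific CLIP checkpoint, the clip_score / filter_document helpers, and the per-pair granularity are illustrative assumptions, not the authors' pipeline:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical filter: keep only image-text pairs whose CLIP cosine similarity
# exceeds a threshold (0.25, as used for the MMC4-Core documents here).
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_score(image: Image.Image, text: str) -> float:
    inputs = processor(text=[text], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()  # cosine similarity between image and text

def filter_document(images_with_captions, threshold=0.25):
    """Drop image-caption pairs of an interleaved document that score below the threshold."""
    return [(img, cap) for img, cap in images_with_captions
            if clip_score(img, cap) >= threshold]
```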


III Supervised Fine-tuning This stage enables the model to perform general multimodal comprehension and creation tasks following human instructions (Ouyang et al., 2022). We utilize approximately 80K visual instruction tuning samples collected by Liu et al. For instruction-following content creation, GPT-4 (OpenAI, 2023) is prompted with document summaries or image captions, yielding approximately 20K instruction-following document synthesis samples from MMC4 (InstructMMC4) and 20K image synthesis samples from BLIP-captioned LAION400M (Instruct-BLIP-LAION).
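
Purely as an illustration of this data-construction recipe (the actual prompts and output format are not given here; build_instruction_sample, the prompt wording, and the schema below are hypothetical):

```python
import json

# Hypothetical construction of an instruction-following image-synthesis sample
# (in the spirit of Instruct-BLIP-LAION): a prompted LLM (e.g., GPT-4) turns a
# caption into a plausible user instruction, and the caption conditions the image.
PROMPT_TEMPLATE = (
    "You are building an instruction-tuning dataset. Given the image caption below, "
    "write a natural user instruction asking an assistant to create such an image.\n"
    "Caption: {caption}\nInstruction:"
)

def build_instruction_sample(caption: str, ask_llm) -> dict:
    """ask_llm: callable that sends a prompt to an LLM API and returns its text reply."""
    instruction = ask_llm(PROMPT_TEMPLATE.format(caption=caption)).strip()
    return {
        "instruction": instruction,   # what the "user" asks the model to do
        "response_caption": caption,  # text that conditions the synthesized image
    }

if __name__ == "__main__":
    def fake_llm(prompt):  # stand-in for a real API call
        return "Please draw a golden retriever puppy playing in autumn leaves."
    sample = build_instruction_sample("a golden retriever puppy playing in autumn leaves", fake_llm)
    print(json.dumps(sample, indent=2))
```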


This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

