Here's How We Built DreamLLM: All of Its Components by @textmodels


Abstract and 1 Introduction

2 Background & Problem Statement

2.1 How can we use MLLMs for Diffusion Synthesis that Synergizes both sides?

3 DreamLLM

3.1 End-to-End Interleaved Generative Pretraining (I-GPT)

3.2 Model Training

4 Experiments and 4.1 Multimodal Comprehension

4.2 Text-Conditional Image Synthesis

4.3 Multimodal Joint Creation & Comprehension

5 Discussions

5.1 Synergy Between Creation & Comprehension?

5.2 What is Learned by DreamLLM?

6 Related Works

7 Conclusions and References


A Additional Experiments

B Additional Qualitative Examples

C Implementation Details

D Additional Related Works

E Limitations, Failure Cases & Future Works

3 DREAMLLM

We introduce DREAMLLM, a universal learning framework that facilitates both an MLLM's comprehension and creation capabilities. DREAMLLM is built on a causal decoder-only LLM Fθ as the model foundation, i.e., Vicuna (Chiang et al., 2023), which is based on LLaMA (Touvron et al., 2023a) and trained on ShareGPT (Zheng et al., 2023). We adopt OpenAI's CLIP-Large (Radford et al., 2021) as the visual encoder Hϕ, followed by a linear layer Mζ for visual embedding projection. To synthesize images, we use Stable Diffusion (SD) (Rombach et al., 2022) as the image decoder, and the condition projector Mψ is also a linear layer. An overview of the architecture is depicted in Fig. 2.


Figure 2: Overview of our DREAMLLM framework. Interleaved documents serve as input and are decoded to produce outputs. Both text and images are encoded into sequential, discrete token embeddings for the MLLM input. A special <dream> token predicts where to generate images. Subsequently, a series of dream queries are fed into the MLLM, capturing holistic historical semantics. Images are synthesized by the SD image decoder conditioned on the queried semantics, and the synthesized images are then fed back into the MLLM for subsequent comprehension.
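To make the component wiring concrete, the sketch below traces the interleaved flow from Fig. 2 in plain Python. Every class and function here is an illustrative stand-in, not the authors' implementation: `visual_encoder` mocks CLIP-Large Hϕ, `Linear` mocks the learned projectors Mζ and Mψ, `llm_step` mocks the causal LLM Fθ, and `sd_decoder` mocks Stable Diffusion; the dimensions, the dummy embeddings, and the query count are all arbitrary.

```python
# Illustrative sketch of the DREAMLLM forward flow (all modules mocked).
DREAM_TOKEN = "<dream>"

class Linear:
    """Stand-in for a learned linear projection (Mzeta or Mpsi)."""
    def __init__(self, out_dim):
        self.out_dim = out_dim
    def __call__(self, vecs):
        # Project each vector to out_dim (here: pad with zeros, then truncate).
        return [(v + [0.0] * self.out_dim)[: self.out_dim] for v in vecs]

def visual_encoder(image):
    """Stand-in for CLIP-Large Hphi: image -> patch embeddings."""
    return [[float(p)] * 4 for p in image]  # dummy 4-dim patch features

def llm_step(history):
    """Stand-in for the causal LLM Ftheta: here, just average the history."""
    d = len(history[0])
    return [sum(e[i] for e in history) / len(history) for i in range(d)]

def sd_decoder(condition):
    """Stand-in for the SD image decoder conditioned on queried semantics."""
    return [condition]  # a mock "image"

def interleaved_forward(sequence, num_dream_queries=2):
    """Process an interleaved document of text, images, and <dream> tokens."""
    m_zeta, m_psi = Linear(4), Linear(4)
    history, outputs = [], []
    for item in sequence:
        if isinstance(item, str) and item != DREAM_TOKEN:
            history.append([float(len(item))] * 4)       # dummy text embedding
        elif item == DREAM_TOKEN:
            # <dream> predicted: dream queries gather holistic semantics,
            # which condition the image decoder via the projector Mpsi.
            queries = [llm_step(history) for _ in range(num_dream_queries)]
            image = sd_decoder(m_psi(queries))
            outputs.append(("image", image))
            # The synthesized image is fed back for subsequent comprehension.
            history.extend(m_zeta(visual_encoder([1, 2])))
        else:  # a raw input image: encode and project into the MLLM stream
            history.extend(m_zeta(visual_encoder(item)))
    return outputs

outputs = interleaved_forward(["a cat", DREAM_TOKEN, "sitting"])
```

The key structural point the sketch preserves is that generation and comprehension share one token stream: the `<dream>` position triggers synthesis, and the result re-enters `history` so later text is conditioned on it.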


This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Runpei Dong, Xi’an Jiaotong University and Internship at MEGVII;

(2) Chunrui Han, MEGVII Technology;

(3) Yuang Peng, Tsinghua University and Internship at MEGVII;

(4) Zekun Qi, Xi’an Jiaotong University and Internship at MEGVII;

(5) Zheng Ge, MEGVII Technology;

(6) Jinrong Yang, HUST and Internship at MEGVII;

(7) Liang Zhao, MEGVII Technology;

(8) Jianjian Sun, MEGVII Technology;

(9) Hongyu Zhou, MEGVII Technology;

(10) Haoran Wei, MEGVII Technology;

(11) Xiangwen Kong, MEGVII Technology;

(12) Xiangyu Zhang, MEGVII Technology and project leader;

(13) Kaisheng Ma, Tsinghua University and corresponding author;

(14) Li Yi, Tsinghua University, corresponding author and project leader.