
DreamLLM: Synergistic Multimodal Comprehension and Creation: DreamLLM


Authors:

(1) Runpei Dong, Xi’an Jiaotong University and Internship at MEGVII;

(2) Chunrui Han, MEGVII Technology;

(3) Yuang Peng, Tsinghua University and Internship at MEGVII;

(4) Zekun Qi, Xi’an Jiaotong University and Internship at MEGVII;

(5) Zheng Ge, MEGVII Technology;

(6) Jinrong Yang, HUST and Internship at MEGVII;

(7) Liang Zhao, MEGVII Technology;

(8) Jianjian Sun, MEGVII Technology;

(9) Hongyu Zhou, MEGVII Technology;

(10) Haoran Wei, MEGVII Technology;

(11) Xiangwen Kong, MEGVII Technology;

(12) Xiangyu Zhang, MEGVII Technology and Project Leader;

(13) Kaisheng Ma, Tsinghua University and Corresponding Author;

(14) Li Yi, Tsinghua University, Corresponding Author and Project Leader.

Table of Links

Abstract and 1 Introduction

2 Background & Problem Statement

2.1 How Can We Use MLLMs for Diffusion Synthesis That Synergizes Both Sides?

3 DreamLLM

3.1 End-to-End Interleaved Generative Pretraining (I-GPT)

3.2 Model Training

4 Experiments and 4.1 Multimodal Comprehension

4.2 Text-Conditional Image Synthesis

4.3 Multimodal Joint Creation & Comprehension

5 Discussions

5.1 Synergy Between Creation & Comprehension?

5.2 What is Learned by DreamLLM?

6 Related Works

7 Conclusions and References


A Additional Experiments

B Additional Qualitative Examples

C Implementation Details

D Additional Related Works

E Limitations, Failure Cases & Future Works

3 DREAMLLM

We introduce DREAMLLM, a universal learning framework that facilitates both the comprehension and creation capabilities of MLLMs. DREAMLLM is built on a causal decoder-only LLM Fθ as its model foundation, i.e., Vicuna (Chiang et al., 2023), which is based on LLaMA (Touvron et al., 2023a) and trained on ShareGPT (Zheng et al., 2023). We adopt OpenAI's CLIP-Large (Radford et al., 2021) as the visual encoder Hϕ, followed by a linear layer Mζ for visual embedding projection. To synthesize images, we use Stable Diffusion (SD) (Rombach et al., 2022) as the image decoder; the condition projector Mψ is also a linear layer. An overview of the architecture is depicted in Fig. 2.


Figure 2: Overview of our DREAMLLM framework. Interleaved documents serve as input and are decoded to produce interleaved outputs. Both text and images are encoded into sequential, discrete token embeddings for the MLLM input. A special <dream> token predicts where to generate images. Subsequently, a series of dream queries is fed into the MLLM to capture holistic historical semantics. Images are synthesized by the SD image decoder conditioned on these queried semantics, and the synthesized images are then fed back into the MLLM for subsequent comprehension.
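To make the architecture concrete, the sketch below wires the named components (Fθ, Hϕ, Mζ, Mψ, and the dream queries) into a single PyTorch module. It is a minimal illustration under assumed interfaces, not the authors' released implementation: the LLM is treated as a Hugging Face-style causal LM, the dream queries are modeled as learnable embeddings, and the class name, hidden sizes (1024 for CLIP-Large, 4096 for the LLM, 768 for the SD condition space), and the number of dream queries are placeholder choices.

```python
import torch
import torch.nn as nn

class DreamLLM(nn.Module):
    """Minimal sketch of the DREAMLLM components (not the official code).

    `llm`, `visual_encoder`, and `image_decoder` are assumed to be supplied
    externally (e.g., a Vicuna causal LM, a CLIP-Large vision encoder, and a
    Stable Diffusion decoder); the dimensions below are illustrative defaults.
    """

    def __init__(self, llm, visual_encoder, image_decoder,
                 d_visual=1024, d_model=4096, d_cond=768, n_dream_queries=64):
        super().__init__()
        self.llm = llm                        # F_theta: causal decoder-only LLM (Vicuna/LLaMA)
        self.visual_encoder = visual_encoder  # H_phi: CLIP-Large image encoder
        self.image_decoder = image_decoder    # Stable Diffusion image decoder
        # M_zeta: linear projection from CLIP features to the LLM embedding space.
        self.visual_proj = nn.Linear(d_visual, d_model)
        # M_psi: linear condition projector from LLM space to the SD condition space.
        self.cond_proj = nn.Linear(d_model, d_cond)
        # Dream queries, modeled here as learnable embeddings, that gather
        # holistic historical semantics for image synthesis.
        self.dream_queries = nn.Parameter(torch.randn(n_dream_queries, d_model) * 0.02)

    def encode_image(self, pixels):
        """Turn an image into a sequence of token embeddings for the MLLM input."""
        feats = self.visual_encoder(pixels)      # (B, N_patches, d_visual)
        return self.visual_proj(feats)           # (B, N_patches, d_model)

    def image_condition(self, history_embeds):
        """Append dream queries to the history and project their states for SD."""
        B = history_embeds.size(0)
        queries = self.dream_queries.unsqueeze(0).expand(B, -1, -1)
        inputs = torch.cat([history_embeds, queries], dim=1)
        hidden = self.llm(inputs_embeds=inputs,
                          output_hidden_states=True).hidden_states[-1]
        query_states = hidden[:, -queries.size(1):]   # states at the query positions
        return self.cond_proj(query_states)           # conditioning embeddings for SD
```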


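The interleaved behavior described in the caption, emitting text until the special <dream> token appears, then querying semantics, synthesizing an image with SD, and feeding it back for comprehension, can be sketched as a simple greedy decoding loop. This continues the class above; the function name, the Hugging Face-style calls (`get_input_embeddings`, `.logits`), the greedy sampling, and the placeholder SD call are assumptions for illustration rather than the paper's exact inference procedure.

```python
import torch

def generate_interleaved(model, tokenizer, prompt_ids, dream_token_id, max_steps=256):
    """Greedy interleaved text-and-image generation (illustrative only).

    `model` is the DreamLLM sketch above; `prompt_ids` has shape (1, T).
    """
    embed = model.llm.get_input_embeddings()
    context = embed(prompt_ids)                      # (1, T, d_model)
    outputs = []
    for _ in range(max_steps):
        logits = model.llm(inputs_embeds=context).logits[:, -1]
        next_id = logits.argmax(dim=-1)              # (1,)
        if next_id.item() == dream_token_id:
            # The <dream> token marks where an image should be generated.
            cond = model.image_condition(context)    # queried holistic semantics
            image = model.image_decoder(cond)        # placeholder SD call, assumed to return pixels
            outputs.append(("image", image))
            # Feed the synthesized image back for subsequent comprehension.
            context = torch.cat([context, model.encode_image(image)], dim=1)
        else:
            outputs.append(("text", tokenizer.decode(next_id)))
            context = torch.cat([context, embed(next_id).unsqueeze(1)], dim=1)
    return outputs
```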


This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.

