Authors:
(1) Runpei Dong, Xi’an Jiaotong University and Internship at MEGVII;
(2) Chunrui Han, MEGVII Technology;
(3) Yuang Peng, Tsinghua University and Internship at MEGVII;
(4) Zekun Qi, Xi’an Jiaotong University and Internship at MEGVII;
(5) Zheng Ge, MEGVII Technology;
(6) Jinrong Yang, HUST and Internship at MEGVII;
(7) Liang Zhao, MEGVII Technology;
(8) Jianjian Sun, MEGVII Technology;
(9) Hongyu Zhou, MEGVII Technology;
(10) Haoran Wei, MEGVII Technology;
(11) Xiangwen Kong, MEGVII Technology;
(12) Xiangyu Zhang, MEGVII Technology and Project leader;
(13) Kaisheng Ma, Tsinghua University and Corresponding author;
(14) Li Yi, Tsinghua University, Corresponding author and Project leader.
2 Background & Problem Statement
2.1 How can we use MLLMs for Diffusion Synthesis that Synergizes both sides?
3.1 End-to-End Interleaved Generative Pretraining (I-GPT)
4 Experiments and 4.1 Multimodal Comprehension
4.2 Text-Conditional Image Synthesis
4.3 Multimodal Joint Creation & Comprehension
5 Discussions
5.1 Synergy between Creation & Comprehension?
5.2 What is learned by DreamLLM?
B Additional Qualitative Examples
E Limitations, Failure Cases & Future Works
In Table 11, we list the detailed training dataset usage and hyper-parameters. The training data are constructed based on the following datasets: a) LAION400M (Schuhmann et al., 2021); b) LAIONCOCO (Schuhmann et al., 2023); c) MMC4 (Zhu et al., 2023b); d) BLIP-LAION (Li et al., 2022), which is filtered and captioned by BLIP (Li et al., 2022); e) LLaVAPretrain (Liu et al., 2023a), which contains 558K image-text pairs from BLIP-captioned CC3M (Sharma et al., 2018), SBU (Ordonez et al., 2011), and LAION400M filtered by LLaVA; f) LLaVAInstruct (Liu et al., 2023a), which contains 80K visual instruction-following examples constructed by LLaVA; g) InstructMMC4, our instruction-following interleaved document generation data, curated by prompting GPT-4 to generate instructions based on the text contents of MMC4; and h) Instruct-BLIP-LAION, our instruction-following image synthesis data, which, similar to InstructMMC4, is curated by prompting GPT-4 to generate instructions based on image captions. Unless otherwise specified, we randomly sample the indicated number of instances from each dataset during the training process.
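The per-dataset sampling described above can be sketched as follows. This is a minimal illustration, not the paper's data pipeline: the dataset names are real, but the sizes and per-dataset sample counts shown here are placeholders (the actual counts are given in Table 11).

```python
import random

# Placeholder sizes and per-dataset sample counts for illustration only;
# the real values are listed in Table 11 of the paper.
dataset_sizes = {"LAION400M": 10, "MMC4": 8, "BLIP-LAION": 6}
sample_counts = {"LAION400M": 4, "MMC4": 3, "BLIP-LAION": 2}

def build_training_mix(sizes, counts, seed=0):
    """Randomly sample the indicated number of instances from each dataset,
    then shuffle the combined pool into one training mix."""
    rng = random.Random(seed)
    mix = []
    for name, n in counts.items():
        indices = rng.sample(range(sizes[name]), n)  # without replacement
        mix.extend((name, i) for i in indices)
    rng.shuffle(mix)
    return mix

mix = build_training_mix(dataset_sizes, sample_counts)
```

Sampling without replacement per dataset keeps the mixture ratios fixed regardless of the raw dataset sizes.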
Language Model We use LLaMA-1 (Touvron et al., 2023a) trained on ShareGPT (Zheng et al., 2023) as the default LLM (i.e., Vicuna-7B[1] (Chiang et al., 2023)), following Liu et al. (2023a), to endow it with instruction-following capacity. During training, we use Flash Attention (Dao et al., 2022) and PyTorch FSDP (Zhao et al., 2023b) to accelerate training.
Visual Encoder The visual encoder is the publicly available OpenAI CLIP-L/14 (Radford et al., 2021) model, which is frozen during the whole process. The images are resized to 224×224 resolution to align with the CLIP pretraining settings, resulting in a sequence of 256 tokens for each image. Following prior VL practice (Lu et al., 2019; Liu et al., 2023a), we prepend a special begin-of-image token to the image token sequence and append a special end-of-image token at its end.
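The 256-token count follows directly from the CLIP ViT-L/14 patch geometry: a 224×224 image divided into 14×14 patches yields a 16×16 grid. A one-line check:

```python
def num_image_tokens(image_size=224, patch_size=14):
    """Number of patch tokens a ViT produces: (image_size / patch_size)**2.
    CLIP-L/14 at 224x224 gives a 16x16 grid, i.e., 256 patch tokens."""
    per_side = image_size // patch_size
    return per_side * per_side

print(num_image_tokens())  # 256
```

Note this counts patch tokens only; any special begin/end tokens wrapped around the sequence are extra.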
Diffusion Image Decoder We adopt SDv2.1 (Rombach et al., 2022), trained at 512×512 resolution, as the default diffusion image decoder. As with the visual encoder, the SD model is frozen, without any modifications or training, throughout the whole process. When constructing the SD target to compute the MSE loss, we resize the images to 512×512 resolution to match its pretraining configuration.
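For intuition, the supervision signal has the standard denoising form: a mean-squared error between the predicted and true noise. The sketch below shows only that MSE form in NumPy; the noise-prediction U-Net, noise schedule, and latent encoding of the frozen SD model are all omitted.

```python
import numpy as np

def ddpm_mse_loss(eps_pred, eps_true):
    """Denoising MSE objective: mean squared error between the predicted
    noise and the true noise added during the forward diffusion process."""
    return float(np.mean((eps_pred - eps_true) ** 2))

rng = np.random.default_rng(0)
eps = rng.standard_normal((4, 4))
loss_zero = ddpm_mse_loss(eps, eps)  # a perfect prediction gives loss 0.0
```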
Dream Query We use dream queries to gather semantic context from the MLLM, as introduced in Sec. 3. Unless otherwise specified, we use 64 learnable query embeddings, which is both efficient and effective for generating high-quality images. To predict when to generate images, we also introduce a special begin token that is prepended to the dream query sequence; a corresponding end token is appended at the end, similar to image inputs.
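The resulting sequence layout can be illustrated as below. The token strings here are hypothetical placeholders; the actual special tokens are defined by the DreamLLM tokenizer, and the 64 query slots hold learnable embeddings rather than discrete tokens.

```python
# Hypothetical token names for illustration only; the real special tokens
# are defined by the DreamLLM vocabulary.
BEGIN_DREAM, END_DREAM = "<begin_dream>", "<end_dream>"
NUM_QUERIES = 64

def dream_query_span():
    """Lay out the begin token, the 64 dream query slots, and the end token,
    mirroring how image sequences are wrapped with begin/end tokens."""
    return [BEGIN_DREAM] + [f"<query_{i}>" for i in range(NUM_QUERIES)] + [END_DREAM]

span = dream_query_span()
```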
Systematic evaluations of DREAMLLM regarding VL comprehension, content creation, and NLP capabilities have been conducted. The benchmarks and datasets used are listed in Table 11, and the prompt templates used during evaluation are shown in Fig. 12.
This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.
[1] Vicuna-7B v1.1: https://huggingface.co/lmsys/vicuna-7b-v1.1.