
Using MLLMs for Diffusion Synthesis That Synergizes Both Sides: How Is This Possible?

by Writings, Papers and Blogs on Text Models, November 23rd, 2024


Abstract and 1 Introduction

2 Background & Problem Statement

2.1 How can we use MLLMs for Diffusion Synthesis that Synergizes both sides?

3 DreamLLM

3.1 End-to-End Interleaved Generative Pretraining (I-GPT)

3.2 Model Training

4 Experiments and 4.1 Multimodal Comprehension

4.2 Text-Conditional Image Synthesis

4.3 Multimodal Joint Creation & Comprehension

5 Discussions

5.1 Synergy Between Creation & Comprehension?

5.2 What Is Learned by DreamLLM?

6 Related Works

7 Conclusions and References


A Additional Experiments

B Additional Qualitative Examples

C Implementation Details

D Additional Related Works

E Limitations, Failure Cases & Future Works

2.1 How Can We Use MLLMs for Diffusion Synthesis That Synergizes Both Sides?

Multimodal signals typically exhibit modality-specific information that has distinct structure but complementary semantics (Dong et al., 2023). This complementary property allows us to utilize deep language comprehension to enhance cross-modal image generation (Saharia et al., 2022). However, the potential of multimodal creation to improve comprehension remains largely unexplored.



Learning Objective. Our aim is to leverage MLLMs to model distributions via direct pixel-space sampling. Here, the pretrained Stable Diffusion (SD) model functions as a score metric, distilling the learned data distribution. This approach is similar to Score Distillation Sampling (SDS; Poole et al., 2023), also known as Score Jacobian Chaining (Wang et al., 2023a). In this context, the image posterior is learned in a DeepDream-like manner (Mordvintsev et al., 2015), using MLLMs’ conditional parameterization.
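To make the score-distillation idea concrete, here is a minimal one-dimensional sketch in plain Python. It is not DreamLLM's training code. It assumes a toy data distribution N(mu, s^2) whose optimal noise predictor is known in closed form and stands in for the frozen, pretrained diffusion model. The "generator" is a single scalar theta, and the weighting w(t) is set to 1. All function names and constants here are hypothetical.

```python
import math
import random


def toy_noise_predictor(x_t, alpha, sigma, mu=2.0, s=0.5):
    """Closed-form optimal noise prediction when the data follow N(mu, s^2).

    Assumption: this stands in for a frozen, pretrained diffusion model.
    """
    var_t = alpha ** 2 * s ** 2 + sigma ** 2  # variance of the noised marginal
    return sigma * (x_t - alpha * mu) / var_t


def sds_gradient(x, rng, batch=16):
    """Monte Carlo estimate of E_{t,eps}[w(t) * (eps_hat - eps)] with w(t) = 1."""
    total = 0.0
    for _ in range(batch):
        t = rng.uniform(0.02, 0.98)
        alpha = math.cos(0.5 * math.pi * t)  # signal scale, alpha^2 + sigma^2 = 1
        sigma = math.sin(0.5 * math.pi * t)  # noise scale
        eps = rng.gauss(0.0, 1.0)
        x_t = alpha * x + sigma * eps        # forward-diffused sample
        total += toy_noise_predictor(x_t, alpha, sigma) - eps
    return total / batch


def distill(theta=-3.0, steps=2000, lr=0.05, seed=0):
    """Update a generator x = theta (so dx/dtheta = 1) with score-distillation steps."""
    rng = random.Random(seed)
    for _ in range(steps):
        theta -= lr * sds_gradient(theta, rng)
    return theta


if __name__ == "__main__":
    # theta starts far from the data and is pulled toward the mode near mu = 2.0
    print(distill())
```

The frozen predictor is never updated; it only supplies a direction that moves the generator's sample toward regions the data distribution considers likely. In the setting described above, the generator parameters would belong to the MLLM and the predictor would be the pretrained SD model.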


Conditional Embeddings. Rather than converting the output space of MLLMs to align with CLIP, we propose to query MLLMs using learned embeddings. Consequently, MLLM-enriched semantics serve as diffusion conditioning, and the distribution is implicitly modeled through synthesis sampling.
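The querying step can be sketched roughly as follows, again in plain Python. This is an illustration under stated assumptions, not the paper's architecture. A single dot-product attention readout stands in for the full MLLM forward pass, the linear `projection` into the diffusion conditioning space is assumed, and all names and dimensions are hypothetical. In a real system the query embeddings and the projection would be trained.

```python
import math
import random


def attend(query, keys):
    """Single-head dot-product attention of one query over a list of vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)                        # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    dim = len(keys[0])
    return [sum(w * key[d] for w, key in zip(weights, keys)) / norm for d in range(dim)]


def query_conditioning(context, queries, projection):
    """Append learned queries to the context, read each out causally, then project.

    context:    hidden vectors for the multimodal prefix
    queries:    learned query embeddings (fixed here, trainable in practice)
    projection: cond_dim x hidden_dim matrix (an assumed linear projector)
    """
    sequence = list(context)
    conditioning = []
    for q in queries:
        sequence.append(q)                    # each query sees the prefix and earlier queries
        hidden = attend(q, sequence)          # stand-in for the MLLM forward pass
        conditioning.append([sum(w * h for w, h in zip(row, hidden)) for row in projection])
    return conditioning


def random_matrix(rows, cols, rng):
    """Helper that builds a rows x cols matrix of standard normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    context = random_matrix(5, 8, rng)        # 5 prefix positions, hidden size 8
    queries = random_matrix(3, 8, rng)        # 3 learned queries
    projection = random_matrix(4, 8, rng)     # conditioning size 4
    cond = query_conditioning(context, queries, projection)
    print(len(cond), len(cond[0]))
```

The design point is that the model's output space is never forced to match CLIP text embeddings. The queries pull out whatever semantics the model has built from the prefix, and the resulting vectors are passed to the diffusion decoder as its condition.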


This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.


Authors:

(1) Runpei Dong, Xi’an Jiaotong University and Internship at MEGVII;

(2) Chunrui Han, MEGVII Technology;

(3) Yuang Peng, Tsinghua University and Internship at MEGVII;

(4) Zekun Qi, Xi’an Jiaotong University and Internship at MEGVII;

(5) Zheng Ge, MEGVII Technology;

(6) Jinrong Yang, HUST and Internship at MEGVII;

(7) Liang Zhao, MEGVII Technology;

(8) Jianjian Sun, MEGVII Technology;

(9) Hongyu Zhou, MEGVII Technology;

(10) Haoran Wei, MEGVII Technology;

(11) Xiangwen Kong, MEGVII Technology;

(12) Xiangyu Zhang, MEGVII Technology and a Project leader;

(13) Kaisheng Ma, Tsinghua University and a Corresponding author;

(14) Li Yi, Tsinghua University, a Corresponding author and Project leader.

