
Limitations, Failure Cases & Future Works


Authors:

(1) Runpei Dong, Xi’an Jiaotong University and Internship at MEGVII;

(2) Chunrui Han, MEGVII Technology;

(3) Yuang Peng, Tsinghua University and Internship at MEGVII;

(4) Zekun Qi, Xi’an Jiaotong University and Internship at MEGVII;

(5) Zheng Ge, MEGVII Technology;

(6) Jinrong Yang, HUST and Internship at MEGVII;

(7) Liang Zhao, MEGVII Technology;

(8) Jianjian Sun, MEGVII Technology;

(9) Hongyu Zhou, MEGVII Technology;

(10) Haoran Wei, MEGVII Technology;

(11) Xiangwen Kong, MEGVII Technology;

(12) Xiangyu Zhang, MEGVII Technology and Project Leader;

(13) Kaisheng Ma, Tsinghua University and Corresponding Author;

(14) Li Yi, Tsinghua University, Corresponding Author and Project Leader.

Table of Links

Abstract and 1 Introduction

2 Background & Problem Statement

2.1 How Can We Use MLLMs for Diffusion Synthesis That Synergizes Both Sides?

3 DreamLLM

3.1 End-to-End Interleaved Generative Pretraining (I-GPT)

3.2 Model Training

4 Experiments and 4.1 Multimodal Comprehension

4.2 Text-Conditional Image Synthesis

4.3 Multimodal Joint Creation & Comprehension

5 Discussions

5.1 Synergy Between Creation & Comprehension?

5.2 What is Learned by DreamLLM?

6 Related Works

7 Conclusions and References


A Additional Experiments

B Additional Qualitative Examples

C Implementation Details

D Additional Related Works

E Limitations, Failure Cases & Future Works

E LIMITATIONS, FAILURE CASES & FUTURE WORKS

Limitations While DREAMLLM has made significant strides toward the development of versatile, creative, and foundational MLLMs, it still has several limitations.


Model scale. The primary constraint pertains to the scale of the LLMs utilized. Current evaluations mainly employ 7B LLMs as the base model, and despite the impressive results garnered, the potential benefits of larger model sizes, such as 65B or 130B (Kaplan et al., 2020), are worth future exploration.


Training data. The second challenge relates to the quality and quantity of training data (Jia et al., 2021). As the model size and capabilities scale up, a corresponding increase in data is crucial. However, the procurement and refinement of high-quality training data present substantial logistical and financial hurdles. For instance, the open-source interleaved dataset MMC4 contains a significant amount of noise in the form of text and images, like commercial advertisements. This noise could adversely affect the model’s output language and image style.


Prompt sensitivity. The sensitivity of LLMs to human prompts is a known issue (Wei et al., 2022b; Wang et al., 2023b; Zhou et al., 2023), and the challenge extends to MLLMs. For instance, MLLMs’ propensity for detailed responses necessitates tailored prompting to elicit concise answers, which is particularly important when addressing Visual Question Answering (VQA) tasks.
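
To make the point concrete, below is a minimal sketch of how a VQA prompt might be tailored to request a short answer. The "<image>" token and the build_vqa_prompt helper are illustrative assumptions made for this example only, not DREAMLLM's actual evaluation template (the templates used in the paper are summarized in Figure 12).

```python
# A minimal, illustrative sketch -- not DreamLLM's actual evaluation code.
# The "<image>" placeholder and the build_vqa_prompt helper are assumptions
# made for this example only.

def build_vqa_prompt(image_token: str, question: str, concise: bool = True) -> str:
    """Compose a single-turn VQA prompt for an MLLM.

    image_token: placeholder standing in for the encoded image representation.
    question:    the natural-language question about the image.
    concise:     when True, append an instruction requesting a short answer,
                 which benchmarks such as VQAv2 typically expect.
    """
    suffix = " Answer the question using a single word or phrase." if concise else ""
    return f"{image_token}\nQuestion: {question}{suffix}\nAnswer:"


# The same question, with and without the brevity instruction:
print(build_vqa_prompt("<image>", "What color is the bus?"))
print(build_vqa_prompt("<image>", "What color is the bus?", concise=False))
```

Without the appended instruction, an MLLM tuned for dialogue will often produce a full descriptive sentence, which short-answer VQA metrics penalize even when the content is correct.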


Failure Cases The main failure cases of DREAMLLM are observed in content creation conditioned on multiple images. For instance, when presented with two images and a composite instruction such as “A and B”, DREAMLLM sometimes generates a single subject that amalgamates the characteristics of A and B, an output that aligns more closely with the directive “A like B”. This phenomenon is not unique to DREAMLLM but is also observed in specialized compositional generation methodologies, such as Structure Diffusion (Feng et al., 2023; Chefer et al., 2023). This recurring issue may be attributed to the inherent complexity of compositional generation tasks, compounded by the severe scarcity of data specific to this domain.


Future Works DREAMLLM is a simple and general multimodal learning framework, and our future work aims to enhance it by integrating fine-grained visual comprehension via methods like precise referring instruction tuning (Zhao et al., 2023a). We also plan to expand beyond visual and linguistic content comprehension and generation. Several promising research directions include:


• Exploring applications of in-context generation capabilities of DREAMLLM to complex tasks such as image-to-image translation (Isola et al., 2017; Zhang et al., 2023c; Zhang & Agrawala, 2023).


• Utilizing DREAMLLM’s context consistency feature for geometry-preserving tasks, including 3D content creation (Poole et al., 2023; Qi et al., 2023b; Liu et al., 2023b), representation learning (Dong et al., 2023; Qi et al., 2023a; Zhang et al., 2023a;e), scene comprehension (Zhang et al., 2023b; Hong et al., 2023), and embodied artificial intelligence (Ichter et al., 2022).


• Striving to achieve a unified multimodal zero-shot generalist by extending the scope to various modalities using techniques such as ImageBind (Girdhar et al., 2023) and exploring content creation models in other modalities like audio (Kong et al., 2021).


Figure 7: Qualitative examples of multimodal dialogue between human and DREAMLLM. Various modalities can be used as inputs or outputs, and multi-round dialogue is shown.


Figure 8: Qualitative examples of multimodal dialogue between human and DREAMLLM. Various modalities can be used as inputs or outputs, and multi-round dialogue is shown.


Figure 9: Qualitative examples of multimodal dialogue between human and DREAMLLM. Various modalities can be used as inputs or outputs, and multi-round dialogue is shown.


Figure 10: DREAMLLM text-conditional image generation examples with prompts from (a-b) DALL-E (Ramesh et al., 2021), (c-d) DALL-E 2 (Ramesh et al., 2022), and (e-f) GLIDE (Nichol et al., 2022).


Figure 11: DREAMLLM text-conditional image generation examples with prompts from (a-c) Imagen and DrawBench (Saharia et al., 2022), (d-f) Parti (i.e., PartiPrompts or P2) (Yu et al., 2022b).


Figure 12: Prompt templates used for vision-language evaluations. (a) Short image captioning includes COCO captioning, and (b) long image captioning includes Image2Paragraph. (c) Short VQA includes VQAv2, VizWiz, OKVQA, and TextVQA. (d) Long VQA includes MMBench and MM-Vet. (e) Visual hallucination includes POPE. Placeholder tokens in the templates denote the input image representation, each specific question, the generated answer, and the specific object name in a POPE question.


This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.

