Authors:
(1) Yichen Zhu, Midea Group;
(2) Minjie Zhu, Midea Group and East China Normal University;
(3) Ning Liu, Midea Group;
(4) Zhicai Ou, Midea Group;
(5) Xiaofeng Mou, Midea Group.
We present several examples that demonstrate the remarkable generalization capabilities of LLaVA-Phi, comparing its outputs with those of the LLaVA-1.5-13B model. In Figure 1, a meme is displayed, and we ask the vision-language assistant to explain why this meme is considered humorous. While LLaVA-1.5-13B provides a reasonable interpretation based on the image, LLaVA-Phi's response is more empathetic, highlighting the humor by associating the dog's "laid-back demeanor" with the "stress or fatigue" typically associated with a "new workweek".
In the second example, we instructed the model to generate Python code for converting an Excel table into a bar chart, as illustrated in Figure 2. LLaVA-1.5-13B generated a simplistic code snippet that only reads the table and prints it, failing to follow the instruction to create a plot. In contrast, LLaVA-Phi accurately comprehended the task, providing code that reads the table, adds a title and axis labels, and correctly plots the bar chart using matplotlib. We believe this enhanced code generation capability stems from Phi-2, which was pre-trained on a large corpus of code snippets and is primarily used for code generation.
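For reference, a minimal sketch of the kind of pandas/matplotlib script described above is shown below. This is our illustration, not the model's verbatim output; the file name data.xlsx and the two-column table layout are assumptions made for the example.

```python
# Minimal sketch (assumed setup): a file "data.xlsx" whose first column
# holds category labels and whose second column holds numeric values.
import pandas as pd
import matplotlib.pyplot as plt

# Read the Excel table into a DataFrame.
df = pd.read_excel("data.xlsx")

# Plot the second column as a bar chart, indexed by the first column.
plt.bar(df.iloc[:, 0], df.iloc[:, 1])

# Add a title and axis labels, as the instruction in Figure 2 requires.
plt.title("Values by Category")
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])

plt.tight_layout()
plt.savefig("bar_chart.png")
```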
The third challenge involves solving a simple math problem, requiring the model to accurately recognize text through OCR and then perform the necessary mathematical computations, as shown in Figure 3. LLaVA-1.5-13B, while providing a step-by-step computation based on the image, incorrectly recognized the numbers and mathematical symbols. In contrast, our proposed LLaVA-Phi, without producing chain-of-thought reasoning, still arrives at the correct answer. Our quantitative results on ScienceQA further confirm that LLaVA-Phi excels in these types of question-answering tasks.
This paper is available on arxiv under CC BY 4.0 DEED license.