Authors:
(1) Yichen Zhu, Midea Group;
(2) Minjie Zhu, Midea Group and East China Normal University;
(3) Ning Liu, Midea Group;
(4) Zhicai Ou, Midea Group;
(5) Xiaofeng Mou, Midea Group.
We present several examples that demonstrate the remarkable generalization capabilities of LLaVA-Phi, comparing its outputs with those of the LLaVA-1.5-13B model. In Figure 1, a meme is displayed, and we ask the vision-language assistant to explain why this meme is considered humorous. While LLaVA-1.5-13B provides a reasonable interpretation based on the image, LLaVA-Phi's response is more empathetic, highlighting the humor by associating the dog's "laid-back demeanor" with the "stress or fatigue" typically associated with a "new workweek".
In the second example, we instructed the model to generate Python code for converting an Excel table into a bar chart, as illustrated in Figure 2. LLaVA-1.5-13B generated a simplistic code snippet that only reads the table and prints it, failing to follow the instruction to create a plot. In contrast, LLaVA-Phi accurately comprehended the task, providing code that reads the table, adds a title and axis labels, and correctly plots the bar chart using matplotlib. We believe this enhanced code generation capability stems from Phi-2, which was pre-trained on a large corpus of code snippets and is primarily used for code generation.
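For reference, a minimal sketch of the kind of pandas/matplotlib script described above is shown below. This is our illustration, not the model's verbatim output; the file name data.xlsx and the two-column table layout are assumptions made for the example.

```python
# Minimal sketch (assumed setup): a file "data.xlsx" whose first column
# holds category labels and whose second column holds numeric values.
import pandas as pd
import matplotlib.pyplot as plt

# Read the Excel table into a DataFrame.
df = pd.read_excel("data.xlsx")

# Plot the second column as a bar chart, indexed by the first column.
plt.bar(df.iloc[:, 0], df.iloc[:, 1])

# Add a title and axis labels, as the instruction in Figure 2 requires.
plt.title("Values by Category")
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])

plt.tight_layout()
plt.savefig("bar_chart.png")
```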
The third challenge involves solving a simple math problem, requiring the model to accurately recognize text through OCR and then perform the necessary mathematical computations, as shown in Figure 3. LLaVA-1.5-13B, while providing a step-by-step computation based on the image, incorrectly recognized the numbers and mathematical symbols. In contrast, our proposed LLaVA-Phi, without producing chain-of-thought reasoning, still arrives at the correct answer. Our quantitative results on ScienceQA further confirm that LLaVA-Phi excels in these types of question-answering tasks.
This paper is available on arxiv under CC BY 4.0 DEED license.