The rapid advancements in Large Language Models (LLMs) have significantly propelled the development of vision-language models built on LLMs. These models, a departure from the capabilities of the pre-LLM era, are equipped with advanced question-answering and visual comprehension skills, a progress enabled by using LLMs as language encoding modules. Notable research in this domain includes the LLaVA family [24, 25, 26, 32], the BLIP family [8, 20], MiniGPT-4 [37], and others, each demonstrating significant advances in handling visual-centric dialogues. However, a common limitation of these open-source Vision-Language Models (VLMs) is their substantial computational demand, with parameter counts typically ranging from 7B to 65B. This requirement poses challenges for deployment on edge or mobile devices, especially in real-time applications. Gemini [33], a leader in this field, has released three versions of vision-language models, including the compact Gemini-Nano with 1.8B/3.25B parameters tailored for smartphones; however, their models and data are not open-sourced. Another initiative, MobileVLM [6], has developed MobileLLaMA with 2.7B parameters to facilitate smaller vision-language models. Our paper explores and demonstrates the effectiveness of integrating vision-language models with open-source, smaller language models, assessing their potential and efficiency across a variety of applications.
This paper is available on arXiv under the CC BY 4.0 DEED license.
Authors:
(1) Yichen Zhu, Midea Group;
(2) Minjie Zhu, Midea Group and East China Normal University;
(3) Ning Liu, Midea Group;
(4) Zhicai Ou, Midea Group;
(5) Xiaofeng Mou, Midea Group.