In this study, researchers present an end-to-end general-purpose any-to-any MM-LLM system called NExT-GPT.
(1) Shengqiong Wu, NExT++, School of Computing, National University of Singapore;

(2) Hao Fei, NExT++, School of Computing, National University of Singapore (corresponding author: [email protected]);

(3) Leigang Qu, NExT++, School of Computing, National University of Singapore;

(4) Wei Ji, NExT++, School of Computing, National University of Singapore;

(5) Tat-Seng Chua, NExT++, School of Computing, National University of Singapore.

7 Conclusion

In this work, we present NExT-GPT, an end-to-end general-purpose any-to-any multimodal Large Language Model (MM-LLM). By connecting an LLM with multimodal adaptors and different diffusion decoders, NExT-GPT can perceive inputs and generate outputs in any combination of text, images, videos, and audio. Because it builds on existing well-trained, high-performing encoders and decoders, training NExT-GPT requires updating only a small fraction of parameters (1%), those of certain projection layers, which not only keeps costs low but also makes it convenient to extend the system to more modalities in the future. To equip NExT-GPT with complex cross-modal semantic understanding and content generation, we introduce modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for it. Overall, our research showcases the potential of any-to-any MM-LLMs to bridge the gap between modalities and pave the way for more human-like AI systems in the future.
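The parameter-efficiency claim above can be made concrete with a small back-of-the-envelope calculation. The sketch below is only illustrative: the component names and parameter counts are hypothetical placeholders rather than figures from the paper, and for simplicity it treats everything except the projection layers as frozen. It shows how the trainable share of a system is obtained when large pretrained modules are kept fixed.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    num_params: int
    trainable: bool


def trainable_fraction(components):
    """Return the share of all parameters that would receive gradient updates."""
    total = sum(c.num_params for c in components)
    if total == 0:
        raise ValueError("components contain no parameters")
    trainable = sum(c.num_params for c in components if c.trainable)
    return trainable / total


# Hypothetical sizes chosen for illustration only; they are NOT the paper's numbers.
components = [
    Component("modality_encoders", 1_000_000_000, trainable=False),
    Component("input_projection", 30_000_000, trainable=True),
    Component("llm_backbone", 7_000_000_000, trainable=False),
    Component("output_projection", 30_000_000, trainable=True),
    Component("diffusion_decoders", 2_000_000_000, trainable=False),
]

fraction = trainable_fraction(components)
print(f"trainable share: {fraction:.2%}")
```

With placeholder sizes like these, the projection layers account for well under 1% of all parameters, which is the kind of budget the text describes.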

Limitations and Future Work. There are at least the following four avenues to explore in future work.

i) Modalities & Tasks Expansion: Due to resource limitations, our system currently supports input and output in four modalities: language, images, videos, and audio. Next, we plan to extend it to more modalities (e.g., web pages, 3D vision, heat maps, tables and figures) and tasks (e.g., object detection, segmentation, grounding, and tracking), making the system more broadly applicable.

ii) LLM Variants: Currently, we have implemented only the 7B Vicuna version of the LLM. We next plan to incorporate LLMs of various types and sizes, allowing practitioners to choose the one most suitable for their requirements.

iii) Multimodal Generation Strategies: While our system excels at generating content across modalities, the quality of its outputs can sometimes be limited by the capabilities of the diffusion models. Integrating retrieval-based approaches to complement the generative process is a promising direction that could improve overall system performance.

iv) MosIT Dataset Expansion: Our instruction tuning (IT) dataset currently has room for expansion. We intend to significantly increase the amount of annotated data, providing a more comprehensive and diverse set of instructions to further enhance MM-LLMs’ ability to understand and follow user prompts.
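As one way to picture the retrieval-based direction in item iii), the following sketch compares a generated sample against items from a retrieval bank and keeps whichever matches the query better. This is not a method from the paper: the function names, the use of cosine similarity over embeddings, and the selection rule are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_best(query_embedding, bank):
    """Return (item_id, similarity) for the bank entry closest to the query.

    `bank` is a hypothetical list of (item_id, embedding) pairs.
    """
    if not bank:
        raise ValueError("retrieval bank is empty")
    scored = [(item_id, cosine_similarity(query_embedding, emb)) for item_id, emb in bank]
    return max(scored, key=lambda pair: pair[1])


def choose_output(query_embedding, generated_embedding, generated_id, bank):
    """Keep the generated sample unless a retrieved item matches the query better."""
    generated_score = cosine_similarity(query_embedding, generated_embedding)
    item_id, retrieved_score = retrieve_best(query_embedding, bank)
    if retrieved_score > generated_score:
        return item_id, "retrieved"
    return generated_id, "generated"


# Toy two-dimensional embeddings, purely illustrative.
query = [1.0, 0.0]
bank = [("img_a", [0.0, 1.0]), ("img_b", [0.9, 0.1])]
choice = choose_output(query, [0.6, 0.8], "gen_0", bank)
print(choice)
```

In a real system the embeddings would come from a shared cross-modal encoder, and the two sources might be blended rather than chosen between; the hard switch here is just the simplest rule to state.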

Figure 6: Example of Text+Image → Text+Audio.

Figure 7: Example of Text → Text+Image+Video+Audio.

Figure 8: Example of Text+Image → Text+Image+Video+Audio.

Figure 9: Example of Text+Video → Text+Image.

Figure 10: Example of Text+Audio → Text+Image+Video.

Figure 11: Example of Text+Video → Text+Audio.


This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.