
NExT-GPT: Any-to-Any Multimodal LLM: Any-to-any Multimodal Generation


Too Long; Didn't Read

In this study, researchers present an end-to-end general-purpose any-to-any MM-LLM system called NExT-GPT.

Authors:

(1) Shengqiong Wu, NExT++, School of Computing, National University of Singapore;

(2) Hao Fei, NExT++, School of Computing, National University of Singapore, corresponding author ([email protected]);

(3) Leigang Qu, NExT++, School of Computing, National University of Singapore;

(4) Wei Ji, NExT++, School of Computing, National University of Singapore;

(5) Tat-Seng Chua, NExT++, School of Computing, National University of Singapore.

6 Experiments

6.1 Any-to-any Multimodal Generation

We quantify the generation quality of NExT-GPT on benchmark datasets under several common settings: text-to-X generation, X-to-text generation, and text-conditioned modality editing. For each task we mimic the benchmark setup by taking a single turn of interaction between the user and the model.
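For concreteness, the single-turn protocol can be pictured as the simple loop below. This is only a sketch: the model interface (`model.generate`) and the metric callables are hypothetical placeholders, not the released NExT-GPT API.

```python
# Minimal sketch of the single-turn evaluation protocol described above.
# The model interface and metric callables are hypothetical placeholders,
# not the released NExT-GPT API.

def evaluate_single_turn(model, benchmark, metric):
    """Score one-turn generations of a given modality against references."""
    scores = []
    for sample in benchmark:
        # One turn only: a single prompt (plus optional modality inputs) in,
        # a single multimodal response out.
        output = model.generate(prompt=sample["prompt"], inputs=sample.get("inputs"))
        scores.append(metric(output, sample["reference"]))
    return sum(scores) / len(scores)

# Usage (hypothetical names):
#   score = evaluate_single_turn(model, caption_benchmark, caption_metric)  # image -> text
```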


• ‘Text’ → ‘X’ Generation covers the most common text-conditioned modal synthesis tasks. Tables 3, 4 and 5 compare our system with state-of-the-art baselines. Overall, NExT-GPT performs on par with the best-performing baselines.


• ‘X’ → ‘Text’ Generation covers the modal captioning tasks. Tables 6, 7 and 8 show the results on the different tasks. Overall, NExT-GPT mostly achieves much better performance than the CoDi baseline on X-to-text generation, since the text is produced directly by the LLM, a task at which the LLM is inherently expert.


• ‘Text+X’ → ‘X’ Generation covers the text-conditioned modal editing tasks. Tables 9, 10 and 11 show the performance on the different tasks. Compared with the two task types above, NExT-GPT is less dominant on text-conditioned modal editing, yet it still delivers competitive performance.

Figure 5: Comparative performance of NExT-GPT on various complex cross-modal conversions.


• Human Evaluation on Complex Any-to-any QA. We also evaluate more challenging scenarios involving complicated cross-modal interactions between inputs and outputs, comparing model performance across different modality conversions. As no standard benchmark exists for these settings, we adopt human evaluation: several evaluators score NExT-GPT's outputs on a scale from 1 to 10. Figure 5 shows the comparisons. We find that NExT-GPT is more competent at producing images than videos or audio. In addition, generating mixed combinations of multimodal content is slightly inferior to generating single-modal content, owing to the complexity of the former.
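A minimal sketch of how such 1-10 ratings can be aggregated per conversion setting is shown below; the data layout and setting names are illustrative assumptions, not the authors' exact protocol.

```python
from collections import defaultdict
from statistics import mean

def summarize_human_eval(ratings):
    """Average 1-10 human ratings per modality-conversion setting.

    `ratings` is an iterable of (setting, score) pairs collected from the
    evaluators; this layout is an illustrative assumption.
    """
    by_setting = defaultdict(list)
    for setting, score in ratings:
        by_setting[setting].append(score)
    return {setting: mean(scores) for setting, scores in by_setting.items()}

# Usage with placeholder scores (not the figures reported in the paper):
#   summarize_human_eval([("text -> image", 8), ("text -> image+audio", 6)])
```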


6.2 Example Demonstrations

To demonstrate the effectiveness and potential of the proposed NExT-GPT in developing human-like conversational agents, we further offer compelling examples that vividly illustrate the system's capacity to comprehend and reason over content across various modalities in any combination. Figures 6, 7, 8, 9, 10 and 11 show examples from NExT-GPT. Visit the project page for more examples and to access the dynamic video and audio content.


This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.