Authors:
(1) Shengqiong Wu, NExT++, School of Computing, National University of Singapore;
(2) Hao Fei, NExT++, School of Computing, National University of Singapore, corresponding author ([email protected]);
(3) Leigang Qu, NExT++, School of Computing, National University of Singapore;
(4) Wei Ji, NExT++, School of Computing, National University of Singapore;
(5) Tat-Seng Chua, NExT++, School of Computing, National University of Singapore.
We quantify the generation quality of NExT-GPT on benchmark datasets under several common settings: text-to-X generation, X-to-text generation, and text-conditioned modality editing. Each task is simulated as a single turn of interaction between the user and the model; a minimal sketch of this single-turn protocol is given below.
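Concretely, the single-turn protocol can be pictured as wrapping each benchmark sample into one user instruction and taking the model's first response as the prediction, with no follow-up turns. The sketch below is a hypothetical illustration only: the prompt templates and the `model.generate` interface are assumptions, not the actual NExT-GPT API.

```python
# Minimal sketch of the single-turn evaluation protocol described above.
# The prompt wording and the model handle are hypothetical placeholders.

def build_single_turn_prompt(sample: dict, task: str) -> str:
    """Wrap one benchmark sample as a single user turn."""
    if task == "text-to-image":
        return f"Please generate an image that shows: {sample['caption']}"
    if task == "image-to-text":
        return "Please describe the attached image in one sentence."
    if task == "text-conditioned-editing":
        return f"Please edit the attached image so that: {sample['instruction']}"
    raise ValueError(f"unknown task: {task}")

# Hypothetical usage: one prompt in, one model response out.
# response = model.generate(prompt=build_single_turn_prompt(sample, "text-to-image"),
#                           attachments=sample.get("inputs"))
```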
• ‘Text’ → ‘X’ Generation covers the most common text-conditioned modality-synthesis tasks. Tables 3, 4, and 5 compare NExT-GPT with state-of-the-art systems. Overall, NExT-GPT performs on par with the best-performing baselines (an illustrative image-text scoring sketch is given after this list).
• ‘X’ → ‘Text’ Generation covers modality-captioning tasks. Tables 6, 7, and 8 report results on the different tasks. Overall, NExT-GPT mostly achieves much better X-to-text performance than the CoDi baseline, because the text is generated directly by the LLM, a task at which the LLM inherently excels (a captioning-metric sketch follows this list).
• ‘Text+X’ → ‘X’ Generation covers text-conditioned modality-editing tasks. Tables 9, 10, and 11 report the performance on the different tasks. Compared with the two task types above, NExT-GPT is not as dominant on text-conditioned editing, yet it still delivers competitive performance.
• Human Evaluation on Complex Any-to-any QA. We also evaluate scenarios with more complicated cross-modal interactions between inputs and outputs, comparing model performance across different modality conversions. Because no standard benchmark can be leveraged here, we adopt human evaluation: several evaluators score NExT-GPT's outputs on a scale from 1 to 10 (a simple score-aggregation sketch is given below). Figure 5 shows the comparisons. We find that NExT-GPT is more competent at producing images than videos or audio. Generating mixed combinations of multimodal content is also slightly inferior to single-modality generation, owing to the greater complexity of mixed-modality outputs.
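For the ‘Text’ → ‘X’ direction, Tables 3, 4, and 5 report standard automatic metrics that are not reproduced here. As a hedged illustration of how a generated image can be scored against its conditioning text, the sketch below computes a CLIP image-text similarity with Hugging Face Transformers; it is an illustrative proxy, not the exact metric pipeline behind those tables.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def clip_similarity(image_path: str, prompt: str) -> float:
    """Score how well a generated image matches its conditioning text."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image is the temperature-scaled cosine similarity
    # between the image embedding and the text embedding.
    return outputs.logits_per_image.item()
```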
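For the ‘X’ → ‘Text’ direction, captioning quality is commonly measured by n-gram overlap with reference captions. The following is a minimal corpus-level BLEU sketch using NLTK; it assumes pre-collected reference and generated captions and is not the exact evaluation script behind Tables 6, 7, and 8.

```python
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu

def caption_bleu(references: list[list[str]], hypotheses: list[str]) -> float:
    """references[i]: list of ground-truth captions for sample i;
    hypotheses[i]: the model's generated caption for sample i."""
    refs_tok = [[ref.lower().split() for ref in refs] for refs in references]
    hyps_tok = [hyp.lower().split() for hyp in hypotheses]
    # Smoothing avoids zero scores when a higher-order n-gram never matches.
    return corpus_bleu(refs_tok, hyps_tok,
                       smoothing_function=SmoothingFunction().method1)
```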
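For the human evaluation, each evaluator assigns a 1-10 score per modality-conversion setting. A minimal sketch of averaging such ratings per setting is shown below; the rating tuple format is an assumption made for illustration.

```python
from collections import defaultdict
from statistics import mean

def aggregate_scores(ratings: list[tuple[str, int]]) -> dict[str, float]:
    """ratings: (conversion setting, 1-10 score) pairs from individual evaluators,
    e.g. ("text+image -> image+audio", 8). Returns the mean score per setting."""
    by_setting = defaultdict(list)
    for setting, score in ratings:
        by_setting[setting].append(score)
    return {setting: mean(scores) for setting, scores in by_setting.items()}
```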
To demonstrate the effectiveness and potential of NExT-GPT as a human-like conversational agent, we further offer compelling examples that vividly illustrate the system's capacity to comprehend and reason over content across modalities in any combination. Figures 6, 7, 8, 9, 10, and 11 show examples from NExT-GPT. Please visit the project page for more examples and for the dynamic video and audio content.
This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.