Authors:
(1) Qian Yang, Zhejiang University, Equal contribution. This work was conducted during Qian Yang’s internship at Alibaba Group;
(2) Jin Xu, Alibaba Group, Equal contribution;
(3) Wenrui Liu, Zhejiang University;
(4) Yunfei Chu, Alibaba Group;
(5) Xiaohuan Zhou, Alibaba Group;
(6) Yichong Leng, Alibaba Group;
(7) Yuanjun Lv, Alibaba Group;
(8) Zhou Zhao, Alibaba Group (corresponding author: [email protected]);
(9) Yichong Leng, Zhejiang University;
(10) Chang Zhou, Alibaba Group (corresponding author: [email protected]);
(11) Jingren Zhou, Alibaba Group.
A Detailed Results of Foundation Benchmark
The results of LALMs are presented in Table 3. For the foundation benchmark, we also compare an exact-matching strategy with our proposed GPT-4 alignment strategy. For exact matching, we attempt to match patterns such as ‘B’, ‘B.’, and ‘B)’ against the LALMs’ hypotheses. The results are shown in Table 4. We find that BLSP and SALMONN achieve a high success rate in directly generating the choice, showcasing their strong ability to follow single-choice instructions. For the other models, however, it is challenging to precisely extract the predicted choice from their hypotheses due to significant variations in output formats. With the assistance of GPT-4 as the evaluator, the success rate for all models can be improved to 100%.
According to Table 3, Qwen-Audio-Chat and Qwen-Audio Turbo demonstrate superior performance on the foundation benchmark, surpassing other models across the speech, sound, and music domains. Behind these two models, PandaGPT and SALMONN also deliver noteworthy performances. On the chat benchmark, Qwen-Audio Turbo achieves the highest average score, followed by SALMONN and Qwen-Audio-Chat with scores of 6.11 and 6.08, respectively. Among these models, SALMONN outperforms the others in mixed audio understanding. Note that the speech dimension of the foundation benchmark covers tasks beyond speech transcription, such as speaker gender, age, and emotion prediction, whereas the speech dimension of the chat benchmark primarily revolves around speech transcription. Consequently, Whisper plus GPT-4 receives a relatively low score on the foundation benchmark but obtains the highest score on the chat benchmark.
Based on these results, we make several observations: 1) Existing LALMs have either limited audio understanding or limited instruction-following capabilities. For instance, Qwen-Audio Turbo achieves the highest average score on both the foundation and chat benchmarks, yet it shows weak proficiency in following single-choice instructions, often generating a full sentence semantically akin to one of the choices rather than the choice label itself, and thus receives a low success rate under exact matching. 2) For chat abilities related only to speech transcription, none of the models surpass the concatenative baseline Whisper plus GPT-4.
This paper is available on arxiv under CC BY 4.0 DEED license.