
Evaluating Audio Processing in Large Audio-Language Models


Too Long; Didn't Read

This section reviews benchmarks for audio processing, covering datasets such as LibriSpeech and FLEURS, and introduces AIR-Bench, the first generative evaluation benchmark for large audio-language models, with an emphasis on their instruction-following capabilities.

Authors:

(1) Qian Yang, Zhejiang University, Equal contribution. This work was conducted during Qian Yang’s internship at Alibaba Group;

(2) Jin Xu, Alibaba Group, Equal contribution;

(3) Wenrui Liu, Zhejiang University;

(4) Yunfei Chu, Alibaba Group;

(5) Xiaohuan Zhou, Alibaba Group;

(6) Yichong Leng, Alibaba Group;

(7) Yuanjun Lv, Alibaba Group;

(8) Zhou Zhao, Alibaba Group, corresponding author ([email protected]);

(9) Yichong Leng, Zhejiang University;

(10) Chang Zhou, Alibaba Group, corresponding author ([email protected]);

(11) Jingren Zhou, Alibaba Group.

Abstract and 1. Introduction

2 Related Work

3 AIR-Bench and 3.1 Overview

3.2 Foundation Benchmark

3.3 Chat Benchmark

3.4 Evaluation Strategy

4 Experiments

4.1 Models

4.2 Main Results

4.3 Human Evaluation and 4.4 Ablation Study of Positional Bias

5 Conclusion and References

A Detailed Results of Foundation Benchmark

Benchmarks for Audio Processing. Previous studies have primarily focused on evaluating specific fundamental capabilities of models. In the field of speech processing, automatic speech recognition is one of the most popular tasks, with representative benchmarks including LibriSpeech (Panayotov et al., 2015), Common Voice (Ardila et al., 2019), and FLEURS (Conneau et al., 2022). Additionally, various benchmarks cover other speech processing tasks such as speech-to-text translation (Wang et al., 2020a,b; Jia et al., 2022) and emotion recognition (Cao et al., 2014; Livingstone and Russo, 2018). In the field of sound processing, several benchmarks have emerged, such as Clotho (Drossos et al., 2020) and AudioCaps (Kim et al., 2019a) for automatic audio captioning, and AVQA (Yang et al., 2022) for sound question answering. In the domain of music processing, numerous datasets are available, including MusicCaps (Agostinelli et al., 2023) for automatic music captioning and MUSIC-AVQA (Li et al., 2022) for music question answering. Note that most existing question answering benchmarks, such as Clotho-AQA, AVQA, and MUSIC-AVQA, adopt highly constrained answer formats for ease of close-ended evaluation or conversion into classification tasks, rather than supporting open-ended generation.
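To make that distinction concrete, the sketch below (not taken from any of the cited benchmarks; the function names and the `judge` callable are illustrative assumptions) contrasts close-ended scoring, where a fixed answer set allows plain accuracy, with open-ended scoring, which requires a judge to rate free-form text.

```python
# Illustrative sketch (not from the paper): the practical difference between
# scoring a constrained-answer QA benchmark and scoring open-ended generation.

def score_close_ended(prediction: str, choices: list[str], answer: str) -> float:
    """Constrained answer set: map the prediction onto a fixed label and use exact match."""
    # Hypothetical normalization: take the first choice mentioned in the prediction, if any.
    predicted = next((c for c in choices if c.lower() in prediction.lower()), None)
    return float(predicted == answer)

def score_open_ended(prediction: str, reference: str, judge) -> float:
    """Free-form answer: no fixed label set exists, so a judge (human raters or a
    strong LLM) must rate the generated answer against a reference."""
    # `judge` is an assumed callable returning a score in [0, 1].
    return judge(prediction=prediction, reference=reference)
```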


Besides the aforementioned datasets that focus on specific tasks, there are benchmarks such as SUPERB (Yang et al., 2021b) and HEAR (Turian et al., 2022) for comprehensive evaluation of self-supervised learning models. When it comes to assessing the ability of LALMs to follow instructions, Dynamic-SUPERB (Huang et al., 2023) is the only benchmark dedicated to this aspect; however, it focuses on human speech processing and does not cover open-ended dialogue generation. In contrast, AIR-Bench is the first large-scale generative evaluation benchmark for large audio-language models, encompassing various audio types such as speech, natural sounds, and music.


Large Audio-Language Models following Human Instruction. Recently, there has been significant interest in instruction-following, end-to-end audio-language models for human-audio interaction. Several models have emerged, each focusing on different audio domains. For instance, some models focus specifically on speech processing, such as SpeechGPT (Zhang et al., 2023), BLSP (Wang et al., 2023a), and LLaSM (Shu et al., 2023). Similarly, there are models tailored for sound processing, like LTU (Gong et al., 2023), and for music processing, such as LLark (Gardner et al., 2023). In contrast, SALMONN (Tang et al., 2023b) and Qwen-Audio (Chu et al., 2023) are trained on various audio types, showcasing strong universal audio understanding abilities. However, these models are evaluated on different fundamental tasks, making a fair comparison difficult. Furthermore, they rely on showcased examples or public demos to demonstrate their conversational skills and do not perform rigorous experiments to evaluate their instruction-following abilities. To address these issues, this paper introduces AIR-Bench, which comprises two benchmarks, the foundation benchmark and the chat benchmark, enabling a fair comparison of the models' foundational abilities and their high-level instruction-following capabilities, respectively.


Figure 1: Overview of AIR-Bench. AIR-Bench covers two ability dimensions, foundation and chat, across audio types such as speech, sound, and music. The foundation dimension comprises 19 distinct leaf abilities, each assessed with a single-choice question format. The chat dimension assesses abilities through an open-ended question-and-answer format, incorporating diverse audio sources and mixed audio.


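As a rough illustration of how the two dimensions differ in practice, the sketch below shows hypothetical item schemas for a single-choice foundation question and an open-ended chat instruction. The field names and example data are assumptions made for illustration, not AIR-Bench's actual data format.

```python
# Hypothetical item schemas, for illustration only; field names and example data
# are assumptions, not AIR-Bench's released format.
from dataclasses import dataclass


@dataclass
class FoundationItem:
    """Foundation benchmark: one audio clip, a question, fixed choices, one gold label."""
    audio_path: str
    question: str
    choices: list[str]   # e.g., ["A. happy", "B. sad", "C. angry", "D. neutral"]
    answer: str          # gold choice label, scored by exact match


@dataclass
class ChatItem:
    """Chat benchmark: an audio clip (possibly mixed sources) plus an open-ended
    instruction; the model's free-form reply is rated by a judge rather than matched."""
    audio_path: str
    instruction: str
    reference_answer: str  # grounding for the judge, not a string to match exactly


# Hypothetical examples.
foundation_example = FoundationItem(
    audio_path="speech_clip.wav",
    question="What emotion does the speaker convey?",
    choices=["A. happy", "B. sad", "C. angry", "D. neutral"],
    answer="B",
)
chat_example = ChatItem(
    audio_path="mixed_scene.wav",
    instruction="Describe what is happening in this audio and what might happen next.",
    reference_answer="Rain and distant thunder suggest an approaching storm.",
)
```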


This paper is available on arxiv under CC BY 4.0 DEED license.