The paper presents AIR-Bench, the first generative evaluation benchmark designed for large audio-language models (LALMs). It spans 19 audio tasks and more than 21,000 questions: over 19k single-choice questions in the foundation benchmark and over 2k open-ended questions in the chat benchmark. It also introduces a novel audio mixing strategy for simulating real-world scenarios and a standardized, reproducible evaluation framework for assessing model output. The authors plan to launch and maintain a community leaderboard for consistent performance comparison over time.
(1) Qian Yang, Zhejiang University, Equal contribution. This work was conducted during Qian Yang’s internship at Alibaba Group;

(2) Jin Xu, Alibaba Group, Equal contribution;

(3) Wenrui Liu, Zhejiang University;

(4) Yunfei Chu, Alibaba Group;

(5) Xiaohuan Zhou, Alibaba Group;

(6) Yichong Leng, Alibaba Group;

(7) Yuanjun Lv, Alibaba Group;

(8) Zhou Zhao, Alibaba Group, corresponding author ([email protected]);

(9) Yichong Leng, Zhejiang University;

(10) Chang Zhou, Alibaba Group, corresponding author ([email protected]);

(11) Jingren Zhou, Alibaba Group.

Abstract and 1. Introduction

2 Related Work

3 AIR-Bench and 3.1 Overview

3.2 Foundation Benchmark

3.3 Chat Benchmark

3.4 Evaluation Strategy

4 Experiments

4.1 Models

4.2 Main Results

4.3 Human Evaluation and 4.4 Ablation Study of Positional Bias

5 Conclusion and References

A Detailed Results of Foundation Benchmark

5 Conclusion

In this paper, we present AIR-Bench, the first generative evaluation benchmark designed specifically for audio-language models. AIR-Bench comprises 19 audio tasks with over 19k single-choice questions in the foundation benchmark, as well as over 2k open-ended audio questions in the chat benchmark. Notably, the benchmark covers diverse audio types such as speech, natural sounds, and music. We also propose a novel audio mixing strategy to simulate audio from real-world scenarios more accurately. A standardized, objective, and reproducible evaluation framework is employed to automatically assess the quality of hypotheses generated by LALMs. We conduct a thorough evaluation of 9 prominent open-source LALMs. Additionally, we plan to launch and maintain a leaderboard that will serve as a platform for the community to access and compare model performance consistently over time.


This paper is available on arXiv under the CC BY 4.0 DEED license.