Authors:
(1) Abraham Owodunni, Intron Health, Masakhane, and this author contributed equally;
(2) Aditya Yadavalli, Karya, Masakhane, and this author contributed equally;
(3) Chris Emezue, Mila Quebec AI Institute, Lanfrica, Masakhane, and this author contributed equally;
(4) Tobi Olatunji, Intron Health and Masakhane, and this author contributed equally;
(5) Clinton Mbataku, AI Saturdays Lagos.
4 What information does AccentFold capture?
5 Empirical study of AccentFold
6 Conclusion, Limitations, and References
Despite advancements in speech recognition, accented speech remains challenging. While previous approaches have focused on modeling techniques or creating accented speech datasets, gathering sufficient data for the multitude of accents, particularly in the African context, remains impractical due to their sheer diversity and associated budget constraints. To address these challenges, we propose AccentFold, a method that exploits spatial relationships between learned accent embeddings to improve downstream Automatic Speech Recognition (ASR). Our exploratory analysis of speech embeddings representing 100+ African accents reveals interesting spatial accent relationships that highlight geographic and genealogical similarities and capture consistent phonological and morphological regularities, all learned empirically from speech. Furthermore, we discover accent relationships previously uncharacterized by the Ethnologue. Through empirical evaluation, we demonstrate the effectiveness of AccentFold by showing that, for out-of-distribution (OOD) accents, sampling accent subsets for training based on AccentFold information outperforms strong baselines with a relative WER improvement of 4.6%. AccentFold presents a promising approach for improving ASR performance on accented speech, particularly in the context of African accents, where data scarcity and budget constraints pose significant challenges. Our findings emphasize the potential of leveraging linguistic relationships to improve zero-shot ASR adaptation to target accents. Please find our code for this work here.[1]
The English language is spoken in 88 countries and territories as an official, administrative, or cultural language. It has an estimated 2 billion or more speakers, with non-native speakers outnumbering native speakers by a ratio of 3:1.
Despite considerable advancements, automatic speech recognition (ASR) technology still faces challenges with accented speech (Yadavalli et al., 2022b; Szalay et al., 2022; Sanabria et al., 2023). Speakers whose first language (L1) is not English see high word error rates on their audio samples (DiChristofano et al., 2022). Koenecke et al. (2020) showed that existing ASR systems perform worse on speakers of African American Vernacular English (AAVE) than on speech from rural White Californians.
The dominant methods for improving speech recognition for accented speech have conventionally involved modeling techniques and algorithmic enhancements such as multitask learning (Jain et al., 2018; Zhang et al., 2021; Yadavalli et al., 2022a; Li et al., 2018), domain adversarial training (Feng et al., 2021; Li et al., 2021a), active learning (Chella Priyadharshini et al., 2018), and weak supervision (Khandelwal et al., 2020). Despite some progress, ASR performance still degrades significantly for out-of-distribution (OOD) accents, making the application of these techniques in real-world scenarios challenging. To enhance generalizability, datasets that incorporate accented speech have been developed (Ardila et al., 2019; Sanabria et al., 2023). However, given the sheer number of accents, it is currently infeasible to obtain enough data to cover each distinct accent comprehensively.
In contrast, there has been relatively less focus on exploring linguistic aspects and accent relationships and on harnessing that knowledge to enhance ASR performance. Previous research in language modeling (Nzeyimana and Rubungo, 2022), intent classification (Sharma et al., 2021) and speech recognition (Toshniwal et al., 2018; Li et al., 2021b; Jain et al., 2023) has demonstrated that incorporating linguistic information in NLP tasks generally yields downstream improvements, especially for languages with limited resources and restricted data availability, a situation pertinent to African languages. Consequently, we opine that a deeper understanding of geographical and linguistic similarities among different accents, encompassing syntactic, phonological, and morphological aspects, can potentially enhance ASR for accented speech.
We believe embeddings offer a principled and quantitative approach to investigate linguistic, geographic and other global connections (Mikolov et al., 2013; Garg et al., 2018), and form the framework of our paper. Our contribution involves the development of AccentFold, a network of learned accent embeddings through which we explore possible linguistic and geographic relationships among African accents. We report the insights from our linguistic analysis in Section 4.
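As a rough illustration of how spatial relationships between accent embeddings could be queried, the sketch below ranks accents by cosine similarity to a target accent. It is a minimal sketch under stated assumptions, not the authors' implementation: this section does not specify the similarity measure or the selection procedure, so the choice of cosine similarity, the function name `nearest_accents`, and the toy three-dimensional vectors are all hypothetical.

```python
import numpy as np


def nearest_accents(embeddings, target, k=3):
    """Return the k accents most cosine-similar to `target`, excluding itself.

    `embeddings` maps an accent name to a 1-D vector. Cosine similarity is
    an assumption made for this sketch, not a detail taken from the paper.
    """
    names = [name for name in embeddings if name != target]
    t = np.asarray(embeddings[target], dtype=float)
    t = t / np.linalg.norm(t)  # unit-normalise the target vector

    scores = []
    for name in names:
        v = np.asarray(embeddings[name], dtype=float)
        scores.append(float(v @ t / np.linalg.norm(v)))

    # Indices of the k highest similarity scores, best first
    order = np.argsort(scores)[::-1][:k]
    return [(names[i], scores[i]) for i in order]


# Toy vectors invented for illustration only (not real accent embeddings)
toy = {
    "accent_a": [1.0, 0.0, 0.0],
    "accent_b": [0.9, 0.1, 0.0],
    "accent_c": [0.0, 1.0, 0.0],
    "accent_d": [0.7, 0.7, 0.0],
}
neighbours = nearest_accents(toy, "accent_a", k=2)
```

In a setting like the one the abstract describes, a ranking of this kind could be one way to pick which accents' training data to sample for an unseen target accent.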
This paper is available on arXiv under the CC BY-SA 4.0 DEED license.
[1] https://github.com/intron-innovation/accent_folds