AccentFold: Enhancing Accent Recognition - Related Work

Authors:
(1) Abraham Owodunni, Intron Health, Masakhane, and this author contributed equally;
(2) Aditya Yadavalli, Karya, Masakhane, and this author contributed equally;
(3) Chris Emezuem, Mila Quebec AI Institute, Lanfrica, Masakhane, and this author contributed equally;
(4) Tobi Olatunji, Intron Health and Masakhane, and this author contributed equally;
(5) Clinton Mbataku, AI Saturdays Lagos.

Table of Links
Abstract and 1 Introduction
2 Related Work
3 AccentFold
4 What information does AccentFold capture?
5 Empirical study of AccentFold
6 Conclusion, Limitations, and References

2 Related Work

Probing existing state-of-the-art pre-trained models for linguistic information, and using that information to improve model performance, has recently gained interest in the community. Prasad and Jyothi (2020) apply various probing techniques to the DeepSpeech 2 model (Amodei et al., 2015) and find that the first few layers encode most of the accent-related information. Bartelds and Wieling (2022) quantify language variation in Dutch using a combination of XLSR-53 (Conneau et al., 2020) embeddings and Dynamic Time Warping (Sakoe and Chiba, 1978). They show that, with just six seconds of speech, this yields a Dutch dialect identification system that outperforms one that depends on phonetic transcriptions. This indicates that pre-trained models such as the one proposed by Conneau et al. (2020) do capture rich linguistic information in their representations.

Jain et al. (2018) and Li et al. (2021a) extract accent embeddings learned by a separate network and feed those embeddings to the model along with other features. They show that this leads to a superior accented ASR model. Our work is most closely related to Kothawade et al. (2023), where the authors explore statistical methods such as Submodular Mutual Information, in combination with hand-crafted features, to select a subset of data that improves accented ASR.
Our work differs from previous work in two important ways: (1) we use accent embeddings extracted from a pre-trained model to decide which subset of data to use to build an ASR system that performs best on a target accent in a cost-effective manner; and (2) we do this at a much larger scale of 41 African English accents. The previous highest was 21 English accents, by Li et al. (2021a).

This paper is available on arXiv under the CC BY-SA 4.0 DEED license.