Authors:
(1) Aviad Rom, The Data Science Institute, Reichman University, Herzliya, Israel;
(2) Kfir Bar, The Data Science Institute, Reichman University, Herzliya, Israel.

Table of Links
Abstract and Introduction
Related Work
Methodology
Experimental Settings
Results
Conclusion and Limitations
Bibliographical References

2. Related Work

K et al. (2020) have suggested that structural similarity between languages is essential for a language model's multilingual generalization capabilities. Their suggestion was further discussed by Dufter and Schütze (2020), who highlighted the essential components for a model to possess "multilinguality" and showed that the order of the words in the sentence is key to the model's cross-lingual generalization capabilities.

mBERT, as introduced by Devlin et al. (2019), was a pioneering language model that encompassed multiple languages, including Arabic and Hebrew. However, both Arabic and Hebrew are significantly under-represented in its pre-training data, resulting in inferior performance compared to the equivalent monolingual models on various downstream tasks (Antoun et al., 2020; Lan et al., 2020; Chriqui and Yahav, 2022; Seker et al., 2022). GigaBERT (Lan et al., 2020) is another multilingual model, trained for English and Arabic. However, the best results for most of the known NLP tasks are typically achieved by one of the large monolingual models in both Arabic and Hebrew. CAMeLBERT (Inoue et al., 2021) is one of those models. It is trained on texts written in Modern Standard Arabic (MSA), Classical Arabic, and dialectal variants. Among Hebrew language models, AlephBERT (Seker et al., 2022) stands out as one of the leading performers, alongside others such as HeBERT (Chriqui and Yahav, 2022). Among other datasets, the monolingual models mentioned above use the relevant parts of the OSCAR dataset (Ortiz Suárez et al., 2020) for training.
Our model relies solely on the OSCAR dataset for both Hebrew and Arabic, resulting in a considerably smaller total number of words for each language compared with the existing monolingual language models.

The effect of transliteration on cross-lingual generalization was discussed previously by Dhamecha et al. (2021) and Chau and Smith (2021), and more recently by Moosa et al. (2023) and Purkayastha et al. (2023). None of these works studies the languages of our focus, Arabic and Hebrew. Dhamecha et al. (2021) focused on languages from the Indo-Aryan family, which has been studied before for cross-lingual generalization and also has several publicly available multilingual models. To the best of our knowledge, our work is the first to study generalization between Arabic and Hebrew, and no multilingual masked language model that includes both languages has been published apart from mBERT.

Chau and Smith (2021) address generalization from high- to low-resourced languages. However, both Arabic and Hebrew are currently considered medium- to high-resourced languages. Furthermore, their evaluation focuses solely on token-level classification tasks, such as dependency parsing and part-of-speech tagging, whereas our evaluation targets machine translation, a sequence-to-sequence bilingual task.

Purkayastha et al. (2023) employ Romanization for transliteration, whereas we transliterate Arabic into the Hebrew script. As with Chau and Smith (2021), their evaluation centers on token-level classification tasks, which are not addressed in our work.

This paper is available on arXiv under CC BY 4.0 DEED license.
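
To make the contrast with Romanization concrete, the following is a minimal sketch of what script-to-script transliteration from Arabic into Hebrew letters could look like. The table `ARABIC_TO_HEBREW` and the function `transliterate` are hypothetical names introduced here for illustration; the table uses only the traditional cognate-letter correspondences between the two alphabets and is an assumption, not the mapping used by the authors. It ignores Hebrew final letter forms, diacritics, and Arabic letters that have no single Hebrew counterpart.

```python
# Illustrative assumption: a one-to-one table of cognate letters.
# This is NOT the authors' transliteration scheme.
ARABIC_TO_HEBREW = {
    "ا": "א", "ب": "ב", "ج": "ג", "د": "ד", "ه": "ה", "و": "ו",
    "ز": "ז", "ح": "ח", "ط": "ט", "ي": "י", "ك": "כ", "ل": "ל",
    "م": "מ", "ن": "נ", "س": "ס", "ع": "ע", "ف": "פ", "ص": "צ",
    "ق": "ק", "ر": "ר", "ش": "ש", "ت": "ת",
}


def transliterate(text: str) -> str:
    """Replace each mapped Arabic letter with its Hebrew counterpart.

    Characters without an entry in the table (digits, punctuation,
    Latin letters, unmapped Arabic letters) are passed through unchanged.
    """
    return "".join(ARABIC_TO_HEBREW.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # The Arabic word for "peace" rendered letter by letter in Hebrew script.
    print(transliterate("سلام"))
```

A realistic scheme would also need rules for letters outside this table and for word-final forms, which is why a plain lookup like this should be read only as an illustration of the idea.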

HeArBERT: A Bilingual Model for Arabic-Hebrew Translation Using Transliteration
