
HeArBERT: A Bilingual Model for Arabic-Hebrew Translation Using Transliteration

by Morphology, September 10th, 2024

Too Long; Didn't Read

HeArBERT is a new bilingual Arabic-Hebrew language model that uses Arabic transliterated into Hebrew script, achieving notable improvements in machine translation. This approach contrasts with previous models like mBERT, which face limitations due to data representation and script differences. The study highlights the advantages of script unification and addresses gaps in existing research on transliteration effects in bilingual NLP tasks.

Authors:

(1) Aviad Rom, The Data Science Institute, Reichman University, Herzliya, Israel;

(2) Kfir Bar, The Data Science Institute, Reichman University, Herzliya, Israel.

Abstract and Introduction

Related Work

Methodology

Experimental Settings

Results

Conclusion and Limitations

Bibliographical References

K et al. (2020) suggested that structural similarity between languages is essential for a language model's multilingual generalization capabilities. Their suggestion was further discussed by Dufter and Schütze (2020), who highlighted the components essential for a model to possess “multilinguality” and showed that word order within a sentence is key to a model's cross-lingual generalization capabilities.


mBERT, introduced by Devlin et al. (2019), was a pioneering language model encompassing multiple languages, including Arabic and Hebrew. However, both Arabic and Hebrew are significantly under-represented in its pre-training data, resulting in inferior performance compared to equivalent monolingual models on various downstream tasks (Antoun et al., 2020; Lan et al., 2020; Chriqui and Yahav, 2022; Seker et al., 2022). GigaBERT (Lan et al., 2020) is another multilingual model, trained on English and Arabic. Still, the best results on most known NLP tasks in both Arabic and Hebrew are typically achieved by one of the large monolingual models. CAMeLBERT (Inoue et al., 2021) is one such model, trained on texts written in Modern Standard Arabic (MSA), Classical Arabic, and dialectal variants. Among Hebrew language models, AlephBERT (Seker et al., 2022) stands out as one of the leading performers, alongside others such as HeBERT (Chriqui and Yahav, 2022).


Among other datasets, the monolingual models mentioned above use the relevant parts of the OSCAR dataset (Ortiz Suárez et al., 2020) for training. Our model relies solely on the OSCAR dataset for both Hebrew and Arabic, resulting in a considerably smaller total number of words per language compared to the existing monolingual models.


The effect of transliteration on cross-lingual generalization was discussed previously by Dhamecha et al. (2021) and Chau and Smith (2021), and more recently by Moosa et al. (2023) and Purkayastha et al. (2023). None of these works study the languages of our focus: Arabic and Hebrew. Dhamecha et al. (2021) focused on languages from the Indo-Aryan family, which had been studied before for cross-lingual generalization and for which several multilingual models are publicly available. To the best of our knowledge, our work is the first to study generalization between Arabic and Hebrew, and no multilingual masked language models that include both languages have been published apart from mBERT.


Chau and Smith (2021) address generalization from high- to low-resource languages. However, both Arabic and Hebrew are currently considered medium- to high-resource languages. Furthermore, their evaluation focuses solely on token-level classification tasks, such as dependency parsing and part-of-speech tagging, whereas our evaluation targets machine translation, a sequence-to-sequence bilingual task.


Purkayastha et al. (2023) employ Romanization for transliteration, whereas we transliterate Arabic into the Hebrew script. Analogous to Chau and Smith (2021), their evaluation centers on token-level classification tasks, which are not addressed in our work.
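To make the idea of script unification concrete, the sketch below shows what a character-level Arabic-to-Hebrew transliteration might look like. This is a minimal illustration, not the paper's actual scheme: the mapping table follows common Arabic-Hebrew consonant correspondences (similar in spirit to Judeo-Arabic orthography), and the table, its coverage, and the function name are all assumptions for demonstration.

```python
# Illustrative sketch of transliterating Arabic into the Hebrew script.
# ASSUMPTION: this character table is a simplified stand-in based on
# well-known Arabic-Hebrew consonant correspondences; it is NOT the
# exact scheme used by HeArBERT.
AR_TO_HE = {
    "ا": "א", "ب": "ב", "ت": "ת", "ث": "ת",
    "ج": "ג", "ح": "ח", "خ": "ח", "د": "ד",
    "ذ": "ד", "ر": "ר", "ز": "ז", "س": "ס",
    "ش": "ש", "ص": "צ", "ض": "צ", "ط": "ט",
    "ظ": "ט", "ع": "ע", "غ": "ע", "ف": "פ",
    "ق": "ק", "ك": "כ", "ل": "ל", "م": "מ",
    "ن": "נ", "ه": "ה", "و": "ו", "ي": "י",
    "ة": "ה", "ء": "א", "أ": "א", "إ": "א", "آ": "א",
}

def transliterate(text: str) -> str:
    """Map each Arabic character to a Hebrew counterpart, leaving
    unmapped characters (spaces, digits, punctuation) unchanged."""
    return "".join(AR_TO_HE.get(ch, ch) for ch in text)

# Arabic "salaam" rendered character-by-character in Hebrew script
print(transliterate("سلام"))  # → סלאמ
```

A production scheme would also have to handle details this sketch ignores, such as Hebrew final letter forms (e.g. ם, ן), Arabic diacritics, and many-to-one correspondences; the point here is only that once both languages share one script, a single tokenizer and embedding table can serve them both.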


This paper is available on arxiv under CC BY 4.0 DEED license.