Authors:
(1) Aviad Rom, The Data Science Institute, Reichman University, Herzliya, Israel;
(2) Kfir Bar, The Data Science Institute, Reichman University, Herzliya, Israel.
Arabic and Hebrew, both Semitic languages, display inherent structural resemblances and share many cognates. In an effort to allow a bilingual language model to recognize these cognates, we introduced a novel language model tailored for Arabic and Hebrew, in which the Arabic text is transliterated into the Hebrew script prior to both pre-training and fine-tuning. To assess the impact of transliteration, we contrast our model with another language model trained on the identical dataset but without the transliteration preprocessing step. We fine-tuned our model for the machine translation task, yielding promising outcomes; these results suggest that the transliteration step offers tangible benefits to the translation process.
Compared with translation setups involving other language models, we observe comparable results; this is encouraging given that the data we used for pre-training our model is approximately 60% smaller than theirs.
As a future avenue of research, we intend to train the model on an expanded dataset and explore scaling its architecture. In this study, our emphasis was on a transformer encoder. We are keen to investigate the effects of implementing transliteration within a decoder architecture, once such a model becomes available for Hebrew.
The transliteration algorithm from Arabic to Hebrew is based on a simple deterministic lookup table. However, the transliteration is sometimes not that straightforward, and this simple algorithm generates some odd renderings that we would like to fix. For example, our algorithm does not use the final form of a Hebrew letter at the end of a transliterated Arabic word. Another challenge with transliteration into Hebrew is that for some words a Hebrew writer may choose to omit long vowel characters, and readers will still understand the word. This phenomenon is referred to as “Ktiv Hasser” in Hebrew. Still, certain spellings of a word are preferred over others. This inconsistency makes it more challenging to align transliterated Arabic words with their Hebrew cognates as they are typically written. Our transliteration algorithm always renders the Arabic word letter by letter, following the Arabic spelling, which may differ from how the word is usually written in Hebrew.
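To make this limitation concrete, the following is a minimal sketch of a deterministic character-level lookup-table transliteration in Python. The mapping shown is partial and the `transliterate` function is purely illustrative; it is an assumption for exposition, not the exact table or code used in our experiments. As noted above, a sketch like this does not produce Hebrew final-form letters at word endings and follows the Arabic spelling character by character.

```python
# Illustrative, partial Arabic-to-Hebrew character mapping (not the paper's table).
ARABIC_TO_HEBREW = {
    "ا": "א", "ب": "ב", "ت": "ת", "ج": "ג", "د": "ד",
    "ر": "ר", "ز": "ז", "س": "ס", "ش": "ש", "ع": "ע",
    "ق": "ק", "ك": "כ", "ل": "ל", "م": "מ", "ن": "נ",
    "ه": "ה", "و": "ו", "ي": "י", "ة": "ה",
}

def transliterate(text: str) -> str:
    """Deterministic character-level transliteration via a lookup table.

    Characters not in the table (punctuation, digits, Latin) pass through
    unchanged. Note the limitation discussed above: no Hebrew final-form
    letters (e.g. מ -> ם at the end of a word) are produced, and long
    vowels are copied as written in Arabic rather than adapted to common
    Hebrew spelling.
    """
    return "".join(ARABIC_TO_HEBREW.get(ch, ch) for ch in text)

print(transliterate("كتب"))  # -> "כתב", the shared Semitic root for "write"
```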
Another limitation is the relatively small size of the dataset we used for pre-training the language model, compared to other existing language models of the same architectural size.
This paper is available on arxiv under CC BY 4.0 DEED license.