This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.
Authors:
(1) Dihia LANASRI, OMDENA, New York, USA;
(2) Juan OLANO, OMDENA, New York, USA;
(3) Sifal KLIOUI, OMDENA, New York, USA;
(4) Sin Liang Lee, OMDENA, New York, USA;
(5) Lamia SEKKAI, OMDENA, New York, USA.
With the proliferation of hate speech on social networks in different forms, such as abusive language, cyberbullying, and violence, people have experienced a significant increase in violence, exposing them to uncomfortable situations and threats. Considerable effort has been dedicated in the last few years to detecting hate speech in structured languages such as English, French, and Arabic. However, few works deal with Arabic dialects such as Tunisian, Egyptian, and Gulf, and in particular with the Algerian one. To fill this gap, we propose in this work a complete approach for detecting hate speech in online Algerian messages. Several deep learning architectures are evaluated on a corpus we created from Algerian content on social networks (Facebook, YouTube, and Twitter). This corpus contains more than 13.5K documents in Algerian dialect written in Arabic, labeled as hateful or non-hateful. Promising results are obtained, which show the efficiency of our approach.
Keywords: Hate Speech · Algerian dialect · Deep Learning · DziriBERT · FastText
Hate speech detection, or the detection of offensive messages in social networks, communication forums, and websites, is an active research topic. Many hate crimes and attacks today start from social network posts and comments MacAvaney et al. [2019]. Studying this phenomenon is imperative for online communities to keep a safe environment for their users. It also significantly benefits security authorities and states in ensuring the safety of citizens and preventing crimes and attacks.
A universally accepted definition of hate speech is currently unavailable Bogdani et al. [2021] because of the variation in cultures, societies, and local languages. Other difficulties include the diversity of national laws, the variety of online communities, and the many forms of online hate speech. Various definitions have been proposed.
According to the Encyclopedia of the American Constitution: "Hate speech is speech that attacks a person or group based on attributes such as race, religion, ethnic origin, national origin, sex, disability, sexual orientation, or gender identity." Nockleby [2000]. This definition is widely used by authors today Guellil et al. [2022]. Facebook considers hate speech as "a direct attack on people based on protected characteristics—race, ethnicity, national origin, religious affiliation, sexual orientation, caste, sex, gender, gender identity, and serious disease or disability. We also provide some protections for immigration status." [1]. One of the most accepted definitions is proposed by Davidson et al., who define hate speech as "language that is used to express hatred towards a targeted group or is intended to be derogatory, to humiliate, or to insult the members of the group" Davidson et al. [2017]. An alternative is the definition proposed by Fortuna et al.: "Hate speech is a language that attacks or diminishes, that incites violence or hate against groups, based on specific characteristics such as physical appearance, religion, descent, national or ethnic origin, sexual orientation, gender identity or other, and it can occur with different linguistic styles, even in subtle forms or when humor is used." Fortuna and Nunes [2018].
The literature review shows that the term hate speech, which is the most commonly used, has various related terms such as abusive speech, offensive language, cyberbullying, or sexism detection Schmidt and Wiegand [2017]. Many works have been published on hate speech detection for different standard and structured languages, like French Battistelli et al. [2020], English Alkomah and Ma [2022], Spanish Plaza-del Arco et al. [2021], and Arabic Albadi et al. [2018]. These languages are standardized, with well-known grammar and structure, which makes their processing well understood. However, detecting hate speech in dialects, particularly Arabic ones such as Libyan, Egyptian, and Iraqi, remains challenging and complex Mulki et al. [2019]. Although these dialects derive from literary Arabic, each country adds or defines its own vocabulary and semantics.
In this work, we are interested in detecting hate speech in the Algerian dialect. It is one of the more complex dialects Mezzoudj et al. [2019], characterized by a variety of sub-dialects that differ from region to region within the country. Algeria is a country with 58 regions; each one has its own specificities in the spoken language, with different words and meanings. The same word may have different meanings depending on the region; for example, 'Flouka' means 'earrings' in the east and 'small boat' in the north.
Moreover, new 'odd' words are continually added to the Algerian vocabulary, and the Algerian dialect is known for its morphological and orthographic richness. Given this, processing and understanding the Algerian dialect for hate speech detection is a complex task. The importance of this problem in the Algerian context motivates our work.
To the best of our knowledge, only a few works have been proposed for hate speech detection in the Algerian dialect Boucherit and Abainia [2022], Menifi et al. [2022]. Related topics have also been treated, such as sentiment analysis Abdelli et al. [2019] and sexism detection Guellil et al. [2021], and these may be exploited to analyze hate speech.
In this paper, we propose a complete end-to-end natural language processing (NLP) approach for hate speech detection in the Algerian dialect. Our approach covers the main steps of an NLP project: data collection, data annotation, feature extraction, model development based on machine and deep learning, model evaluation, and inference.
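To make the sequence of stages concrete, the following is a minimal illustrative sketch and not the implementation evaluated in this paper. It assumes a tiny hand-written placeholder corpus (the tokens `insult_a`, `hello`, etc. are hypothetical and stand in for collected, annotated messages), uses bag-of-words counts for feature extraction, and substitutes a simple multinomial naive Bayes classifier for the deep learning models. All function names are assumptions introduced for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical placeholder corpus standing in for collected and annotated
# messages (NOT the paper's data). Label 1 = hateful, 0 = non-hateful.
TRAIN = [
    ("insult_a insult_b you", 1),
    ("you insult_c insult_a", 1),
    ("insult_b them insult_c", 1),
    ("hello friend welcome", 0),
    ("thanks friend nice", 0),
    ("welcome nice day", 0),
]
TEST = [
    ("insult_a them", 1),
    ("nice friend", 0),
]

def extract_features(text):
    """Feature extraction: lowercase bag-of-words token counts."""
    return Counter(text.lower().split())

def train(examples):
    """Model development: fit multinomial naive Bayes counts."""
    token_counts = defaultdict(Counter)  # label -> token -> count
    label_counts = Counter()             # label -> number of documents
    vocab = set()
    for text, label in examples:
        feats = extract_features(text)
        token_counts[label].update(feats)
        label_counts[label] += 1
        vocab.update(feats)
    return {"tokens": token_counts, "labels": label_counts, "vocab": vocab}

def predict(model, text):
    """Inference: return the label with the highest log-posterior."""
    feats = extract_features(text)
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["labels"].items():
        score = math.log(n_docs / total_docs)  # log prior
        total_tokens = sum(model["tokens"][label].values())
        for token, count in feats.items():
            # Add-one (Laplace) smoothing handles unseen tokens.
            likelihood = (model["tokens"][label][token] + 1) / (total_tokens + vocab_size)
            score += count * math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def evaluate(model, examples):
    """Model evaluation: fraction of examples classified correctly."""
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)

model = train(TRAIN)
accuracy = evaluate(model, TEST)
```

In this sketch, each function maps to one stage named above, so any stage can be swapped independently, for example by replacing `extract_features` with learned embeddings or `train` with a neural classifier.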
This paper is structured as follows: Section 2 presents the necessary background, Section 3 reviews the most important related works, Section 4 details our proposed approach and the evaluated models, Section 5 discusses the obtained results, and Section 6 concludes the paper.
[1] Community Standards; available at: https://www.facebook.com/communitystandards/objectionable_content