This paper is available on arXiv under a CC BY-NC-SA 4.0 DEED license.
Authors:
(1) Dihia LANASRI, OMDENA, New York, USA;
(2) Juan OLANO, OMDENA, New York, USA;
(3) Sifal KLIOUI, OMDENA, New York, USA;
(4) Sin Liang Lee, OMDENA, New York, USA;
(5) Lamia SEKKAI, OMDENA, New York, USA.
The importance of hate speech detection on social networks has encouraged many researchers to build solutions (corpora and classifiers) to detect suspect messages. The literature review shows that most works focus on text in structured languages such as English, French, and Arabic. However, few works deal with dialects, particularly the Algerian one, which is known for its complexity and variety. To fill this gap, we propose in this paper a complete NLP approach to detect hate speech in the Algerian dialect. We built an annotated corpus of more than 13.5K documents, which is used to evaluate various deep learning architectures. The results are very promising; the most accurate model was DzaraShield.
Looking ahead, there is significant potential to enhance inference speed, particularly for the Dziribert-based and multilingual models. While this project primarily focused on Arabic characters, our next step will be to address the dialect when written in Latin characters. Embracing both Arabic and Latin characters will more accurately capture the nuances of the written Algerian dialect. Finally, we plan to expand our corpus size and explore alternative deep-learning architectures.
We would like to thank every person who has contributed to this project: Micha Freidin, Viktor Ivanenko, Piyush Aaryan, Yassine Elboustani, Tasneem Elyamany, Cephars Bonacci, Nolan Wang and Lydia Khelifa Chibout. We would also like to thank the Omdena organization for giving us this valuable opportunity.