This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.
Authors:
(1) Dihia LANASRI, OMDENA, New York, USA;
(2) Juan OLANO, OMDENA, New York, USA;
(3) Sifal KLIOUI, OMDENA, New York, USA;
(4) Sin Liang Lee, OMDENA, New York, USA;
(5) Lamia SEKKAI, OMDENA, New York, USA.
To identify hate speech in messages written in Algerian dialects, whether in Arabic or Latin script, we follow a comprehensive methodology encompassing (1) data gathering, (2) data annotation, (3) feature extraction, (4) model development, and (5) model evaluation and inference. We detail each of these stages in the subsequent sections.
Data collection serves as the foundational step in our approach. To effectively train our models, we require a robust dataset in the Algerian Arabic dialect. To achieve this, we sourced our data from three distinct social networks spanning the years 2017 to 2023:
1. YouTube: Numerous Algerian channels have emerged on YouTube, dedicated to discussing various topics, including politics, religion, social issues, youth concerns, education, and more. We identified and focused on the most influential channels with a significant following and engagement, and employed the YouTube Data API through a Python script to gather comments from their videos.
3. Facebook: To gather data from Facebook, we selected public pages discussing and sharing content about politics, Algerian products, influencers, mobile operators, etc. We collected the posts, comments, and replies from these pages using different solutions: (1) between 2017 and 2018, we were able to collect data from any public page using the Graph API; (2) since 2019, we have used either the free FacePager application to collect data from public pages or (3) the facebook-scraper library for scraping.
From these sources, we collected more than 2 million documents (messages) in different languages: Arabic, French, English, Algerian dialect, etc. The next step consisted of filtering only the documents written in Algerian dialects, either in Arabic or Latin characters. This filtering was done manually by a group of collaborators and yielded around 900K documents.
To annotate the data, we followed two approaches: automatic and manual. We decided to annotate only the dialect written in Arabic characters. Our approach consists of building one model that detects hate speech only for Algerian dialects written in Arabic characters; a transliteration function is then developed to transliterate any Algerian document written in Latin characters (Arabizi) into Arabic characters before classification.
We used a binary annotation scheme: 0 (NON-HATE) for a document that does not contain any hateful or offensive word, and 1 (HATE) for a message that contains a hateful word or whose meaning and semantics express hate.
1- Automatic annotation: For automatic annotation, we prepared a set of hateful keywords in the Algerian dialect discovered from our corpus. These words express hate and violence in Algerian speech, and the list contains 1,298 words. This list of keywords was used in a Python script to automatically tag a document with 1 if it contains at least one hateful keyword and with 0 otherwise. The automatically annotated corpus contains 200K Algerian documents written in Arabic characters.
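A minimal sketch of this keyword-based tagging step is given below; the file and column names are hypothetical, since the corpus and keyword list are not distributed with this paper.

```python
import pandas as pd

# Hypothetical file and column names, used only for illustration.
documents = pd.read_csv("algerian_corpus.csv")            # column "text": documents in Arabic characters
with open("hateful_keywords.txt", encoding="utf-8") as f:
    hateful_keywords = {line.strip() for line in f if line.strip()}

def auto_label(text: str) -> int:
    """Return 1 if the document contains at least one hateful keyword, 0 otherwise."""
    tokens = set(str(text).split())
    return int(bool(tokens & hateful_keywords))

documents["label"] = documents["text"].apply(auto_label)
documents.to_csv("auto_annotated_corpus.csv", index=False)
```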
2- Manual annotation: The automatically annotated documents were then validated manually. A group of annotators checked the annotated corpus and corrected the wrongly labeled documents. This manual step validated 5,644 documents, which were kept for the next step.
3- Dataset Augmentation for Enhanced Balance: To bolster our dataset and enhance its equilibrium, we employed a strategy involving the incorporation of positively labeled subsets sourced from sentiment analysis datasets. In doing so, we reclassified these subsets as non-hateful, under the reasonable assumption that expressions of positive sentiment inherently exclude hate speech. Specifically, we leveraged the dataset available at https://www.kaggle.com/datasets/djoughimehdi/algerian-dialect-review-for-sentiment-analysis, selecting solely the instances characterized by positive sentiment and relabeling them as 'normal'. However, due to preprocessing constraints, this process yielded a reduced set of just 500 documents.
Moreover, we used the corpus shared by Boucherit and Abainia [2022], containing 8.7K documents in the Algerian dialect. This dataset is labeled manually as Offensive (3,227), Abusive (1,334), and Normal (4,188). We changed the labels of this corpus into Hateful (1) for the fused Offensive and Abusive classes and Non-Hateful (0) for the Normal class. This corpus was filtered and processed to keep 7,345 labeled documents.
At the end of this step, we obtained an annotated balanced corpus of 13.5K documents in Algerian dialect written in Arabic characters, which will be used later to build classifiers.
Before using any dataset, a cleaning or preprocessing step should be performed. We defined a set of functions orchestrated in a pipeline, as illustrated in Figure 1; a minimal sketch of such a pipeline is given after the list below.
• Remove URL: All URLs in a document are deleted.
• Remove stop words: The list of Arabic stop words provided by NLTK is used to remove meaningless words. This list has been enriched with a set of stop words detected in the Algerian dialect.
• Replace special punctuation: Some punctuation sequences convey meaning and add value for the model, e.g., :) means happy and :( means upset. Such sequences are transformed into the corresponding emoji.
• Remove punctuation: All punctuation except the ones representing emotions are deleted.
• Remove repeated characters: Any consecutively repeated character is reduced to a single occurrence.
• Remove extra white spaces: All extra white spaces are deleted.
• Remove Latin chars: Latin characters in Arabic text are removed to avoid incoherence.
• Remove digits: In the Arabic dialect, digits do not have any added value; they are removed.
• Remove min-length words: Words of fewer than two characters are deleted; experiments show that such words are meaningless.
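The following sketch chains these steps in Python. The regular expressions, the emoticon map, and the NLTK stop-word handling are illustrative assumptions rather than the exact implementation used in our pipeline.

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
ARABIC_STOPWORDS = set(stopwords.words("arabic"))   # enriched in practice with dialectal stop words
EMOTICON_MAP = {":)": "😊", ":(": "😞"}              # illustrative subset of special punctuation

def preprocess(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)        # remove URLs
    for emoticon, emoji in EMOTICON_MAP.items():               # replace special punctuation by emojis
        text = text.replace(emoticon, f" {emoji} ")
    text = re.sub(r"[^\w\s😊😞]", " ", text)                    # remove remaining punctuation
    text = re.sub(r"(.)\1+", r"\1", text)                      # collapse repeated characters
    text = re.sub(r"[A-Za-z]+", " ", text)                     # remove Latin characters
    text = re.sub(r"\d+", " ", text)                           # remove digits
    tokens = [t for t in text.split()                          # splitting also removes extra spaces
              if t not in ARABIC_STOPWORDS and len(t) >= 2]    # stop words and min-length words
    return " ".join(tokens)
```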
The preprocessing of the data was meticulously crafted to cater to the unique characteristics of the Algerian dialect text. By applying rigorous preprocessing to the dialect, the data was made consistent and well-suited for training our models. These preprocessing steps were vital in ensuring that the model was sensitive to the nuances of the language and could effectively classify hate and non-hate content.
In the development of the models for binary classification of hate speech in the Algerian dialect, our corpus was loaded from a CSV file and stratified before being split into three distinct sets: training (80%), validation (10%), and testing (10%). The stratification is based on the label column to maintain a balanced representation of hate (1) and non-hate (0) content in each subset. The training set facilitated model training, while the validation set allowed for unbiased model evaluation during training to prevent over-fitting. The testing set served as an objective assessment of the model's generalization performance beyond the training data. Tokenization, which includes padding and truncation, is also performed.
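A sketch of this split and of the tokenization, assuming a pandas DataFrame with text and label columns and the DziriBERT tokenizer (file and column names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

df = pd.read_csv("annotated_corpus.csv")   # hypothetical file with "text" and "label" columns

# 80% train, 10% validation, 10% test, stratified on the label column.
train_df, temp_df = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, stratify=temp_df["label"], random_state=42)

tokenizer = AutoTokenizer.from_pretrained("alger-ia/dziribert")
train_enc = tokenizer(train_df["text"].tolist(), padding=True, truncation=True, return_tensors="pt")
```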
In this work, we evaluated several machine learning and deep learning classifiers. Below, we discuss the architecture, methodology, and performance of each model.
1. Linear Support Vector Classifier (LinearSVC): We utilized a Linear Support Vector Classifier (LinearSVC) model to investigate how a traditional machine learning approach would perform on this task. The TF-IDF (Term Frequency-Inverse Document Frequency) method was employed to convert the text data into a numerical format suitable for machine learning models. The model was initialized with default parameters and trained on the feature matrix obtained from the TF-IDF vectorization.
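This baseline can be reproduced with scikit-learn defaults; the sketch below reuses the train/test split introduced above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# TF-IDF features computed on the training texts only, then applied to the test texts.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_df["text"])
X_test = vectorizer.transform(test_df["text"])

clf = LinearSVC()                          # default parameters, as described above
clf.fit(X_train, train_df["label"])
print(classification_report(test_df["label"], clf.predict(X_test)))
```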
2. Gzip + KNN: Deep Neural Networks are potent learners capable of tackling a wide array of tasks. However, for relatively straightforward tasks like topic classification, they often prove excessive due to their substantial data requirements, high computational demands, and the need for meticulous hyper-parameter tuning. This part of the research centers on a more straightforward alternative known as "compressor-based text classification," which requires no training parameters and, despite its astonishing simplicity, exhibits interesting results. The approach comprises three key components: (1) utilization of a conventional lossless compression algorithm (gzip in this study); (2) application of a compressor-based distance metric (Normalized Compression Distance in this study); (3) implementation of a traditional KNN classifier.
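The three components can be expressed compactly as follows; the value of k is an illustrative choice, not the one used in our experiments.

```python
import gzip
import numpy as np

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance with gzip as the lossless compressor."""
    cx = len(gzip.compress(x.encode("utf-8")))
    cy = len(gzip.compress(y.encode("utf-8")))
    cxy = len(gzip.compress((x + " " + y).encode("utf-8")))
    return (cxy - min(cx, cy)) / max(cx, cy)

def knn_predict(test_texts, train_texts, train_labels, k: int = 5):
    """Classic KNN over the NCD matrix: majority vote among the k nearest training documents."""
    train_labels = np.asarray(train_labels)
    predictions = []
    for text in test_texts:
        distances = np.array([ncd(text, ref) for ref in train_texts])
        nearest = train_labels[np.argsort(distances)[:k]]
        predictions.append(int(np.bincount(nearest).argmax()))
    return predictions
```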
3. LSTM & BiLSTM with Dziri FastText: LSTM and BiLSTM are deep learning models well suited to NLP problems, mainly text classification tasks such as sentiment analysis and hate speech detection. In this paper, we tested these two models on our corpus. To learn the semantics and context of messages, we used FastText as a word embedding model; in our case, we fine-tuned a Dziri FastText model. The latter was trained on a large dataset of Algerian messages in Arabic characters based on the Skip-gram model. The obtained model (Dziri FastText) is used to generate an embedding matrix for our hate speech corpus. The sequential architecture is composed of: (i) an Embedding layer, which is the input layer representing the embedding matrix; (ii) a Dropout layer with a rate of 0.2 to prevent over-fitting; (iii) an LSTM or Bidirectional LSTM layer with units=100, dropout=0.4, recurrent_dropout=0.2; (iv) a Dropout layer with a rate of 0.2 to prevent over-fitting; (v) an output Dense layer using sigmoid as the activation function. We used Adam as the optimizer, binary cross-entropy as the loss function, batch_size=64, and epochs=100.
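A Keras sketch of the BiLSTM variant with these hyperparameters; the vocabulary size, embedding dimension, and embedding matrix are illustrative placeholders for the values produced by Dziri FastText.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, Bidirectional, LSTM, Dense

# Illustrative sizes; in practice the matrix is filled from the Dziri FastText vectors.
vocab_size, embedding_dim = 50_000, 300
embedding_matrix = np.zeros((vocab_size, embedding_dim))

model = Sequential([
    Embedding(vocab_size, embedding_dim,
              embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)),
    Dropout(0.2),
    Bidirectional(LSTM(units=100, dropout=0.4, recurrent_dropout=0.2)),
    Dropout(0.2),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), batch_size=64, epochs=100)
```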
4. Dziribert-FT-HEAD: Pre-trained transformers, like BERT, have become the standard in Natural Language Processing due to their exceptional performance in various tasks and languages. The authors in Abdaoui et al. [2021] collected over one million Algerian tweets and developed DziriBERT, the first Algerian language model, outperforming existing models, especially for the Latin script (Arabizi). This demonstrates that a specialized model trained on a relatively small dataset can outshine models trained on much larger datasets. The authors have made DziriBERT[3] publicly available to the community.
In this experiment, we fine-tuned DziriBERT by incorporating a classification head while keeping the rest of the DziriBERT parameters frozen. The classification head consists of three key components: a fully connected layer with 128 units, followed by batch normalization for stability, a dropout layer to mitigate overfitting, and a final fully connected layer that produces a single output value. We apply a sigmoid activation function to ensure the output falls between 0 and 1, which suits our binary classification task. Training employed the binary cross-entropy loss function and the Adam optimizer with a fixed learning rate of 1e-3. Additionally, a learning rate scheduler was employed to dynamically adjust the learning rate during training for improved convergence.
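A PyTorch sketch of this setup; the ReLU activation, the dropout rate, and the use of the [CLS] representation are assumptions not specified above.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class DziribertFTHead(nn.Module):
    """Frozen DziriBERT encoder with a small trainable classification head (sketch)."""
    def __init__(self):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("alger-ia/dziribert")
        for param in self.encoder.parameters():       # keep the DziriBERT weights frozen
            param.requires_grad = False
        hidden = self.encoder.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(hidden, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),                                 # assumed activation
            nn.Dropout(0.3),                           # assumed dropout rate
            nn.Linear(128, 1),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0, :]           # [CLS] representation (assumed pooling)
        return torch.sigmoid(self.head(cls)).squeeze(-1)

# Training, as described above: nn.BCELoss() with torch.optim.Adam(model.parameters(), lr=1e-3)
# and a learning rate scheduler such as torch.optim.lr_scheduler.ReduceLROnPlateau.
```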
5. DziriBERT with PEFT+LoRA: In our experiment, we fine-tuned the pre-trained model DziriBERT using PEFT (Parameter-Efficient Fine-Tuning) Mangrulkar et al. [2022] combined with LoRA Hu et al. [2021]. These methodologies allowed us to tailor the model specifically to the Algerian dialect, making it sensitive to the unique nuances of this language. The PEFT configuration is established using the LoRA technique; parameters such as the reduction factor, scaling factor, dropout rate, and bias are defined according to the task requirements.
PEFT and LoRA Configuration: PEFT has recently emerged as a powerful approach for adapting large-scale pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters. Given that fine-tuning such models can be prohibitively costly, PEFT offers a viable alternative by fine-tuning only a small number of (extra) model parameters, which greatly decreases computational and storage costs without compromising performance.
LoRA is a technique specifically designed to make the fine-tuning of large models more efficient and memory-friendly. The essential idea behind LoRA is to represent weight updates using two smaller matrices (referred to as update matrices) obtained through a low-rank decomposition. While the original weight matrix remains frozen, these new matrices are trained to adapt to the new data, keeping the overall number of changes minimal. LoRA has many advantages, mainly: (1) Efficiency: by significantly reducing the number of trainable parameters, LoRA makes fine-tuning more manageable. (2) Portability: since the original pre-trained weights are kept frozen, multiple lightweight LoRA models can be created for various downstream tasks. (3) Performance: LoRA achieves performance comparable to fully fine-tuned models without adding any inference latency. (4) Versatility: though typically applied to attention blocks in Transformer models, LoRA's principles can, in theory, be applied to any subset of weight matrices in a neural network.
Model Initialization: DziriBERT is loaded and configured with PEFT using the defined parameters. The model is then fine-tuned using the tokenized datasets. We configure our model using the LoraConfig class, which includes the following hyperparameters (a configuration sketch is given after the list):
- Task Type: We set the task type to Sequence Classification (SEQ_CLS), where the model is trained to map an entire sequence of tokens to a single label.
- Target Modules: The target modules are set to "query" and "value".
- Rank (r): We employ a low-rank approximation with rank r = 16 for the LoRA matrices.
- Scaling Factor (α): The LoRA layer uses a scaling factor α = 32, which serves as a regularization term.
- Dropout Rate: We introduce a dropout rate of 0.35 in the LoRA matrices to improve generalization.
- Bias: The bias term is set to "none," reducing the model complexity.
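A sketch of this configuration with the peft library; the number of labels is an assumption based on our binary task.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

base_model = AutoModelForSequenceClassification.from_pretrained("alger-ia/dziribert", num_labels=2)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,          # sequence classification
    target_modules=["query", "value"],   # attention projections adapted by LoRA
    r=16,                                # rank of the update matrices
    lora_alpha=32,                       # scaling factor
    lora_dropout=0.35,                   # dropout inside the LoRA matrices
    bias="none",                         # no bias parameters are trained
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only the LoRA (and head) parameters remain trainable
```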
Training Process: The model is trained using custom training arguments, including the learning rate, batch sizes, number of epochs, and evaluation strategy. The training process leverages the Hugging Face Trainer class, providing a streamlined approach to model fine-tuning. We train our model with the following parameters (a training sketch is given after the list):
- learning_rate=1e-3: Specifies the learning rate as 1e-3. The learning rate controls how quickly or slowly a model learns during the training process.
- per_device_train_batch_size=16: Each device used for training (usually a GPU) will handle a batch of 16 samples during each training iteration.
- per_device_eval_batch_size=32: Similar to the above, but for evaluation, each device will process batches of 32 samples.
- num_train_epochs=5: The training process will go through the entire training dataset 5 times. An epoch is one complete forward and backward pass of all the training examples.
- weight_decay=0.01: This is a regularization technique that helps prevent the model from fitting the training data too closely (overfitting). A weight decay of 0.01 will be applied.
- evaluation_strategy="epoch": Evaluation will be performed at the end of each epoch. This allows you to check the performance of your model more frequently and make adjustments if needed.
- save_strategy="epoch": The model will be saved at the end of each epoch, allowing you to revert to the model’s state at the end of any given epoch if necessary.
- load_best_model_at_end=True: Once all training and evaluation are completed, the best-performing model will be loaded back into memory. This ensures that you always have access to the best model when your training is complete.
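The corresponding Trainer setup is sketched below, reusing the PEFT-wrapped model from the previous sketch; the output directory and the tokenized dataset variables are illustrative.

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="dziribert-peft-lora",        # illustrative output directory
    learning_rate=1e-3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=peft_model,                        # the PEFT-wrapped DziriBERT from the previous sketch
    args=training_args,
    train_dataset=tokenized_train,           # assumed tokenized training split
    eval_dataset=tokenized_val,              # assumed tokenized validation split
    tokenizer=tokenizer,
)
trainer.train()
```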
6. Dzarashield: We built Dzarabert[4], which is a modification of the original DziriBERT model that involves pruning the embedding layer, specifically removing tokens that contain non-Arabic characters. This pruning significantly reduces the number of trainable parameters, resulting in faster training times and improved inference speed. This approach aims to optimize the model's performance for tasks involving Arabic-based text while minimizing unnecessary complexity and computational overhead. Dzarashield[5] is built upon the Dzarabert base model by incorporating a classification head. This classification head consists of a sequential architecture including: a linear layer (input: 768, output: 768) followed by a Rectified Linear Unit (ReLU) activation function; a dropout layer (dropout rate: 0.1); and another linear layer (input: 768, output: 2) for binary classification. The model's hyperparameters were determined through experimentation: a learning rate (lr) of 1.3e-05, a batch size of 16, and training for 4 epochs. The Adam optimizer was used with its default parameters during training. Experimentation showed a better score when updating all the weights of the model rather than freezing the base BERT model and updating only the classification head.
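In PyTorch terms, the head described above corresponds to something like the following sketch, attached to the pooled 768-dimensional output of the Dzarabert encoder.

```python
import torch.nn as nn

# Sketch of the Dzarashield classification head described above.
dzarashield_head = nn.Sequential(
    nn.Linear(768, 768),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(768, 2),   # two logits for the hate / non-hate decision
)
# Training, as described above: Adam with lr=1.3e-5, batch size 16, 4 epochs,
# updating all weights of the base model together with this head.
```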
7. Multilingual E5 Model: We conducted a fine-tuning process on a pre-existing model, specifically the Multilingual E5 base model Wang et al. [2022]. Our primary objective was to assess the efficacy of a multilingual model in the context of the Algerian dialect. In adherence to the training methodology, the prefix "query: " was systematically introduced to each data row; this measure is deemed necessary by Wang et al. [2022] to avoid the performance deterioration that might arise in its absence. Our investigation rests on the initialization of the pre-trained base model with the xlm-roberta-base[6] architecture, which was trained on a mixture of multilingual datasets. The model is fine-tuned with an additional Dense layer followed by a Dropout layer, using custom hyperparameters (Warmup Steps: 100; Weight Decay: 0.01; Epochs: 5; Dropout probability: 0.1; Train batch size: 16; Evaluation batch size: 64).
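The prefixing step is the only E5-specific preprocessing; a minimal sketch, assuming the intfloat/multilingual-e5-base checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")  # assumed checkpoint

def to_e5_input(text: str) -> str:
    """Prepend the 'query: ' prefix expected by the multilingual E5 models."""
    return "query: " + text

batch = tokenizer([to_e5_input(t) for t in ["example document 1", "example document 2"]],
                  padding=True, truncation=True, return_tensors="pt")
```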
8. sbert-distill-multilingual Fine-Tuned: Similar to the Multilingual E5 model, we fine-tuned a pre-trained model known as the sbert-distil-multilingual model from Sentence Transformers to investigate how well a multilingual model performs on the Algerian dialect. The pre-trained model is based on a fixed (monolingual) teacher model that produces sentence embeddings with the desired properties in one language; the student model is trained to mimic the teacher model, i.e., the same English sentence should be mapped to the same vector by both the teacher and the student. The model is fine-tuned with an additional Dropout layer and a GeLU layer via K-Fold cross-validation, using custom hyperparameters (Warmup Steps: 100; Weight Decay: 0.01; Dropout probability: 0.1; Epochs: 10; K-Fold: 4; Train batch size: 16; Evaluation batch size: 64).
9. AraT5v2-HateDetect: AraT5-base is the result of testing the T5 model (mT5)[7] on Arabic. For comparison, three robust Arabic T5-style models were pre-trained and evaluated on the ARGEN dataset Nagoudi et al. [2021]. Surprisingly, despite being trained on approximately 49% less data, these models outperformed mT5 on the majority of ARGEN tasks, achieving several new state-of-the-art results. The AraT5v2-base-1024 model[8] introduces several improvements compared to its predecessor, AraT5-base:
- More Data: AraT5v2-base-1024 is trained on a larger and more diverse Arabic dataset. This means it has been exposed to a wider range of Arabic text, enhancing its language understanding capabilities.
- Larger Sequence Length: This version increases the maximum sequence length from 512 to 1024. This extended sequence length allows the model to handle longer texts, making it more versatile in various NLP tasks.
- Faster Convergence: During the fine-tuning process, AraT5v2-base-1024 converges approximately 10 times faster than the previous version (AraT5-base). This can significantly speed up the training and fine-tuning processes, making it more efficient.
- Extra IDs: AraT5v2-base-1024 supports 100 sentinel tokens, also known as unique mask tokens. This allows for more flexibility and customization when using the model for specific tasks.
Overall, these enhancements make AraT5v2-base-1024 a more powerful and efficient choice for Arabic natural language processing tasks compared to its predecessor, and it is recommended for use in place of AraT5-base. AraT5v2-HateDetect[9] is a fine-tuned model based on AraT5v2-base-1024, specifically tailored for the hate detection task. The fine-tuning process involves conditioning the decoder’s labels, which include target input IDs and target attention masks, based on the encoder’s source documents, which consist of source input IDs and source attention masks. After experimentation, the following hyperparameters were chosen for training AraT5v2-HateDetect (Training Batch Size: 16; Learning Rate: 3e-5; Number of Training Epochs: 4). These hyperparameters were determined to optimize the model’s performance on the hate detection task. The chosen batch size, learning rate, and training epochs collectively contribute to the model’s ability to learn and generalize effectively for this specific NLP task.
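Because AraT5v2 is an encoder-decoder model, hate detection is framed as generating a target label sequence conditioned on the source document. Below is a sketch of one training step with the hyperparameters above; the optimizer choice (AdamW) and the label strings are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5v2-base-1024")
model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5v2-base-1024")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)   # optimizer choice is an assumption

texts = ["وثيقة مثال"]     # source documents (example)
labels = ["hate"]           # illustrative target label strings (e.g., "hate" / "normal")

source = tokenizer(texts, padding=True, truncation=True, max_length=1024, return_tensors="pt")
target = tokenizer(labels, padding=True, truncation=True, return_tensors="pt")

# The decoder labels (target input IDs) are conditioned on the encoder's source
# input IDs and attention masks, as described above.
outputs = model(input_ids=source.input_ids,
                attention_mask=source.attention_mask,
                labels=target.input_ids)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```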
To evaluate the different models, we used four main metrics: Accuracy, Precision, F1-Score, and Recall. To classify a message written in Arabizi (the variant of the dialect using Latin characters), a transliteration process based on the lang-trans[10] library was implemented to convert the text into Arabic characters.
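All four metrics are available in scikit-learn; a minimal sketch, where the gold labels and model predictions are illustrative placeholders (0 = non-hate, 1 = hate):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Illustrative labels; in practice y_true and y_pred come from the test split and a trained model.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))
```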
[3] https://huggingface.co/alger-ia/dziribert
[4] https://huggingface.co/Sifal/dzarabert
[5] https://huggingface.co/Sifal/dzarashield
[6] https://huggingface.co/xlm-roberta-base
[7] https://huggingface.co/docs/transformers/model_doc/mt5
[8] https://huggingface.co/UBC-NLP/AraT5v2-base-1024
[9] https://huggingface.co/Sifal/AraT5v2-HateDetect
[10] https://pypi.org/project/lang-trans/