Hate Speech Detection in Algerian Dialect Using Deep Learning: Related Work

14 Feb 2024

This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.

Authors:

(1) Dihia LANASRI, OMDENA, New York, USA;

(2) Juan OLANO, OMDENA, New York, USA;

(3) Sifal KLIOUI, OMDENA, New York, USA;

(4) Sin Liang Lee, OMDENA, New York, USA;

(5) Lamia SEKKAI, OMDENA, New York, USA.

Only a limited number of works on hate speech detection in Arabic dialects have been published. This section analyzes the most important NLP approaches dedicated to (1) the Algerian dialect and (2) other Arabic dialects such as Iraqi, Egyptian, Syrian, and Tunisian. This analysis helps us identify the approaches, models, and corpora used.

3.1 Hate speech detection in Algerian dialect

Guellil et al. [2022, 2021] developed the first approach and corpus in the Algerian dialect for detecting hate speech against women in the Arabic community on social media. The corpus contains more than 373K YouTube comments. Two feature extraction algorithms were used: Word2vec with machine learning models (GaussianNB, LogisticRegression, RandomForest, SGDClassifier, and LinearSVC) and FastText with deep learning models (a deep Convolutional Neural Network (CNN), a long short-term memory (LSTM) network, and a bidirectional LSTM (BiLSTM) network). Simulation results showed that the CNN model with FastText performed best.

Boucherit and Abainia [2022] addressed the problem of detecting offensive and abusive content in Facebook comments. The corpus contains 8.7K comments in the Algerian dialect, written in Arabic and Latin characters and manually annotated as usual, abusive, or offensive. They used BiLSTM, CNN, FastText, SVM, and Multinomial Naive Bayes (NB) as classifiers. The experimental results showed that the SVM and Multinomial NB classifiers outperformed all the others.
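
To make the Multinomial NB baseline named above concrete, the following is a minimal sketch of the general technique using only the Python standard library. It is not the implementation of Boucherit and Abainia. The whitespace tokenizer, the smoothing value alpha=1.0, the function names, and the placeholder comments and tokens are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Collect class priors and per-class token counts for Multinomial NB."""
    token_counts = defaultdict(Counter)  # label -> token -> count
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()  # assumption: whitespace tokenization
        token_counts[label].update(tokens)
        vocab.update(tokens)
    class_counts = Counter(labels)
    return {
        "alpha": alpha,
        "vocab": vocab,
        "counts": token_counts,
        "totals": {c: sum(token_counts[c].values()) for c in class_counts},
        "log_prior": {c: math.log(n / len(labels)) for c, n in class_counts.items()},
    }


def predict(model, text):
    """Return the class with the highest log posterior; unseen tokens are ignored."""
    tokens = [t for t in text.lower().split() if t in model["vocab"]]
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, log_prior in model["log_prior"].items():
        score = log_prior
        denominator = model["totals"][label] + model["alpha"] * vocab_size
        for token in tokens:
            # Laplace-smoothed likelihood of the token given the class.
            score += math.log((model["counts"][label][token] + model["alpha"]) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data with placeholder tokens, not taken from the corpus.
docs = ["good match today", "nice video thanks", "badword1 you badword1", "slur1 slur1 people"]
labels = ["usual", "usual", "abusive", "offensive"]
model = train_multinomial_nb(docs, labels)
print(predict(model, "thanks nice"))  # usual
```

On a real corpus the token counts would typically be replaced by tf-idf or n-gram features, and the model would be evaluated on held-out comments.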

Abainia et al. [2022] addressed offensive language detection in Amazigh, an under-resourced language, focusing on the Kabyle dialect. They proposed a new corpus of offensive Amazigh language containing 6.2K documents collected from Facebook and manually annotated as usual or offensive. They also developed a new lexicon of offensive and abusive Amazigh words with 12.6K entries. Several models were evaluated: SVM and Multinomial Naive Bayes classifiers were tested with tf-idf features, and FastText was tested with the deep learning models CNN and BiLSTM. The naive statistical classifier based on lexicon checking performed best.
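
A classifier based on lexicon checking can be sketched in a few lines. The version below is an illustration under stated assumptions, not the classifier of Abainia et al.: the lexicon entries are hypothetical placeholders, and the tokenizer and the `min_hits` threshold are choices made here for the example.

```python
import re

# Hypothetical placeholder entries; the lexicon described above has about 12.6K entries.
OFFENSIVE_LEXICON = {"insult_a", "insult_b", "insult_c"}


def tokenize(text):
    """Lowercase the text and split it on runs of non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


def classify_by_lexicon(text, lexicon=OFFENSIVE_LEXICON, min_hits=1):
    """Label a document 'offensive' if at least min_hits tokens appear in the lexicon."""
    hits = sum(1 for token in tokenize(text) if token in lexicon)
    return "offensive" if hits >= min_hits else "usual"


print(classify_by_lexicon("You are an INSULT_A!"))         # offensive
print(classify_by_lexicon("a perfectly ordinary comment"))  # usual
```

Such a classifier needs no training data, which may partly explain why it can compete with learned models on a small corpus.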

Mazari and Kheddar [2023] introduced a new dataset for toxic text detection in the Algerian dialect. They built an annotated multi-label dataset of around 14K comments extracted from Facebook, YouTube, and Twitter and labeled as hate speech, offensive language, and cyberbullying. Several tests were conducted with traditional machine learning classifiers: Random Forest, Naïve Bayes, Linear Support Vector (LSV), Stochastic Gradient Descent (SGD), and Logistic Regression. Further assessments were conducted with deep learning models such as CNN, LSTM, Gated Recurrent Unit (GRU), BiLSTM, and Bidirectional GRU (Bi-GRU). Results demonstrated the best performance of LSV, BiLSTM, and MLP when associated with the SGD model.
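
In a multi-label dataset, a comment can carry several of these labels at once, so each annotation is usually encoded as a multi-hot vector rather than a single class index. The sketch below illustrates that encoding. The label order and the function name are assumptions made for the example, not details taken from Mazari and Kheddar.

```python
# Label names follow the three categories described above; the order is an assumption.
LABELS = ("hate speech", "offensive language", "cyberbullying")


def to_multi_hot(comment_labels, label_order=LABELS):
    """Encode a set of labels as a 0/1 vector with one position per label."""
    unknown = set(comment_labels) - set(label_order)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if label in comment_labels else 0 for label in label_order]


print(to_multi_hot({"offensive language", "cyberbullying"}))  # [0, 1, 1]
print(to_multi_hot(set()))                                     # [0, 0, 0]
```

A comment with no label maps to the all-zero vector, which a single-label (multi-class) encoding cannot represent without adding an extra class.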

Guellil et al. [2020] proposed a system for detecting hateful speech in Arabic political debates. The approach was evaluated on a hateful corpus concerning Algerian political debates, which contains 5K YouTube comments in MSA and Algerian dialects, written in both Arabic and Latin characters. Both classical classification algorithms (Gaussian NB, Logistic Regression, Random Forest, SGD Classifier, and Linear SVC (LSVC)) and deep learning algorithms (CNN, multilayer perceptron (MLP), LSTM, and BiLSTM) were tested. For feature extraction, the authors used Word2vec and FastText, each with its two implementations, namely Skip-Gram and CBOW. Simulation results demonstrated the best performance of LSVC, BiLSTM, and MLP.
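
The difference between the Skip-Gram and CBOW variants lies in how training examples are formed from a context window: Skip-Gram predicts each context word from the center word, while CBOW predicts the center word from its whole context. The sketch below only builds the training pairs and does not train embeddings. The function names, the window size, and the placeholder tokens are assumptions for illustration.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding position i itself."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]


def skipgram_pairs(tokens, window=2):
    """Skip-Gram: one (center, context_word) example per neighbor of each token."""
    return [
        (tokens[i], neighbor)
        for i in range(len(tokens))
        for neighbor in context_window(tokens, i, window)
    ]


def cbow_pairs(tokens, window=2):
    """CBOW: one (context_words, center) example per token that has a context."""
    pairs = []
    for i in range(len(tokens)):
        context = context_window(tokens, i, window)
        if context:
            pairs.append((context, tokens[i]))
    return pairs


tokens = ["w1", "w2", "w3"]  # placeholder tokens
print(skipgram_pairs(tokens, window=1))
# [('w1', 'w2'), ('w2', 'w1'), ('w2', 'w3'), ('w3', 'w2')]
print(cbow_pairs(tokens, window=1))
# [(['w2'], 'w1'), (['w1', 'w3'], 'w2'), (['w2'], 'w3')]
```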

Mohdeb et al. [2022] proposed an approach for analyzing and detecting dialectal Arabic hate speech targeting African refugees and illegal migrants in the Algerian YouTube space. The corpus contains more than 4K comments annotated as Incitement, Hate, Refusing with non-hateful words, Sympathetic, or Comment. A transfer learning approach was used for classification. The experiments showed that the monolingual transformer AraBERT outperformed the mono-dialectal transformer DziriBERT and the cross-lingual transformers mBERT and XLM-R.

3.2 Hate speech detection in other Arabic dialects

Various datasets and corpora have been published in different dialects. They can be used for purposes such as detecting hate speech, racism, or violence.

ALBayari and Abdallah [2022] were the first to propose a corpus built from Instagram comments. This corpus contains 198K comments written in MSA and three dialects: Egyptian, Gulf, and Levantine. The comments were annotated as neutral, toxic, or bullying. The datasets of Al-Ajlan and Ykhlef [2018] and Haidar et al. [2019] were collected from Twitter and contain 20K and 34K multi-dialectal Arabic tweets, respectively, annotated as bullying or non-bullying. These tweets came from various dialects (Lebanon, Egypt, and the Gulf area). Moreover, two other datasets were proposed by Mubarak et al. [2017]. The first contains 1.1K tweets in different dialects, and the second contains 32K inappropriate comments collected from a famous Arabic news site and annotated as obscene, offensive, or clean. Albadi et al. [2018] addressed religious hate speech detection and introduced a multi-dialectal dataset of 6.6K tweets, which identifies the religious groups targeted by hate speech. Alakrot et al. [2018] also provided a dataset of 16K Egyptian, Iraqi, and Libyan comments collected from YouTube. The comments were annotated as offensive, inoffensive, or neutral.

T-HSAB (Haddad et al. [2019]) and L-HSAB (Mulki et al. [2019]) are two publicly available corpora for abusive and hate speech detection. The first is in the Tunisian dialect and comprises 6K comments. The second is in the Levantine dialect (Syrian, Lebanese, Palestinian, and Jordanian dialects) and contains around 6K tweets. The documents in both corpora are labeled as Abusive, Hate, or Normal.

Mubarak et al. [2020] looked at MSA and four major dialects (Egyptian, Levantine, Maghrebi, and Gulf). They presented a systematic method for building an Arabic offensive language tweet dataset of 10K tweets that does not favor specific dialects, topics, or genres. For tweet labeling, they used the count of positive and negative terms based on a polarity lexicon. As features, they used FastText and skip-gram embeddings (AraVec skip-gram, Mazajak skip-gram) as well as deep contextual embeddings, namely BERTbase-multilingual and AraBERT. They evaluated different models: SVM, AdaBoost, and Logistic Regression.

Mulki and Ghanem [2021] introduced the first Arabic Levantine Twitter dataset for Misogynistic language (LeT-Mi), intended as a benchmark dataset for the automatic detection of online misogyny written in Arabic and the Levantine dialect. The proposed dataset consists of 6.5K tweets annotated either as neutral (misogynistic-free) or as one of seven misogyny categories: discredit, dominance, cursing/damning, sexual harassment, stereotyping and objectification, derailing, and threat of violence. They used BOW + TF-IDF, SOTA, LSTM, BERT, and Majority class as classifiers.

Duwairi et al. [2021] investigated the ability of CNN, CNN-LSTM, and BiLSTM-CNN deep learning networks to classify or discover hateful content posted on social media. These deep networks were trained and tested using the ArHS dataset, which consists of around 10K tweets annotated for hateful speech detection in Arabic. Three types of experiments are reported: binary classification of tweets into Hate or Normal; ternary classification into Hate, Abusive, or Normal; and multi-class classification into Misogyny, Racism, Religious Discrimination, Abusive, and Normal.

Aldjanabi et al. [2021] built an offensive and hate speech detection system using a multi-task learning (MTL) model on top of a pre-trained Arabic language model. The Arabic MTL model was tested with two different language models to cover MSA and dialectal Arabic. They evaluated a new pre-trained model, MarBERT, to classify both dialectal and MSA tweets. They proposed a model that explores multi-corpus-based learning using Arabic LMs and MTL to improve classification performance.

Haidar et al. [2017] presented a solution for the issue of cyberbullying in both Arabic and English languages. The proposed solution is based on machine learning algorithms using a dataset from Lebanon, Syria, the Gulf Area, and Egypt. That dataset contained 35K Arabic texts. In this research, Naïve Bayes and SVM models were chosen to classify the text. The SVM model achieved greater precision.

Abdelali et al. [2016] built a large dataset that consists of offensive Arabic words from different dialects and topics. The tweets were labeled with one of these categories: offensive, vulgar, hate speech, or clean. Since the offensive tweets involve implicit insults, the hate speech category comprised tweets that contain racist, religious, or ethnic words. Different classifiers were employed in this study: the SVM model with a radial function kernel was mainly used with lexical features and pre-trained static embeddings, while Adaptive Boosting and Logistic Regression classifiers were employed with Mazajak embeddings. SVM gave the best precision.

This literature analysis shows that hate speech detection in the Algerian dialect has not been widely studied and that only a few works deal with the problem. Furthermore, there is a lack of Algerian datasets prepared for hate speech detection. These findings motivate our proposal.