This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Dylan Warman, School of Computing, Charles Sturt University;
(2) Muhammad Ashad Kabir, School of Computing, Mathematics,.
A. Dataset Pre-processing and Configurations
We have used the CoAID (Covid-19 heAlthcare mIsinformation Dataset) [19], with 5216 total news items, and the C19-Rumor (A COVID-19 Rumor Dataset) dataset [20], with a total of 4129 news items. Both datasets have dedicated segments for news headlines and news articles about COVID-19. We selected these two datasets because their distributions of real and fake news complement each other: CoAID is heavily weighted towards true or real news, while C19-Rumor is heavily weighted towards false or fake news. This complementary weighting allows us to test the datasets individually, with augmentation, and in combination, to assess the impacts of different class weightings on the data.
Fig. 1 illustrates the word clouds of real and fake news for both datasets, allowing a direct comparison between them. From these word clouds, we can deduce that while COVID-19 and its variants are the predominant terms within the CoAID dataset, the key terms within the C19-Rumor dataset are more aligned with words such as pandemic, outbreak, China, and Wuhan. This observation suggests that the CoAID dataset, given its focus on health- and medical-related news, is unlikely to contain information related to the outbreak, videos, or Wuhan specifically.
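The term frequencies underlying such word clouds can be reproduced with a short sketch. The tokenization pattern and the (empty) stopword set below are illustrative assumptions, not the paper's exact preprocessing:

```python
import re
from collections import Counter


def term_frequencies(texts, stopwords=frozenset()):
    """Count lowercase word tokens across a collection of news items."""
    counts = Counter()
    for text in texts:
        # Keep alphanumeric tokens, preserving internal hyphens ("covid-19").
        for token in re.findall(r"[a-z0-9-]+", text.lower()):
            if token not in stopwords:
                counts[token] += 1
    return counts
```

Feeding the real and fake subsets of each dataset through such a counter yields the per-class frequencies that a word-cloud library then renders by size.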
We created seven configurations using the two named datasets for an extensive experimental evaluation. These configurations, as well as the training, validation, and test splits, are outlined in Table I. The objectives of configurations C1 and C3 are to establish baselines for each of the two datasets. Configurations C2 and C4 aim to assess whether data augmentation, particularly considering the limited dataset size, offers any advantages in terms of accuracy or if it adversely affects classification. C5 and C6 are employed to investigate two aspects: firstly, whether the datasets are representative of each other, and secondly, how well a model developed using 2020 COVID-19 fake news performs on news stories from 2021, and vice versa. Finally, the purposes of C7 (merged dataset) are to improve model robustness, increase generalisation, and mitigate the limitations associated with individual datasets such as imbalances in class distribution, lack of coverage for specific topics, or biases in data collection, which can ultimately enhance the accuracy and effectiveness of fake news detection systems.
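The splitting step behind each configuration can be sketched as below. The 80/10/10 ratios and the fixed seed are illustrative assumptions; the actual training, validation, and test splits are those given in Table I:

```python
import random


def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle items and return (train, validation, test) lists.

    A merged configuration such as C7 can be built by concatenating
    the two datasets' items before calling this function.
    """
    rng = random.Random(seed)  # fixed seed for reproducible splits
    shuffled = list(items)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

Cross-dataset configurations such as C5 and C6 skip the split entirely: one full dataset serves as training data and the other as the held-out test set.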
B. Machine Learning Techniques
In this study, we employed a baseline CNN [21] model and two advanced models, BERT [22] and Bi-LSTM [12], which have demonstrated high efficiency in fake news detection in prior research [11], [12], [23]. Both BERT and Bi-LSTM have shown strong performance on small datasets [24], [25], which is crucial given the small size of our sourced datasets. In contrast, CNN is a simpler architecture than the other two, providing a valuable baseline for what a basic model can achieve with the same dataset.
BERT [22] holds immense promise for fake news detection due to its ability to comprehend the nuances of language and context. By pre-training on a massive corpus of text, BERT becomes adept at understanding the subtle linguistic cues that often distinguish fake news from genuine content. Its bidirectional architecture allows it to capture relationships between words, making it highly effective in discerning the contextual intricacies that fake news articles often employ to deceive readers. Additionally, BERT’s fine-tuning capability enables it to adapt to specific datasets, thereby enhancing its accuracy in identifying misleading or fabricated information. As fake news continues to pose a significant challenge, BERT’s natural language processing prowess positions it as a valuable tool in the ongoing fight against misinformation and disinformation.
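The fine-tuning capability described above can be sketched with the Hugging Face `transformers` API. The checkpoint name, label mapping, learning rate, and epoch count below are illustrative assumptions, not the paper's exact configuration:

```python
# Hypothetical label mapping for the binary fake-news task.
LABELS = {0: "real", 1: "fake"}


def fine_tune_bert(train_texts, train_labels, model_name="bert-base-uncased"):
    """Fine-tune a pre-trained BERT checkpoint for binary classification."""
    import torch
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(LABELS))

    # Tokenize the whole (small) training set in one batch for simplicity.
    enc = tokenizer(train_texts, truncation=True, padding=True,
                    return_tensors="pt")
    labels = torch.tensor(train_labels)

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for _ in range(3):  # a few epochs, as is typical for BERT fine-tuning
        optimizer.zero_grad()
        out = model(**enc, labels=labels)
        out.loss.backward()
        optimizer.step()
    return tokenizer, model
```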
Bi-LSTM [12], on the other hand, is a recurrent neural network that specializes in capturing sequential dependencies in text. This makes it particularly effective at discerning subtle linguistic patterns within shorter pieces of text, and it excels at analyzing the structural flow of information within news articles.
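A hedged sketch of such a Bi-LSTM classifier in Keras follows; the vocabulary size, embedding dimension, and layer widths are illustrative assumptions rather than our tuned hyperparameters:

```python
def build_bilstm(vocab_size=20000, embed_dim=100):
    """Build a binary Bi-LSTM text classifier (hypothetical settings)."""
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Embedding(vocab_size, embed_dim),
        # The bidirectional wrapper reads the token sequence in both
        # directions, capturing left and right context for each word.
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # fake vs. real probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```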
C. Explainable Techniques
SHAP (SHapley Additive exPlanations) [26] and LIME (Local Interpretable Model-Agnostic Explanations) [27] are two popular explainable machine learning techniques used in the context of fake news detection to provide insights into model predictions and make the decision-making process more transparent. In this study, we employed SHAP as it holds several advantages over LIME when it comes to explaining the predictions of machine learning models. One key advantage is the global interpretability that SHAP offers. Unlike LIME, which provides local explanations for individual predictions, SHAP calculates feature importance consistently across all possible feature combinations [26]. This means that SHAP gives a holistic view of how each feature impacts model predictions across the entire dataset, allowing for a more comprehensive understanding of the model’s behavior. Additionally, SHAP is grounded in cooperative game theory, providing a mathematically rigorous framework for explaining the contributions of each feature. This makes SHAP particularly useful when one needs to identify overarching patterns and relationships within data, which is often crucial in applications like fake news detection, where understanding global linguistic and contextual patterns is essential for model transparency and improvement.
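The game-theoretic foundation mentioned above can be illustrated with a direct, exponential-time computation of Shapley values over a toy value function. For an additive value function, each feature's Shapley value equals its own contribution, reflecting the consistency property SHAP is built on. The feature names and weights below are made up purely for illustration:

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values: weighted marginal contributions over all subsets."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for r in range(n):
            for subset in combinations(others, r):
                s = len(subset)
                # Shapley weight for a coalition of size s out of n players.
                weight = factorial(s) * factorial(n - s - 1) / factorial(n)
                total += weight * (value_fn(set(subset) | {i})
                                   - value_fn(set(subset)))
        phi[i] = total
    return phi


# Toy additive model: each term's (hypothetical) push towards "fake".
contrib = {"pandemic": 0.4, "wuhan": 0.3, "cure": -0.2}
phi = shapley_values(lambda s: sum(contrib[f] for f in s), list(contrib))
```

Production SHAP libraries approximate these values efficiently rather than enumerating every subset, but the attribution being computed is the same.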
D. Web Application Architecture
Our web application’s architecture, depicted in Fig. 2, comprises two main components: a server component hosted on the AWS (Amazon Web Services) cloud and a client component implemented as a Chrome plugin.
The server for our application is hosted on an S3 (Simple Storage Service) Bucket within AWS. This bucket serves as a storage location, providing access to the files required to run the algorithm. The server hosts a trained machine learning model (Section III-B) and the request handler for the application. To publish this on AWS and generate the necessary components for the API, we used Cortex. Cortex is a free application that employs Docker images to automatically create Kubernetes clusters on AWS. A Kubernetes cluster consists of a set of functions from AWS that facilitate communication among the nodes, each representing a specific feature or function. This approach allows us to generate an API and a public access point for the application without the need to manually configure each feature individually. Once the API is accessible, it is connected using JavaScript within the client component (i.e., Chrome extension). This involves sending a request to the previously uploaded request handler within the S3 bucket, which generates a prediction and a set of force values representing explainability. These results are then returned to the Chrome extension.
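The request handler can be sketched in the style of Cortex's Python predictor interface (a class exposing `__init__` and `predict`). The `classify` method and the response fields below are illustrative assumptions about our handler, not part of Cortex's API:

```python
class PythonPredictor:
    """Hypothetical Cortex-style handler for the fake news classifier."""

    def __init__(self, config, model=None):
        # Cortex calls this once per replica; in our deployment the trained
        # model would be loaded from the S3 bucket named in `config`
        # (stubbed here via the optional `model` argument).
        self.model = model

    def predict(self, payload):
        # `payload` carries the text highlighted by the user in the browser.
        text = payload["text"]
        label, confidence, force_values = self.model.classify(text)
        # The force values are the per-token SHAP contributions that the
        # Chrome extension renders with colour-coding.
        return {"label": label,
                "confidence": confidence,
                "force_values": force_values}
```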
The client component (Chrome plugin) is built using HTML, with JavaScript providing the essential functionality for transmitting and processing text. JavaScript processes a request by reading the user-selected data (highlighted text) from the active webpage when the user clicks the application icon. Subsequently, the extension converts the response (classification result with an explanation) from the server into a readable format, applies CSS formatting to provide color-coding for the explainability element, and presents the results to the user.