Misinformation Identification Using Natural Language Processing
Abstract
The popularity of social media has accelerated the speed and widened the scope of fake news propagation, exacerbating the harm caused by false information. Identifying misinformation is crucial to maintaining a country's political, social, and financial stability and its democracy. In this thesis, we study the problem of misinformation identification using natural language processing (NLP). Given a claim, our approach classifies it as true, partly true, or false using a set of news articles whose contents are related to the claim. The set of related articles, collected from reputable sources, serves as the ground truth for assessing the validity of the claim.
Using this approach to misinformation identification, this thesis makes the following contributions:
- We constructed COVMIS, a new large-scale, feature-rich dataset of COVID-19 news and facts for research on COVID-19 misinformation. We provide a comprehensive analysis of the dataset to better understand the data, including claim contents, article contents, publication dates, news sources, and country distribution. We also discuss potential use cases to demonstrate the benefits of the dataset for research on COVID-19-related misinformation and other areas.
- We conducted two sets of extensive experiments to evaluate several state-of-the-art transformer-based NLP models using the COVMIS dataset. The models evaluated are BERT (Bidirectional Encoder Representations from Transformers), DistilBERT, XLNet (Generalized Autoregressive Pretraining for Language Understanding), ALBERT (A Lite BERT), and RoBERTa (Robustly Optimized BERT Pre-training Approach). The first set of experiments shows that BERT performs the best in terms of F1 score. In the second set of experiments, we evaluated an optimization: instead of inputting all articles related to a claim to classify the claim, we extracted and input only the K sentences (e.g., K = 5) that are the most relevant to the claim. Experimental results show that this optimization improves the performance of the models in terms of accuracy, F1 score, precision, and recall across different values of K.
- We conducted two sets of extensive experiments on a news classification model based on BERT and evaluated the performance of the model in terms of accuracy, F1 score, precision, and recall. We used two datasets: (i) the general news dataset provided by the Fake News Challenge competition and (ii) the COVMIS dataset mentioned above. The first set of experiments was designed to answer the question of whether narrowing down the domain of knowledge (i.e., COVID-related news vs. general news) will improve the classification performance. Our experimental results show that the classification performance of the model improves significantly when the domain of knowledge of the dataset is narrowed down to a specific area of interest, COVID-19 in this case. The second set of experiments quantified how obsolete training data affect the classification performance. Our experimental results show that the more up-to-date the training data (relative to the test data), the better the classification performance.
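The optimization in the second contribution, keeping only the K sentences most relevant to a claim before classification, can be illustrated with a small sketch. The abstract does not say how relevance is measured, so the TF-IDF cosine similarity used here is an assumption chosen for illustration, and the function names `tokenize` and `top_k_sentences` are hypothetical rather than taken from the thesis.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_k_sentences(claim, sentences, k=5):
    """Return up to k sentences ranked by TF-IDF cosine similarity to the claim.

    Assumption: TF-IDF cosine similarity stands in for whatever relevance
    measure the thesis actually uses, which the abstract does not specify.
    """
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    # Document frequency of each term across the candidate sentences.
    df = Counter(term for doc in docs for term in set(doc))

    def vectorize(tokens):
        tf = Counter(tokens)
        # Smoothed IDF, so claim terms absent from every sentence still get a weight.
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    claim_vec = vectorize(tokenize(claim))
    scored = [(cosine(claim_vec, vectorize(doc)), i) for i, doc in enumerate(docs)]
    # Highest similarity first; ties keep the original sentence order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sentences[i] for _, i in scored[:k]]
```

In a pipeline of this kind, the selected sentences would be concatenated with the claim and passed to the classifier in place of the full article text, which keeps the input within the token limit of BERT-style models. How the thesis splits articles into sentences and encodes the claim-evidence pair is not stated in the abstract.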