COVID Fake News Detection

Machine Learning

We construct a fake news detection algorithm using Natural Language Processing. We then use our model to quantify the extent to which these types of machine learning models degrade over time.

Marco Schildt https://github.com/m-schildt , Hannah Schweren https://github.com/hannahmagda , Steve Kerr https://smkerr.github.io/
2022-07-16

Abstract

Fake news represents one of the most pressing issues faced by democracies in the digital age. While there have been many examples of fake news consumption leading to the adoption of inaccurate beliefs and harmful behaviors, fake news regarding the Covid-19 pandemic is especially dangerous due to its potential to impact public health outcomes.

While machine learning has advanced our ability to identify and root out fake news, this approach is not without its limitations. Due to the constantly evolving nature of fake news, a fake news detection algorithm which is highly effective at one point in time may perform significantly worse later in time. This blog post describes the approach we took in our project and complements our final report which centers on the following question: To what extent does the fast-moving nature of fake news represent a limitation for fake news detection algorithms?

To answer this question, we construct our own fake news detection algorithm using the existing Covid-19 Fake News dataset which comprises 10,700 Covid-related social media posts which have been labeled either “real” or “fake.” Our model achieved a score of 93.5% when applied to social media posts from the same time period as the one on which it was trained, and received a score of 81.4% when applied to more recent social media posts. This roughly 10% decrease in performance supports our assumption that fake news detection algorithms degrade over time, albeit less quickly than we initially anticipated.

Background

Misinformation has always been a threat to society and democracy. Today, this danger is fueled by the potential of its rapid diffusion through social media. Machine learning (ML) and Natural Language Processing (NLP) enable us to detect fake news so that we might warn the reader and raise awareness about misinformation. This blog post will therefore present our work on the identification of such Fake News on social media platforms.

While some ML models are already very efficient at classifying whether news is real or fake, such models still face limitations. ML models are often bound to the data upon which they were originally trained, which is why the fast-moving nature of online content could degrade their effectiveness. Seeing as fake news is highly influenced by current events, we want to test the assumption that temporally rigid training sets are prone to a decline in effectiveness. As a current, politicized, and highly relevant topic, Covid-19 has created room for misinformation and intentional manipulation in social media, thereby representing an interesting case study.

This blog post will walk you through the two main components of our work:

  1. Building a fake news detection algorithm with a high level of effectiveness using the Covid-19 Fake News dataset.
  2. Applying our fake news detection algorithm to more recent data which we have collected ourselves to measure the extent to which the model degrades over time.

As mentioned in the introduction, the idea for the first part of our project is based on the paper of Patwa et al. (Patwa et al. 2021) which provided a good starting point for our further experimenting with the original and the new dataset. They achieve their best result with a Support Vector Machine classifier, scoring 92.8%.

The literature offers many similar approaches, for example, Varma, Rajshree et al. conducted a review of various ML and Deep Learning methods applied to the same topic coming to the conclusion that “Naive Bayes, Support Vector Machine, and Logistic regression are the most widely used supervised ML algorithms for developing fake news detection models” (Varma et al. 2021).

We hypothesized that there would be a reduction in the predictive power of our model over time. This assumption was based on previous research on the phenomenon of “concept drift” (Horne, Nørregaard, and Adali 2019). We assume that fake news constitutes non-stationary data whose characteristics might shift over time and thus a model trained on current data cannot provide long term reliable prediction which would pose a threat to the application of such models.

Proposed Method

Workflow

Figure 1: Workflow

In order to measure concept drift, we require datasets from two distinct time periods. We use the Covid-19 Fake News Detection dataset to train our model and to baseline its performance. We then apply our algorithm to a dataset we’ve created ourselves which contains Covid-related social media posts from early 2022 in order to determine the extent to which our model’s performance declines over time. Here’s a sample tweet to demonstrate our approach.

Sample Tweet

Figure 2: Sample Tweet

Our sample tweet comes from U.S. Representative Chip Roy who characterizes Pfizer’s request for vaccine authorization for children younger than five as having “zero basis in science.” Experts from PolitiFact, a widely recognized fact-checking site, assign this claim a “Pants on Fire!” rating which is reserved for inaccurate statements which make ridiculous claims.

PolitiFact Rating

Figure 3: PolitiFact Rating

We will use this tweet as an example to demonstrate our pre-processing pipeline. This process involves lower-casing, tokenizing, stop word removal, and lemmatization. The motivation here is to reduce the contents of each tweet in such a way that we maximize the amount of useful information while minimizing the amount of noise.

# sample tweet
tweet = "Literally using the force of government and the culture of fear to jab children under 5 - with zero basis in science - to make billions of dollars. Shameful, @pfizer."

# lowercase 
tweet_lower = tweet.lower() 
literally using the force of government and the culture of fear to jab children under 5 - with zero basis in science - to make billions of dollars. shameful, @pfizer.
tweet_tidy1 = re.sub(r"http(\S)+",' ', tweet_lower) # remove URLs   
tweet_tidy2 = re.sub(r"www(\S)+",' ', tweet_tidy1) # remove URLs
tweet_tidy3 = re.sub(r"&",' and ', tweet_tidy2) # replace & with ' and '
tweet_tidy4 = tweet_tidy3.replace('&amp',' ') # replace &amp with ' '
tweet_tidy5 = re.sub(r"[^0-9a-zA-Z]+",' ', tweet_tidy4) # remove non-alphanumeric characters
literally using the force of government and the culture of fear to jab children under 5 with zero basis in science to make billions of dollars shameful pfizer 
tweet_tokenized = tweet_tidy5.split() # tokenize 
['literally', 'using', 'the', 'force', 'of', 'government', 'and', 'the', 'culture', 'of', 'fear', 'to', 'jab', 'children', 'under', '5', 'with', 'zero', 'basis', 'in', 'science', 'to', 'make', 'billions', 'of', 'dollars', 'shameful', 'pfizer']
tweet_stopped = [w for w in tweet_tokenized if not w in stoplist] # remove stop words
['literally', 'using', 'force', 'government', 'culture', 'fear', 'jab', 'children', '5', 'zero', 'basis', 'science', 'make', 'billions', 'dollars', 'shameful', 'pfizer']
tweet_processed = [lemmatizer.lemmatize(w) for w in tweet_stopped] # lemmatize 
['literally', 'using', 'force', 'government', 'culture', 'fear', 'jab', 'child', '5', 'zero', 'basis', 'science', 'make', 'billion', 'dollar', 'shameful', 'pfizer']

Once the text data has been pre-processed, we begin to extract the features upon which we will train our model. Since computers can’t understand words, we translate our text data into numerical data. We use sklearn’s CountVectorizer to create a matrix containing information on the number of times a word appears. In this count matrix, each row represents a social media post and each column represents a word contained in the dataset. Here’s what that looks like for an example dataset containing four fake social media posts.

# create corpus of tweets
corpus = ["Literally using the force of government and the culture of fear to jab children under 5 - with zero basis in science - to make billions of dollars. Shameful, @pfizer.",
          "Dr. Anthony Fauci said lockdowns are a method for coercing people to comply with COVID-19 vaccinations.",
          "The drug labels for the Pfizer COVID-19 vaccine 'were blank when they should have contained all these diseases and adverse events' listed in a confidential report.",
          "Pfizer paid '$2.8 million bribe payment' to the FDA for COVID-19 vaccine approval."]

# count vectorizer
cv = CountVectorizer() # count term frequency

# fit & transform tweet into count vector
count_matrix = cv.fit_transform(corpus_clean)

# convert count matrix to data frame
count_array = count_matrix.toarray()
tweet_cv_df = pd.DataFrame(data=count_array,columns = cv.get_feature_names())

# display count matrix
tweet_cv_df.head() 
Count Matrix

Figure 4: Count Matrix

While simply creating a data frame of term frequencies may be sufficient, we can improve upon this with a technique known as term frequency-inverse document frequency or TF-IDF for short. While the acronym is a bit of a mouthful, the concept behind it is quite simple. In plain terms, TF-IDF takes a count matrix like the one we just created as input and weights each term by how frequently that term appears in the overall collection of tweets.

Words which are used infrequently receive a higher weighting than words that are commonplace. For example, the word “Covid” appears frequently in our dataset, so it would receive a higher weighting than a less common word such as “Ivermectin.” As such, we are telling the computer to pay special attention to words which are less common. Here’s what that looks like continuing along with the example from above.

# TF-IDF transformer
tfidf = TfidfTransformer() # weight counts by word frequency

# fit & transform count matrix to TF-IDF matrix
tweet_tfidf = tfidf.fit_transform(tweet_cv)

# convert to data frame
tfidf_array = tweet_tfidf.toarray()
tweet_tfidf = pd.DataFrame(data=tfidf_array,columns = cv.get_feature_names())

# display TF-IDF matrix
tweet_tfidf.head()
TF-IDF Matrix

Figure 5: TF-IDF Matrix

Aside from mere word counts, we can extract additional features from our text data by converting each word to a numerical representation of that word’s meaning, also referred to as a “word vector.” To accomplish this, we use GloVe, an unsupervised learning algorithm capable of encoding the co-occurrence probability ratio between two words as vector differences. In practical terms, this enables the computer to derive meanings from the words in our dataset. Since we deemed training our model on each and every word in our dataset to be too computationally expensive, we instead compute one word vector for each social media post by averaging the word vectors from each word contained in that post. Continuing with the example from above, here’s the code implementation.

import spacy 
nlp = spacy.load("en_core_web_md") # load reduced word vector table with 20k unique vectors for ~500k words

# create list of word vectors
word2vec_list = [nlp(doc).vector.reshape(1,-1) for doc in corpus]

# join word vectors
word2vec_data = np.concatenate(word2vec_list) 

# convert to data frame
word2vec_df = pd.DataFrame(word2vec_data) 
Word Vectors

Figure 6: Word Vectors

Now that we’ve extracted features from the text data, it’s time to train our fake news detection algorithm. To do so, we tested several common classification algorithms and used the best one as our baseline. This served as a starting point for the further finetuning of the classifiers on the validation data.

Once we have tuned and tested our models, we apply our best model to our more recent dataset to compare the performance. In doing so, we test our hypothesis by evaluating the extent to which concept drift (and possibly other factors) have degraded the quality of our fake news detection algorithm.

Experiments

Original Data: We use the Covid-19 Fake News Detection dataset created in 2020 by Patwa et al. and featured in the Constraint@AAAI2021 competition (Patwa et al. 2021).1 The dataset comprises 10,700 Covid-related social media posts from various platforms including Twitter, Facebook, and Instagram. Each post contains a falsifiable statement which has been labeled either “real” or “fake.”

   id                                              tweet label
0   1  The CDC currently reports 99031 deaths. In gen...  real
1   2  States reported 1121 deaths a small rise from ...  real
2   3  Politically Correct Woman (Almost) Uses Pandem...  fake
3   4  #IndiaFightsCorona: We have 1524 #COVID testin...  real
4   5  Populous states can generate large case counts...  real

Patwa et al.’s Covid-19 Fake News Detection dataset has already been split into training (60%), test (20%), and validation sets (20%). The dataset is approximately balanced between real (52.3%) and fake (47.7%). Details regarding the distribution of real and fake posts across training, test, and validation sets can be found in Figure 7.

Dataset composition before pre-processing

Figure 7: Dataset composition before pre-processing

“Fake” claims are collected from various fact-checking websites such as PolitiFact, Snopes, Boomlive, etc. and from tools like Google fact-check-explorer and IFCN chatbot. “Real” claims include tweets from credible and relevant Twitter accounts such as the World Health Organization (WHO) and the Centers for Disease Control and Prevention (CDC) among others.

We acknowledge the shortcomings associated with sourcing “real” data exclusively from Twitter while sourcing “fake” data from various social media platforms. To correct for this, we remove all social media posts exceeding Twitter’s 280 character limit from the dataset during pre-processing in an effort to make the “real” and “fake” data more comparable. While this did reduce the total number of observations, the amount of data is still sufficient to proceed. Moreover, we maintain the roughly equal split between real (49.1%) and fake (50.9%) social media posts in the dataset. See Figure 8 for a more detailed breakdown.

Dataset composition after pre-processing

Figure 8: Dataset composition after pre-processing

To familiarize ourselves with the data and to detect potential biases, we conducted an exploratory analysis. Our analysis revealed that the number of characters (Figure 9) and the number of words (Figure 10) are approximately balanced between real and fake posts.

Characters per tweet

Figure 9: Characters per tweet

Words per tweet

Figure 10: Words per tweet

Additionally, we examined high-frequency words (Figure 11) for both fake and real claims to gain further insight into the data. No significant discrepancies could be identified.

Word frequency

Figure 11: Word frequency

Lastly, we used part-of-speech (POS) tagging to gain a deeper understanding of the tweets. Besides a slight imbalance in the frequency of nouns between the real (51.3%) and fake (59.7%) posts, the overall balance gave no indication of significant biases in the dataset (Figure 12).

POS Tag Counts

Figure 12: POS Tag Counts

In sum, our exploratory analysis did not uncover any significant underlying issues with the Covid-19 Fake News Detection dataset, thereby allowing us to proceed with our project.

New data: For the second part of the project, we created our own dataset of more recent fake and real posts which were published between January and April 2022. We employed methods which closely mirror those used in the original paper by Patwa et al. so as to ensure that our dataset is comparable to the original dataset in all aspects aside from time period (Patwa et al. 2021). Thus, we collected tweets from the same Twitter accounts and used Politifact.com as a source for fake news, as the authors did for the original dataset.

While the authors of the original dataset did use additional fact-checking sources beyond PolitiFact when compiling fake claims such as Snopes, Boomlive, and Google fact-check-explorer among others, due to time constraints, our project only used data from PolitiFact.com. Despite this, we were still able to create a large enough dataset to draw conclusions. In the creation of our dataset, we took special care to maintain an equal share of fake and real posts within the dataset (the final new dataset contains 100 real and 100 fake posts). An overview of the data collection process is provided in Figure 13.

Data collection workflow

Figure 13: Data collection workflow

Evaluation method: The fake news detection algorithm that we build during the first part of our project is evaluated using a F1-score. We consider our fine-tuned model to be successful if it achieves a higher F1-score than our baseline model which received a F1-score of 92.8%.

The F1-score is also present in the second part of our project where we attempt to quantify the amount of concept drift associated with Covid-related fake news. We apply our fine-tuned fake news detection algorithm which was developed during the first part of the project to both the original dataset and the new dataset which we have created ourselves. The difference between the two F1-scores indicates the magnitude of the concept drift associated with Covid-related fake news between the two time periods in which the datasets were created.

While we acknowledge that there are several reasons which could explain the difference in F1-scores such as inconsistencies in the way both datasets were created, we believe that any gap between the two F1-scores contains valuable information regarding the magnitude to which concept drift may be present.

Software: The project was implemented with Python. Besides basic dependencies such as pandas and numpy, we use scikit-learn, NLTK, and SpaCy packages to pre-process text data, extract features, and train our models.

Experimental details: When creating our baseline model, we selected off-the-shelf models including Logistic Regression, Decision Tree, Gradient Boosting, and Linear SVC classifiers. We used default parameters for all models when identifying the most highest-scoring model (based on on F1-score) to serve as our baseline.

For the word embedding feature extraction process, we relied upon SpaCy’s English-language reduced word vector table with 20,000 unique vectors for nearly 500,000 words which was derived from pre-trained word vectors from the GloVe Common Crawl algorithm. Through this process, a vector of 300 features was generated for each individual tweet.

When it came to fine-tuning our model in order to out-perform the baseline model, we performed an optimization of hyperparameters for the feature extraction by using a grid search with the default 5-fold cross validation and scored using the F1-score metric.

Results: Figure 14 provides an overview of the quantitative performance of our baseline models. The Linear SVC classifier reached the highest F1-score with 92.8% on the test set.

Overview of F1-scores for ML classifiers without parameter tuning

Figure 14: Overview of F1-scores for ML classifiers without parameter tuning

By fine-tuning our models, we beat our baseline model. The SVM classifier outperformed the other models with a F1-score of 93.5%

Overview of F1-scores on test data for fine-tuned ML classifiers without word embedding

Figure 15: Overview of F1-scores on test data for fine-tuned ML classifiers without word embedding

And when we include word embedding the SVM classifier achieved an F1-score of 93.8% Even though including word embeddings improved our model, it drastically increased the runtime as it naturally enlargened the dataset.

Overview of F1-scores on test data for fine-tuned ML classifiers with word embedding

Figure 16: Overview of F1-scores on test data for fine-tuned ML classifiers with word embedding

Table 15 and 16 also display the results for testing the models on our new dataset. The SVM scores a result of 85.9% on the new data. The Logistic Regression classifier performs slightly better than SVM, obtaining an F1-score of 85.5% when applied to the new dataset. Overall, the F1-score dropped by more or less 10% for all of the models when applied to the new data.

Analysis

For further analysis we created scattertext plot visualizing which words and phrases in the training set correspond more often with real or fake tweets.2 The \(x\)- and \(y\)-axes are dense ranks for the terms wtih the highest level of association with real and fake tweets respectively. One element that can be observed in the plot is that some terms which often correspond with real news (blue) are associated with India, while terms relating to American presidents are more often associated with fake news (red). It can also be observed that certain terms with a high association are hashtags or even Twitter handles (e.g. indafightscorona), which might suggest that models base their classification on them.

example for dangerous fake news

Figure 17: example for dangerous fake news

Figures 18 and 19 present a random sample of the social media posts which were misclassified by our algorithm when applied to the test set. By manually analyzing them we could not detect a clear pattern that characterizes the false negatives and positives.

False Positives

Figure 18: False Positives

False Negatives

Figure 19: False Negatives

However, it is remarkable that fake social media posts which were misclassified as true are not easily classifiable as fake news but would require expert knowledge or further research. One reason for this is the occurrence of words like “cases” or “details” that are associated with real news and also appeared more often in the real posts from the training data.

Ethics: Even the best model will always contain some rate, however small, of misclassified statements. It is therefore essential that even if the application of our model could achieve an overall improvement of readers’ insights in fake news, this aspect is kept in mind before deploying the model in a real-world context.

Conclusions

After achieving a relatively high performance of 93.8% with SVM, the predictive power was reduced quite strongly when we applied it to our new fake and real news dataset (84.9%). We assume that this reduction, however, is not exclusively attributable to the concept drift (for example, there might be words in the new dataset that were not present in the training set). However, based on previous research on this, we suspect that concept drift strongly contributes to the prediction loss on the new data.

Acknowledgements

We thank Patwa et al. for providing the CONSTRAINT-2021 dataset which we used to train and test our models as well as used as a benchmark for our new data (Patwa et al. 2021).

Horne, Benjamin D., Jeppe Nørregaard, and Sibel Adali. 2019. “Robust Fake News Detection over Time and Attack.” ACM Trans. Intell. Syst. Technol. 11 (1). https://doi.org/10.1145/3363818.
Patwa, Parth, Shivam Sharma, Srinivas Pykl, Vineeth Guptha, Gitanjali Kumari, Md Shad Akhtar, Asif Ekbal, Amitava Das, and Tanmoy Chakraborty. 2021. “Fighting an Infodemic: Covid-19 Fake News Dataset.” In International Workshop onCombating Online Hostile Posts in Regional Languages During Emergency Situation, 21–29. Springer.
Varma, Rajshree, Yugandhara Verma, Priya Vijayvargiya, and Prathamesh P Churi. 2021. “A Systematic Survey on Deep Learning and Machine Learning Approaches of Fake News Detection in the Pre-and Post-COVID-19 Pandemic.” International Journal of Intelligent Computing and Cybernetics.

  1. The competition can be accessed here: https://competitions.codalab.org/competitions/26655↩︎

  2. Package repository can be accessed here: https://github.com/JasonKessler/scattertext↩︎

References

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/smkerr/COVID-fake-news-detection, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".