Research project on Fake News Classification in which I experimented with, analyzed and interpreted numerous algorithms and vectorizers on multiple Fake/Real News datasets to achieve the best accuracy possible.
Finally my project is done! This time I decided to work on Fake News Classification!
I approached this project more as a research effort: gathering as many datasets as possible, seeing what people had tried with them, then doing my own tweaking and drawing my own conclusions.
The reason is that, as you might have expected or already know, this is not an easy thing to predict at all.
I believe this approach gives a clearer understanding of the tools used, especially in Natural Language Processing.
It also shows that there exist complex problems which still require a lot of work, and cooperation between specialists from different fields, to be better understood (as will be shown below).
Enjoy!
Classifying news as real or fake is both very important and very hard.
And so, as the miguelmalvarez article (see Sources of inspiration) explains, Divide & Conquer must be used to tackle this multi-faceted problem. The article breaks it down into: Fact Checking, Source Credibility and Trust, News Bias, and Misleading Headlines, all of which are well detailed there.
The one thing everyone agrees on is that journalists must cooperate with AI experts to make real progress on this matter.
But for now, let’s see how our regular algorithms perform!
1) https://www.kaggle.com/hassanamin/textdb3 : News articles from the 2016 American Presidential Campaign
2) https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset : Already divided into real.csv and fake.csv
All datasets will be converted to a single format to ease the implementation.
The format will contain 2 columns: ‘text’ and ‘label’ (REAL or FAKE).
The ‘title’ will be appended to the ‘text’ and so the ‘title’ column will be dropped.
It contains 6335 rows of news from the 2016 American Presidential Campaign period.
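As a minimal sketch, the conversion for this first dataset could look like the following with pandas (the file name fake_or_real_news.csv is an assumption based on the Kaggle page):

```python
import pandas as pd

# Hypothetical file name for the first Kaggle dataset
df1 = pd.read_csv("fake_or_real_news.csv")

# Append the title to the article body, then keep only the 'text' and 'label' columns
df1["text"] = df1["title"].fillna("") + " " + df1["text"].fillna("")
df1 = df1[["text", "label"]]  # 'label' already holds REAL or FAKE
```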
Analysis:
The ‘title’ column will be handled the same way, while the ‘subject’ and ‘date’ columns will be dropped. Finally, fake.csv and real.csv will be merged and we’ll add the ‘label’ column and fill it with FAKE or REAL.
It is mostly US news from the end of 2017, with 44898 rows, roughly 7x the size of the first dataset.
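Again as a sketch, the preprocessing for this second dataset might look like this (the file names follow the description above; the exact column names are assumed from the Kaggle page):

```python
import pandas as pd

# Hypothetical paths to the two files of the second dataset
fake = pd.read_csv("fake.csv")
real = pd.read_csv("real.csv")

# Add the missing 'label' column before merging
fake["label"] = "FAKE"
real["label"] = "REAL"
df2 = pd.concat([fake, real], ignore_index=True)

# Append the title to the text, then drop 'title', 'subject' and 'date'
df2["text"] = df2["title"].fillna("") + " " + df2["text"].fillna("")
df2 = df2[["text", "label"]]
```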
Analysis:
Vectorizers:
Classifiers:
All the algorithms and vectorizers are implemented in a generic, interchangeable way.
That way we can couple any vectorizer with any algorithm and try all possible combinations!
The execute_algorithm() function takes in a dataset, a vectorizer and an algorithm. It applies the vectorizer to the dataset, which tokenizes the text and builds the vocabulary passed to the algorithm, which in turn fits and predicts. Finally, it returns the accuracy score and the confusion matrix.
75% of the data for every test was reserved for training.
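As a rough sketch (the actual implementation in the repository may differ), execute_algorithm() could be written with scikit-learn like this, using the 25% hold-out split mentioned above:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

def execute_algorithm(df, vectorizer, classifier):
    """Vectorize the 'text' column, fit the classifier and return
    the accuracy and confusion matrix on a 25% hold-out set."""
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.25, random_state=42)

    # Learn the vocabulary on the training texts only and reuse it on the test texts
    train_vectors = vectorizer.fit_transform(X_train)
    test_vectors = vectorizer.transform(X_test)

    classifier.fit(train_vectors, y_train)
    predictions = classifier.predict(test_vectors)

    return accuracy_score(y_test, predictions), confusion_matrix(y_test, predictions)
```

Fitting the vectorizer on the training split only avoids leaking test-set vocabulary statistics into the model.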
Example:
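For instance, with hypothetical short lists standing in for the full sets of vectorizers and classifiers, every combination can be scored like this:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB

# df1 is the DataFrame built in the preprocessing sketch above
for Vectorizer in [CountVectorizer, TfidfVectorizer]:
    for Classifier in [MultinomialNB, PassiveAggressiveClassifier]:
        accuracy, matrix = execute_algorithm(
            df1, Vectorizer(stop_words="english"), Classifier())
        print(Vectorizer.__name__, Classifier.__name__, round(accuracy, 4))
```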
We’ll be picking the best 5 out of the 21 scores for each dataset.
Confusion matrix of the Passive Aggressive Classifier with the TfidfVectorizer (first dataset):
Confusion matrix of the Passive Aggressive Classifier with the TfidfVectorizer (second dataset):
It may seem that, after getting 94% and 97% accuracy on the 2 datasets, our work is done.
But that’s not really the case.
There is a great write-up on StackOverflow with an incredibly useful function for finding the features that most affect the labels (meaning the words that matter most to our models).
It only works for binary classification (classifiers with 2 classes), which is fine in our case since we only have the ‘FAKE’ and ‘REAL’ labels.
Here is its implementation using our best classifier (PassiveAggressiveClassifier) coupled with the TfidfVectorizer:
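A rough sketch of that helper, following the usual pattern of sorting a linear model’s coef_ weights (the function name and formatting here are assumptions, not necessarily identical to the StackOverflow answer):

```python
import numpy as np

def most_informative_features(vectorizer, classifier, n=20):
    """For a binary linear classifier, print the n terms with the most negative
    and the n terms with the most positive weights, i.e. the words that push a
    prediction hardest towards each of the two labels."""
    # Newer scikit-learn versions; older ones use vectorizer.get_feature_names()
    feature_names = np.asarray(vectorizer.get_feature_names_out())
    weights = classifier.coef_[0]  # one weight per vocabulary term
    order = np.argsort(weights)
    for neg_idx, pos_idx in zip(order[:n], order[::-1][:n]):
        print(f"{weights[neg_idx]: .4f} {feature_names[neg_idx]:<20} "
              f"{weights[pos_idx]: .4f} {feature_names[pos_idx]}")

# Example call with the fitted TfidfVectorizer and PassiveAggressiveClassifier
# most_informative_features(tfidf_vectorizer, pa_classifier, n=30)
```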
The results are shockingly noisy; although there is some logic in them, most of them are not very meaningful.
For example:
Finally, I am satisfied with the effort put into this, but not quite with the results.
I have learned a lot about Natural Language Processing: what works best with it, what doesn’t, what the challenges are, etc.
But what I’m sure about is that more knowledge from other fields (journalism, etc.) must be brought in to shed more light on the obstacles, and hopefully, with more discussion and competitions on the subject, that will happen soon.
Thank you for reading!