Project 4 - Fake News Prediction using Machine Learning with Python

My Machine Learning Beginner Projects, Entry 4

I am Salim Olanrewaju Oyinlola. I identify as a Machine Learning and Artificial Intelligence enthusiast who is quite fond of using data and patterns to develop insights and analyses.

In my opinion, machine learning is where the computational and algorithmic skills of data science meet the statistical thinking of data science. The result is a collection of approaches that require effective theory as much as effective computation. There is a plethora of machine learning models, each working best for different problems. As such, I believe understanding the problem setting in machine learning is essential to using these tools effectively. Now, the best way to UNDERSTAND different problem settings is by PLAYING AROUND with different problem settings. That is the genesis behind this writing series - My Machine Learning Projects. Over the course of this series, I will solve a machine learning problem daily. These problems will span a variety of fields and cover a range of models. A link to my previous articles can be found here.

Project Description: This model detects whether a given news article is fake or real, based on its author and title.

URL to Dataset: Download here

Line-by-line explanation of Code

import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

The block of code above imports the dependencies used in the model.

import numpy as np imports the numpy library which can be used to perform a wide variety of mathematical operations on arrays.

import pandas as pd imports the pandas library which is used to analyze data.

import re imports the re module, which provides regular expression matching operations similar to those found in Perl. A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check whether a particular string matches a given regular expression.
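
As a quick illustration (this snippet is separate from the project code), re.sub can replace every character that is not a letter with a space; this is exactly the pattern the stemming function later in this project uses to clean its input:

import re

sample = "Breaking: 45% of voters surveyed in 2016!"
cleaned = re.sub('[^a-zA-Z]', ' ', sample)   # every digit and punctuation mark becomes a space
print(cleaned)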

from nltk.corpus import stopwords imports the stop-words corpus from nltk. The stopwords in nltk are the most common words in a language; they say very little about the topic of a text, which is why they are removed before modelling.

from nltk.stem.porter import PorterStemmer imports the PorterStemmer class from the nltk.stem.porter module. Stemmers remove morphological affixes from words, leaving only the word stem; see the short example after the note below.

  • NOTE: nltk stands for Natural Language Toolkit which is used for natural language processing with Python.
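
To give a feel for what a stemmer does, here is a small standalone example (illustrative only, assuming nltk is installed):

from nltk.stem.porter import PorterStemmer

example_stemmer = PorterStemmer()
for word in ['running', 'studies', 'connections', 'reported']:
    print(word, '->', example_stemmer.stem(word))
# running -> run, studies -> studi, connections -> connect, reported -> report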

from sklearn.feature_extraction.text import TfidfVectorizer imports the TfidfVectorizer class from sklearn's feature_extraction.text module. It converts a collection of raw documents to a matrix of TF-IDF features.
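
A minimal sketch of what TfidfVectorizer produces on a toy corpus (illustrative only; get_feature_names_out is available in recent scikit-learn versions):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['fake news spreads fast',
        'real news travels slowly',
        'fake fake headlines']
vec = TfidfVectorizer()
features = vec.fit_transform(docs)
print(vec.get_feature_names_out())   # the vocabulary learned from the corpus
print(features.shape)                # (3 documents, 8 unique words)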

from sklearn.model_selection import train_test_split imports the train_test_split function from sklearn's model_selection library. It will be used to split arrays or matrices into random train and test subsets.

from sklearn.linear_model import LogisticRegression imports the LogisticRegression machine learning model from sklearn's linear_model library. This is the model that will be trained on the data.

from sklearn.metrics import accuracy_score imports the accuracy_score function from sklearn's metrics library. This function is used to ascertain the performance of our model.

  • NOTE: The logistic regression model, which is a classification model, is used because the problem is a classification problem: we are trying to classify texts as fake news or real news based on certain properties.
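
Under the hood, logistic regression maps any real-valued score to a probability between 0 and 1 via the sigmoid function; examples whose probability exceeds 0.5 are labelled 1 (fake) and the rest 0 (real). A rough sketch of that idea:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # squashes any real number into (0, 1)

print(sigmoid(-3), sigmoid(0), sigmoid(3))   # ~0.047, 0.5, ~0.953
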
import nltk
nltk.download('stopwords')

This downloads the stopwords corpus from the nltk data repository. It should return True once the download completes.

print(stopwords.words('english'))

This line of code prints the stopwords in English.

These words include: 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've" etc.

salim_news_dataset = pd.read_csv(r'C:\Users\OYINLOLA SALIM O\Downloads\fake-news\train.csv')

This loads the dataset into a pandas DataFrame named salim_news_dataset.

salim_news_dataset.head()

This prints the first five rows of the DataFrame.

The attributes are as follows:

  • id: unique id for a news article
  • title: the title of a news article
  • author: author of the news article
  • text: the text of the article; could be incomplete
  • label: a label that marks whether the news article is real or fake (1: fake news, 0: real news)

salim_news_dataset.isnull().sum()

This counts the number of missing values in each column of the dataset: 558 values are missing in title, 1957 in author, 39 in text, and none in id or label.

salim_news_dataset = salim_news_dataset.fillna('')

This replaces the null values with empty strings.

salim_news_dataset['content'] = salim_news_dataset['author']+' '+salim_news_dataset['title']

This merges the author name and the news title into a new content column.

X = salim_news_dataset.drop(columns='label')
Y = salim_news_dataset['label']

This block of code separates the data (X) from the label (Y). Note that X and Y are re-created later, once the content column has been stemmed and vectorised.

salim_port_stem = PorterStemmer()

This creates an instance of PorterStemmer. Recall that Stemmers remove morphological affixes from words, leaving only the word stem.

stop_words = set(stopwords.words('english'))   # build the stop-word set once, not on every call

def stemming(content):
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)   # keep letters only
    stemmed_content = stemmed_content.lower()             # normalise the case
    stemmed_content = stemmed_content.split()             # tokenise on whitespace
    stemmed_content = [salim_port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content

This block of code creates a function that stems its input. Stemming is the process of reducing a word to its root word; it removes prefixes, suffixes, etc.
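
As a quick check, calling the function on a made-up headline (illustrative only) shows the non-letters and stop words disappearing and the remaining words reduced to their stems:

print(stemming('Running Studies Show 99 New Connections!'))   # -> 'run studi show new connect'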

salim_news_dataset['content'] = salim_news_dataset['content'].apply(stemming)

This applies the stemming function to every entry in the content column.

X = salim_news_dataset['content'].values
Y = salim_news_dataset['label'].values

This block of code re-assigns X to the stemmed content and Y to the labels, this time as NumPy arrays.

Y.shape

This returns the shape of Y. Here, we have 20800 labels.

vectorizer = TfidfVectorizer()
vectorizer.fit(X)
X = vectorizer.transform(X)

This converts the textual data into numerical TF-IDF features: the vectorizer first learns the vocabulary from the content (fit), then transforms the content into a sparse numerical matrix (transform).

print(X)

This line of code is present just to confirm that the textual data was indeed transformed into numerical data.

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state=2)

The train_test_split function imported earlier is now called to divide the dataset into a train set and a test set.

  • The 0.2 value of test_size implies that 20% of the dataset is kept for testing whilst 80% is used to train the model.
  • stratify=Y keeps the proportion of fake and real labels the same in both subsets, and random_state=2 makes the split reproducible.
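
As a quick sanity check of the split (a sketch; 80% of the 20800 rows is 16640 and 20% is 4160, while the number of columns depends on the learned TF-IDF vocabulary):

print(X_train.shape, X_test.shape)   # (16640, n_features) and (4160, n_features)
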
model = LogisticRegression()
model.fit(X_train, Y_train)

This block of code creates an instance of the Logistic Regression classification ML model and trains it with the train set.

X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
print('Accuracy score of the training data : ', training_data_accuracy)

This block of code evaluates the accuracy score on the training data. We see an accuracy score of 0.9866586538461538, i.e. roughly 98.7%.

X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(X_test_prediction, Y_test)
print('Accuracy score of the test data : ', test_data_accuracy)

This block of code evaluates the accuracy score on the test data. We see an accuracy score of 0.9790865384615385, i.e. roughly 97.9%. Since the test accuracy is close to the training accuracy, the model does not appear to be overfitting badly.
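
To round things off, here is a sketch of how the trained model could be used on a single article (index 0 from the test set is an arbitrary choice for illustration):

X_sample = X_test[0]                 # TF-IDF vector of one article
prediction = model.predict(X_sample)

if prediction[0] == 0:
    print('The news is real')
else:
    print('The news is fake')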

That's it for this project. Be sure to like, share and keep the discussion going in the comment section. The .ipynb file containing the full code can be found here.