Project 9 - Medical Insurance Cost Prediction using Machine Learning with Python
My Machine Learning Beginner Projects, Entry 9
I am Salim Olanrewaju Oyinlola. I identify as a Machine Learning and Artificial Intelligence enthusiast who is quite fond of making use of data and patterns to develop insights and analysis.
In my opinion, Machine learning is where the computational and algorithmic skills of data science meets the statistical thinking of data science. The result is a collection of approaches that requires effective theory as much as effective computation. There are a plethora of machine learning model, with each of them working best for different problems. As such, I believe understanding the problem setting in machine learning is essential to using these tools effectively. Now, the best way to UNDERSTAND different problem settings is by PLAYING AROUND with different problem settings. That is the genesis behind this writing series - My Machine Learning Projects. Over the course of this writing series, I would solve a machine learning problem daily. These problems will range from a plethora of fields whilst requiring and covering a range of models. A link to my previous articles can be found here.
Project Description: This project is a regression problem that predicts the medical insurance cost for citizens using certain attributes.
URL to Dataset: Download here
Line-by-line explanation of Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
This block of codes above imports the third-party dependencies needed in the model.
import numpy as np
imports the numpy
library which can be used to perform a wide variety of mathematical operations on arrays.
import pandas as pd
imports the pandas
library which is used to analyze data.
import matplotlib.pyplot as plt
imports the PyPlot function from the MatPlotLib library which is used to visualize data and trends in the data.
import seaborn as sns
imports the seaborn library which is used for making statistical graphics. It builds on top of matplotlib and integrates closely with pandas data structures. Seaborn helps you explore and understand your data.
from sklearn.model_selection import train_test_split
imports the train_test_split function from sklearn's model_selection library. It will be used in spliting arrays or matrices into random train and test subsets.
from sklearn.linear_model import LinearRegression
imports the LinearRegression machine learning model.
from sklearn import metrics
imports the metrics library from the sklearn library. This model is used to ascertain the performance of our model.
salim_insurance_dataset = pd.read_csv(r'C:\Users\OYINLOLA SALIM O\Downloads\insurance.csv')
This loads the data from csv file to a Pandas DataFrame.
salim_insurance_dataset.head()
This prints out the first 5 rows of the dataframe.
salim_insurance_dataset.info()
This gets some informations about the dataset.
salim_insurance_dataset.isnull().sum()
This checks for missing values.
salim_insurance_dataset.describe()
This checks the statistical measures of the dataset.
sns.set()
plt.figure(figsize=(6,6))
sns.distplot(salim_insurance_dataset['age'])
plt.title('Age Distribution')
plt.show()
This displays the distribution of age value.
plt.figure(figsize=(6,6))
sns.countplot(x='sex', data=salim_insurance_dataset)
plt.title('Sex Distribution')
plt.show()
This does the same for the Gender column.
salim_insurance_dataset['sex'].value_counts()
This prints the number of males and females in the dataset.
male
676
female
662
plt.figure(figsize=(6,6))
sns.distplot(salim_insurance_dataset['bmi'])
plt.title('BMI Distribution')
plt.show()
This displays the distribution of the bmi
column.
- It is important to note that Normal BMI Range --> 18.5 to 24.9
plt.figure(figsize=(6,6))
sns.countplot(x='children', data=salim_insurance_dataset)
plt.title('Children')
plt.show()
This displays a countplot of the children column.
salim_insurance_dataset['children'].value_counts()
This displays the number of each unique values of the children
column.
plt.figure(figsize=(6,6))
sns.countplot(x='smoker', data=salim_insurance_dataset)
plt.title('smoker')
plt.show()
Evalauting the smoker
column for better understanding.
salim_insurance_dataset['smoker'].value_counts()
This displays the number of each unique values of the smoker
column.
plt.figure(figsize=(6,6))
sns.countplot(x='region', data=salim_insurance_dataset)
plt.title('region')
plt.show()
Evalauting the region
column for better understanding.
salim_insurance_dataset['region'].value_counts()
This displays the number of each unique values of the region
column.
plt.figure(figsize=(6,6))
sns.distplot(salim_insurance_dataset['charges'])
plt.title('Charges Distribution')
plt.show()
Evalauting the distribution for the charges
column for better understanding.
The next step is label-encoding.
salim_insurance_dataset.replace({'sex':{'male':0,'female':1}}, inplace=True)
This encodes the sex
column. As such, the male
column represents 0
and the female
column represents 1
.
salim_insurance_dataset.replace({'smoker':{'yes':0,'no':1}}, inplace=True)
This encodes the smoker
column. As such, the yes
column represents 0
and the no
column represents 1
.
salim_insurance_dataset.replace({'region':{'southeast':0,'southwest':1,'northeast':2,'northwest':3}}, inplace=True)
This encodes the region
column. As such, the southeast
column represents 0
, the southwest
column represents 1
, the northeast
column represents 2
, the northwest
column represents 3
.
X = salim_insurance_dataset.drop(columns='charges', axis=1)
Y = salim_insurance_dataset['charges']
This splits the Features and Target.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
The train_test_split
method which was imported earlier is hence called and used to divide the dataset into train set and test set.
NOTE: The 0.2
value of test_size implies that 20% of the dataset is kept for testing whilst 80% is used to train the model.
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
This creates an instance of the Linear Regression model and then, training it with the train dataset.
training_data_prediction =regressor.predict(X_train)
r2_train = metrics.r2_score(Y_train, training_data_prediction)
print('R squared vale : ', r2_train)
This predicts on the training set and evaluates the r2-score.
R squared vale : 0.751505643411174
.
test_data_prediction =regressor.predict(X_test)
r2_test = metrics.r2_score(Y_test, test_data_prediction)
print('R squared vale : ', r2_test)
This predicts on the test set and evaluates the r2-score.
R squared vale : 0.7447273869684077
.
input_data = (31,1,25.74,0,1,0)
# changing input_data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)
# reshape the array
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)
prediction = regressor.predict(input_data_reshaped)
print(prediction)
print('The insurance cost is USD ', prediction[0])
At Step 1
, we collect input data from the users.
At Step 2
, we change input_data to a numpy array.
At Step 3
, we reshape the array.
At Step 4
, we do the prediction.
That's it for this project. Be sure to like, share and keep the discussion going in the comment section. .ipynb
file containing the full code can be found here.