Project 7 - Car Prices Prediction using Machine Learning with Python
My Machine Learning Beginner Projects, Entry 7
I am Salim Olanrewaju Oyinlola. I identify as a Machine Learning and Artificial Intelligence enthusiast who is quite fond of making use of data and patterns to develop insights and analysis.
In my opinion, machine learning is where the computational and algorithmic skills of data science meet the statistical thinking of data science. The result is a collection of approaches that requires effective theory as much as effective computation. There is a plethora of machine learning models, each working best for different problems. As such, I believe understanding the problem setting in machine learning is essential to using these tools effectively. Now, the best way to UNDERSTAND different problem settings is by PLAYING AROUND with different problem settings. That is the genesis of this writing series - My Machine Learning Projects. Over the course of this writing series, I will solve a machine learning problem daily. These problems will span a range of fields and cover a range of models. A link to my previous articles can be found here.
Project Description: This project involves the prediction of the price of buying a car based on certain attributes.
URL to Dataset: Download here
Line-by-line explanation of Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn import metrics
The block of code above imports the third-party libraries used in the model.
import pandas as pd
imports the pandas library which is used to analyze data.
import matplotlib.pyplot as plt
imports the pyplot module from the matplotlib library, which is used to visualize data and trends in the data.
import seaborn as sns
imports the seaborn library which is used for making statistical graphics. It builds on top of matplotlib and integrates closely with pandas data structures. Seaborn helps explore and understand your data.
from sklearn.model_selection import train_test_split
imports the train_test_split function from sklearn's model_selection module. It will be used to split arrays or matrices into random train and test subsets.
from sklearn.linear_model import Lasso
imports the Lasso linear regression machine learning model from sklearn's linear_model module.
from sklearn import metrics
imports the metrics module from the sklearn library. This module is used to ascertain the performance of our model.
salim_car_dataset = pd.read_csv(r'C:\Users\OYINLOLA SALIM O\Downloads\car data.csv')
This line of code loads the data from the csv file into a pandas dataframe named salim_car_dataset.
salim_car_dataset.head()
This line of code displays the first 5 rows of the dataframe.
salim_car_dataset.shape
This line of code checks the number of rows and columns. The observed output is (301, 9), i.e. 301 rows and 9 columns.
salim_car_dataset.isnull().sum()
This line of code counts the missing values in each column. The output shows that there are no missing values in the dataset.
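To see what this check reports when values really are missing, here is a minimal sketch on a small made-up dataframe. The column names are borrowed from the dataset, but the values are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; this is NOT the real car dataset.
toy = pd.DataFrame({
    "Selling_Price": [3.5, np.nan, np.nan],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol"],
    "Transmission": ["Manual", None, "Automatic"],
})

# isnull() marks every missing cell as True; sum() then counts the Trues per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

A column with a count of 0 has no gaps, which is what the real dataset shows for every column.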
print(salim_car_dataset.Fuel_Type.value_counts())
print(salim_car_dataset.Seller_Type.value_counts())
print(salim_car_dataset.Transmission.value_counts())
This block of code checks the distribution of categorical data in the Fuel_Type, Seller_Type and Transmission columns of the dataset.
In the Fuel_Type column, we see:
Petrol: 239
Diesel: 60
CNG: 2
In the Seller_Type column, we see:
Dealer: 195
Individual: 106
In the Transmission column, we see:
Manual: 261
Automatic: 40
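As a small illustration of how value_counts behaves, the sketch below runs it on an invented five-entry series (not the real column). Passing normalize=True returns shares instead of raw counts, which can make the imbalance between categories easier to read.

```python
import pandas as pd

# Invented sample; the real Fuel_Type column has 301 entries.
fuel = pd.Series(["Petrol", "Petrol", "Diesel", "Petrol", "CNG"], name="Fuel_Type")

counts = fuel.value_counts()                # absolute counts, most frequent first
shares = fuel.value_counts(normalize=True)  # the same, as fractions that sum to 1
print(counts)
print(shares)
```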
salim_car_dataset.replace({'Fuel_Type':{'Petrol':0,'Diesel':1,'CNG':2}},inplace=True)
This line of code encodes the Fuel_Type column as integers (Petrol as 0, Diesel as 1 and CNG as 2).
salim_car_dataset.replace({'Seller_Type':{'Dealer':0,'Individual':1}},inplace=True)
This line of code encodes the Seller_Type column as integers (Dealer as 0 and Individual as 1).
salim_car_dataset.replace({'Transmission':{'Manual':0,'Automatic':1}},inplace=True)
This line of code encodes the Transmission column as integers (Manual as 0 and Automatic as 1).
salim_car_dataset.head()
This line of code prints the first five rows of the dataset to show the label encoded dataset.
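The same kind of encoding can be sketched on a toy dataframe. The version below uses Series.map instead of replace, which is an alternative I am swapping in for illustration and not what the project code does. The rows are invented.

```python
import pandas as pd

# Invented rows; only the column names and category labels come from the dataset.
toy = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "Transmission": ["Manual", "Automatic", "Manual", "Manual"],
})

# The same category-to-integer mappings used in the project.
fuel_codes = {"Petrol": 0, "Diesel": 1, "CNG": 2}
transmission_codes = {"Manual": 0, "Automatic": 1}

encoded = toy.copy()  # keep the original frame untouched
encoded["Fuel_Type"] = toy["Fuel_Type"].map(fuel_codes)
encoded["Transmission"] = toy["Transmission"].map(transmission_codes)
print(encoded)
```

One difference worth knowing: map turns any category missing from the dictionary into NaN, whereas replace leaves it unchanged.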
X = salim_car_dataset.drop(['Car_Name','Selling_Price'],axis=1)
Y = salim_car_dataset['Selling_Price']
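This block of code separates the features and the label into X and Y. X holds every column except Car_Name and Selling_Price, while Y holds Selling_Price, the value we want to predict.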
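A quick sketch of this separation on a two-row toy dataframe shows which columns end up where. The car names and numbers are invented for illustration.

```python
import pandas as pd

# Invented rows; Fuel_Type and Transmission are already integer-encoded here.
toy = pd.DataFrame({
    "Car_Name": ["car_a", "car_b"],
    "Selling_Price": [3.5, 4.5],
    "Fuel_Type": [0, 1],
    "Transmission": [0, 0],
})

# axis=1 tells drop to remove columns rather than rows.
X = toy.drop(["Car_Name", "Selling_Price"], axis=1)
Y = toy["Selling_Price"]

print(list(X.columns))
print(Y.name)
```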
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.1, random_state=2)
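The train_test_split function is then called to divide the dataset into a train set and a test set. With test_size = 0.1, about 10% of the rows are held out for testing, and random_state=2 makes the split reproducible.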
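To get a feel for the sizes involved, the sketch below applies the same call to placeholder arrays with 301 rows (the row count reported earlier). The array contents are arbitrary and stand in for the real features and prices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset (301).
X_demo = np.arange(301 * 2).reshape(301, 2)
Y_demo = np.arange(301)

X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo, test_size=0.1, random_state=2)
# Expected sizes: 270 for training and 31 for testing, since the test share is rounded up.
print(len(X_tr), len(X_te))

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, Y_tr2, Y_te2 = train_test_split(X_demo, Y_demo, test_size=0.1, random_state=2)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```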
lass_reg_model = Lasso()
This line of code creates an instance of the Lasso regression model with its default settings.
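For readers wondering what sets Lasso apart from ordinary linear regression, the objective that scikit-learn's Lasso minimizes can be written as below. Here n is the number of training samples, w is the vector of coefficients, and alpha is the regularization strength, which defaults to 1.0 when Lasso() is called without arguments, as it is here.

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

The first term is the usual squared error. The second term penalizes the absolute size of the coefficients, which tends to push the coefficients of less useful features to exactly zero.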
lass_reg_model.fit(X_train,Y_train)
This line of code trains the model with the train dataset (i.e. X_train and Y_train).
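What training produces can be illustrated on synthetic data. In the sketch below, the data is randomly generated (not the car dataset), and the target is built from the first two features only, so the third feature is irrelevant by construction. All names ending in _demo are made up for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 200 samples, 3 features, fixed seed for reproducibility.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
# The target depends on features 0 and 1 only; feature 2 is pure noise.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

demo_model = Lasso()  # default alpha=1.0, as in the project
demo_model.fit(X_demo, y_demo)

print(demo_model.coef_)       # one learned coefficient per feature
print(demo_model.intercept_)  # the learned constant term
train_r2 = demo_model.score(X_demo, y_demo)  # R squared on the data used for fitting
print(train_r2)
```

The learned coefficients are shrunk towards zero relative to the true values of 3 and -2, and the coefficient of the irrelevant feature should come out at or very near zero. That shrinkage is the effect of the L1 penalty.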
training_data_prediction = lass_reg_model.predict(X_train)
This line of code predicts on the Training data.
error_score = metrics.r2_score(Y_train, training_data_prediction)
print("R squared Error : ", error_score)
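This block of code displays the R squared score for the training data. Although the printed label reads "R squared Error", r2_score measures goodness of fit and not error: the closer the value is to 1, the better the fit. The score is 0.8427856123435794.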
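To make the metric concrete, here is a minimal pure-Python sketch of the standard R squared formula, 1 - SS_res / SS_tot, on four invented price pairs. The helper name r_squared is made up for this example; in the project itself the value comes from metrics.r2_score.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Invented values for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.5]

score = r_squared(actual, predicted)
print(score)  # 0.9625: SS_res is 0.75 and SS_tot is 20
```

A perfect prediction gives 1, and always predicting the mean of the actual values gives 0.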
plt.scatter(Y_train, training_data_prediction)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title(" Actual Prices vs Predicted Prices")
plt.show()
This block of code plots the actual prices against the predicted prices for the training data. The closer the points lie to a straight diagonal line, the better the predictions.
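One optional refinement, sketched below on invented numbers, is to draw the diagonal where predicted equals actual so that over- and under-predictions are easy to spot. This is my own addition and not part of the project code. The Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Invented prices for illustration only.
actual = [1.0, 2.5, 4.0, 6.0, 8.5]
predicted = [1.4, 2.2, 4.5, 5.6, 8.0]

fig, ax = plt.subplots()
ax.scatter(actual, predicted)

# Reference line y = x: points above it are over-predicted, points below are under-predicted.
low, high = min(actual + predicted), max(actual + predicted)
ax.plot([low, high], [low, high], linestyle="--", color="grey")

ax.set_xlabel("Actual Price")
ax.set_ylabel("Predicted Price")
ax.set_title("Actual Prices vs Predicted Prices")
```

In a notebook, finish with plt.show() as in the project code.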
test_data_prediction = lass_reg_model.predict(X_test)
This line of code predicts on the test data.
error_score = metrics.r2_score(Y_test, test_data_prediction)
print("R squared Error : ", error_score)
This block of code displays the R squared score for the test data. The printed label reads "R squared Error", but a value closer to 1 means a better fit. The score is 0.8709167941173195.
plt.scatter(Y_test, test_data_prediction)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title(" Actual Prices vs Predicted Prices")
plt.show()
This block of code plots the actual prices against the predicted prices for the test data.
That's it for this project. Be sure to like, share and keep the discussion going in the comment section. The .ipynb file containing the full code can be found here.