# All libraries required for this lab are listed below. They are already installed, so this code block is commented out.
# pip install pandas==1.3.4
# pip install scikit-learn==1.0.2
# pip install numpy==1.21.6
Evaluation of LR Model - Diamonds
Objectives
We will use the diamonds dataset to build, split, train, and test a linear regression model that will predict the price of a diamond.
Here is a list of Tasks we’ll be doing:
- Use Pandas to load data sets.
- Identify the target and features.
- Split the dataset into training and testing sets.
- Use Linear Regression to build/train a model to predict the price of diamonds.
- Test the model and create predictions.
- Use metrics to evaluate the model.
The Jupyter notebook for this lab is found at: Building_and_training_a_model_using_Linear_Regression_1.ipynb
Setup
We will be using the following libraries:
- pandas for managing the data.
- sklearn for machine learning and machine-learning-pipeline related functions.
Install Libraries
Suppress Warnings
To suppress warnings generated by our code, we’ll use this code block
# To suppress warnings generated by the code
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')
Import Libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
#import functions for train test split
from sklearn.model_selection import train_test_split
# import functions for metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from math import sqrt
Data - Task 1
Load
= "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv"
URL2
= pd.read_csv(URL2) df2
|       | s     | carat | cut     | color | clarity | depth | table | price | x    | y    | z    |
|-------|-------|-------|---------|-------|---------|-------|-------|-------|------|------|------|
| 11756 | 11757 | 1.03  | Premium | H     | VS2     | 62.4  | 58.0  | 5078  | 6.47 | 6.39 | 4.01 |
| 53799 | 53800 | 0.66  | Ideal   | F     | VVS2    | 61.2  | 57.0  | 2732  | 5.63 | 5.58 | 3.43 |
| 15163 | 15164 | 1.24  | Ideal   | H     | SI1     | 62.5  | 56.0  | 6095  | 6.92 | 6.88 | 4.31 |
| 46678 | 46679 | 0.59  | Ideal   | E     | SI1     | 62.9  | 57.0  | 1789  | 5.36 | 5.33 | 3.36 |
| 10855 | 10856 | 1.13  | Premium | J     | VS1     | 61.1  | 59.0  | 4873  | 6.74 | 6.67 | 4.10 |
df2.shape
(53940, 11)
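Before choosing the target and features, it can help to take a quick look at the column types and summary statistics. This is a minimal sketch using standard pandas calls and is not part of the original lab code:
# Column names, dtypes, and non-null counts
df2.info()
# Summary statistics for the numeric columns
df2.describe()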
Define Targets/Features - Task 2
Target
In LR models we aim to predict the Target value given the input data.
So, in this example we are trying to predict the price, which is the Target column (the y variable) in our table.
Features
The features are the data columns (the x variables) that we provide to the model as input and from which we want it to predict the Target value/column.
In our example, let's provide the model with these features and see how accurately it can predict the price (a quick correlation check is sketched after the code below):
- carat
- depth
- table
Y = df2['price']
X = df2[['carat','depth','table']]
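As a sanity check on this feature choice, you can look at how strongly each chosen feature correlates with price. This is a hedged sketch using standard pandas calls, not part of the original lab code; in this dataset, carat is typically by far the strongest predictor of price.
# Pairwise correlation of the chosen features with the target
print(df2[['carat','depth','table','price']].corr()['price'])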
Split Dataset - Task 3
- We now split the data at a 75/25 ratio: 75% for training and 25% for testing (the split sizes are verified in the sketch after the code below).
- The random_state variable controls the shuffling applied to the data before applying the split.
- Pass the same integer for reproducible output across multiple function calls
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)
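To confirm the 75/25 split, you can inspect the shapes of the resulting sets. A small check, not part of the original lab code:
# With 53,940 rows, a 75/25 split yields 40,455 training rows and 13,485 test rows
print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)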
Build/Train LR Model - Task 4
Define LR Model
lr2 = LinearRegression()
Train/Fit LR Model
Let's train it.
- Remember the call signature is: lr.fit(features, target)
- The response is the fitted estimator, e.g. LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False) in older scikit-learn versions, or simply LinearRegression() in newer ones.
lr2.fit(X_train, Y_train)
LinearRegression()
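Once fitted, the model's learned parameters can be inspected via the standard scikit-learn attributes. This sketch is not part of the original lab output:
# The fitted model is: price ≈ intercept + coef[0]*carat + coef[1]*depth + coef[2]*table
print("Coefficients:", lr2.coef_)
print("Intercept:", lr2.intercept_)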
Test/Evaluate Model - Task 5
Now that the model has been trained on the data/features provided above, let’s test it
Score
For regression models, score() returns the R-squared of the predictions on the given test data; the higher the score, the better the model.
# Higher the score, better the model. Remember Y is the target price
lr2.score(X_test, Y_test)
0.852949398522144
Predict
In order to calculate the evaluation metrics we need two sets of values:
- The original prices, which come from the test data set and are what we compare the predictions against
- The predicted prices, which are the output of the model
original_values = Y_test
predicted_values = lr2.predict(X_test)
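To get a feel for the predictions before computing metrics, you can line a few actual and predicted prices up side by side. This comparison DataFrame is illustrative only and not part of the original lab code:
# Compare the first few actual vs. predicted prices
comparison = pd.DataFrame({
    'actual_price': original_values.values[:5],
    'predicted_price': predicted_values[:5]
})
print(comparison)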
Evaluation Metrics - Task 6
R squared
The higher the value, the better the model
# Higher the value the better the model
r2_score(original_values, predicted_values)
0.852949398522144
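As a cross-check, R-squared can be computed directly from its definition, R² = 1 - SS_res / SS_tot. The numpy sketch below is not part of the original lab code; it should reproduce the value above up to floating-point error.
import numpy as np
# R^2 = 1 - (sum of squared residuals) / (total sum of squares)
y_true = original_values.values
ss_res = np.sum((y_true - predicted_values) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot)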
Mean Squared Error - MSE
The lower the value the better the model
mean_squared_error(original_values, predicted_values)
np.float64(2310119.635474932)
Root MSE - RMSE
The lower the value the better the model
sqrt(mean_squared_error(original_values, predicted_values))
1519.9077720292544
Mean Absolute Error
The lower the value the better the model
mean_absolute_error(original_values, predicted_values)
np.float64(991.8625215831571)
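For reference, all three error metrics can also be computed directly from the residuals. This numpy sketch is not part of the original lab code and should match the values above:
import numpy as np
residuals = original_values.values - predicted_values
print("MSE :", np.mean(residuals ** 2))            # mean of the squared residuals
print("RMSE:", np.sqrt(np.mean(residuals ** 2)))   # square root of the MSE
print("MAE :", np.mean(np.abs(residuals)))         # mean of the absolute residuals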