```python
# To suppress warnings generated by the code
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')
```
# LR Model in Spark - MPG

This is an expansion of the Build LR and Test/Evaluate LR Model in Python - MPG examples, but this time we'll use SparkML to build a linear regression (LR) model that predicts the MPG of a car.
## Objectives

- Use PySpark to connect to a Spark cluster.
- Create a Spark session.
- Read a CSV file into a data frame.
- Split the dataset into training and testing sets.
- Use VectorAssembler to combine multiple columns into a single vector column.
- Use Linear Regression to build a prediction model.
- Use metrics to evaluate the model.
- Stop the Spark session.
## Setup

We will be using the following libraries:

- `PySpark`, for connecting to the Spark cluster.
## Suppress Warnings

To suppress warnings generated by our code, we use the code block shown at the top of this notebook.
## Import Libraries

```python
# FindSpark simplifies the process of using Apache Spark with Python
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Import functions/classes for SparkML
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Import functions/classes for metrics
from pyspark.ml.evaluation import RegressionEvaluator
```
## Start Spark Session - Task 1

```python
# Create SparkSession
# Ignore any warnings from the SparkSession command
spark = SparkSession.builder.appName("Regressing using SparkML").getOrCreate()
```
## Data - Task 2

- Modified version of the car mileage dataset. Available at https://archive.ics.uci.edu/ml/datasets/auto+mpg
- Modified version of the diamonds dataset. Available at https://www.openml.org/search?type=data&sort=runs&id=42225&status=active
### Download Data Locally

```python
import wget

wget.download("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/mpg.csv")
```
### Load CSV into SparkDF

```python
# Using the spark.read.csv function we load the data into a dataframe.
# header=True indicates that there is a header row in our CSV file.
# inferSchema=True tells Spark to automatically infer the data types of the columns.

# Load the mpg dataset
mpg_data = spark.read.csv("mpg.csv", header=True, inferSchema=True)
```
### View Schema

```python
mpg_data.printSchema()
```

```
root
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Engine Disp: double (nullable = true)
 |-- Horsepower: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Accelerate: double (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Origin: string (nullable = true)
```
### View Data

```python
mpg_data.show(5)
```

```
+----+---------+-----------+----------+------+----------+----+--------+
| MPG|Cylinders|Engine Disp|Horsepower|Weight|Accelerate|Year|  Origin|
+----+---------+-----------+----------+------+----------+----+--------+
|15.0|        8|      390.0|       190|  3850|       8.5|  70|American|
|21.0|        6|      199.0|        90|  2648|      15.0|  70|American|
|18.0|        6|      199.0|        97|  2774|      15.5|  70|American|
|16.0|        8|      304.0|       150|  3433|      12.0|  70|American|
|14.0|        8|      455.0|       225|  3086|      10.0|  70|American|
+----+---------+-----------+----------+------+----------+----+--------+
only showing top 5 rows
```
## Identify Label/Input - Task 3
Here is how we did this step in Python:

### Target

In LR models we aim to predict the target value given the input data. So, in this example we are trying to predict MPG, which is the target column in our table.
### Features

The features are the data columns we provide to the model as input, from which we want it to predict the target value/column. In the Python example we provided the model with just these two features and checked how accurately it predicted the MPG:

- Horsepower
- Weight

In this Spark version we will instead use the `VectorAssembler` to group several `inputCols` into a single column which, as defined, will be the input for our model. All the combined input columns are emitted as the `outputCol` named `features`.
```python
# Prepare the feature vector
assembler = VectorAssembler(
    inputCols=["Cylinders", "Engine Disp", "Horsepower", "Weight", "Accelerate", "Year"],
    outputCol="features",
)
mpg_transformed_data = assembler.transform(mpg_data)
```
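Conceptually, all `VectorAssembler` does is pack the listed input columns, row by row, into a single vector column while keeping the remaining columns. Here is a minimal pure-Python sketch of that idea (toy rows mirroring the first two cars in the dataset; plain lists stand in for Spark's vector type):

```python
# Toy rows mirroring the first two cars in the dataset
rows = [
    {"Cylinders": 8, "Engine Disp": 390.0, "Horsepower": 190,
     "Weight": 3850, "Accelerate": 8.5, "Year": 70, "MPG": 15.0},
    {"Cylinders": 6, "Engine Disp": 199.0, "Horsepower": 90,
     "Weight": 2648, "Accelerate": 15.0, "Year": 70, "MPG": 21.0},
]

input_cols = ["Cylinders", "Engine Disp", "Horsepower", "Weight", "Accelerate", "Year"]

def assemble(row, input_cols):
    """Pack the chosen columns into one 'features' vector, keeping the other columns."""
    out = dict(row)
    out["features"] = [float(row[c]) for c in input_cols]
    return out

assembled = [assemble(r, input_cols) for r in rows]
print(assembled[0]["features"])  # [8.0, 390.0, 190.0, 3850.0, 8.5, 70.0]
```

The real `VectorAssembler` works the same way, but lazily over a distributed DataFrame.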
### Preview Columns

Let's take a look at the assembled `features` column (the combination of all the columns we use as input) and the target column `MPG`, the value we want the model to predict.

- As you can see in the output, the first column is the combination of the columns listed in `inputCols` above.
```python
mpg_transformed_data.select("features", "MPG").show()
```

```
+--------------------+----+
|            features| MPG|
+--------------------+----+
|[8.0,390.0,190.0,...|15.0|
|[6.0,199.0,90.0,2...|21.0|
|[6.0,199.0,97.0,2...|18.0|
|[8.0,304.0,150.0,...|16.0|
|[8.0,455.0,225.0,...|14.0|
|[8.0,350.0,165.0,...|15.0|
|[8.0,307.0,130.0,...|18.0|
|[8.0,454.0,220.0,...|14.0|
|[8.0,400.0,150.0,...|15.0|
|[8.0,307.0,200.0,...|10.0|
|[8.0,383.0,170.0,...|15.0|
|[8.0,318.0,210.0,...|11.0|
|[8.0,360.0,215.0,...|10.0|
|[8.0,429.0,198.0,...|15.0|
|[6.0,200.0,85.0,2...|21.0|
|[8.0,302.0,140.0,...|17.0|
|[8.0,304.0,193.0,...| 9.0|
|[8.0,340.0,160.0,...|14.0|
|[6.0,198.0,95.0,2...|22.0|
|[8.0,440.0,215.0,...|14.0|
+--------------------+----+
only showing top 20 rows
```
## Split the Data - Task 4

We split the data using a 70/30 ratio: 70% for training and 30% for testing, with a random seed of 42. The seed controls the shuffling applied to the data before the split; setting it makes it possible to reproduce the same split across multiple runs.
```python
# Split the data into training and testing sets
(training_data, testing_data) = mpg_transformed_data.randomSplit([0.7, 0.3], seed=42)
```
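To see why fixing the seed makes the split reproducible, here is a pure-Python sketch of a seeded 70/30 split (illustrative only: Spark's `randomSplit` samples each record independently, so its split sizes are approximate, while this sketch cuts at an exact fraction):

```python
import random

def random_split(data, fractions, seed):
    """Shuffle a copy of the data with a fixed seed, then cut it by the first fraction."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * fractions[0])
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))
train_a, test_a = random_split(data, [0.7, 0.3], seed=42)
train_b, test_b = random_split(data, [0.7, 0.3], seed=42)

print(len(train_a), len(test_a))  # 7 3
print(train_a == train_b)         # True: same seed -> same shuffle -> same split
```

With a different seed (or no seed at all), the two calls would generally produce different splits.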
## Build/Train LR Model - Task 5

```python
# Train the linear regression model
# Ignore any warnings
lr = LinearRegression(featuresCol="features", labelCol="MPG")
model = lr.fit(training_data)
```
## Evaluate Model - Task 6

### Predict MPG

Our model was trained on 70% of the data, so let's now use the held-out test data to make predictions.
```python
# Make predictions on the testing data
predictions = model.transform(testing_data)
```
### R-Squared

Higher values indicate better performance.

```python
# R-squared (R2): R2 is a statistical measure that represents the proportion of variance
# in the dependent variable (target) that is explained by the independent variables (features).
evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)
print("R Squared =", r2)
```

```
R Squared = 0.7685170910359864
```
### Root Mean Squared Error

Lower values indicate better performance.

```python
# Root Mean Squared Error (RMSE): RMSE is the square root of the average of the squared
# differences between the predicted and actual values. It measures the average distance
# between the predicted and actual values; lower values indicate better performance.
evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("RMSE =", rmse)
```

```
RMSE = 3.7870898850307
```
### Mean Absolute Error

Lower values indicate better performance.

```python
# Mean Absolute Error (MAE): MAE is the average of the absolute differences between the
# predicted and actual values. It measures the average absolute distance between the
# predicted and actual values; lower values indicate better performance.
evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="mae")
mae = evaluator.evaluate(predictions)
print("MAE =", mae)
```

```
MAE = 2.990444141294136
```
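The three metrics above can also be computed by hand, which makes clear what each `RegressionEvaluator` call is measuring. Here is a pure-Python sketch on toy actual/predicted values (illustrative numbers, not the model's real output):

```python
# Toy actual MPG values and hypothetical predictions
actual    = [15.0, 21.0, 18.0, 16.0]
predicted = [14.0, 22.0, 17.5, 18.0]

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]

# MAE: average of the absolute errors
mae = sum(abs(e) for e in errors) / n

# RMSE: square root of the mean squared error
rmse = (sum(e * e for e in errors) / n) ** 0.5

# R2: 1 - (residual sum of squares / total sum of squares)
mean_actual = sum(actual) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(round(mae, 3), round(rmse, 3), round(r2, 3))  # 1.125 1.25 0.702
```

Spark applies exactly these formulas, just distributed across the cluster over the full predictions DataFrame.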
## Observations

Notably, in this exercise we used multiple columns as input to predict the MPG, whereas the original Python example used just two features (Horsepower and Weight); yet the metrics of that LR model were slightly better than the ones we obtained here.
## Stop Spark Session

```python
spark.stop()
```