LR Model in Spark - MPG

This is an expansion on the Build LR & Test/Evaluate LR Model in Python - MPG examples, but this time we’ll use SparkML to build a linear regression (LR) model that predicts the MPG of a car.

Objectives


  1. Use PySpark to connect to a Spark cluster.
  2. Create a Spark session.
  3. Read a CSV file into a DataFrame.
  4. Split the dataset into training and testing sets.
  5. Use VectorAssembler to combine multiple columns into a single vector column.
  6. Use Linear Regression to build a prediction model.
  7. Use metrics to evaluate the model.
  8. Stop the Spark session.

Setup


We will be using the following libraries:

  • PySpark for connecting to the Spark Cluster

Suppress Warnings

To suppress warnings generated by our code, we’ll use this code block:

# To suppress warnings generated by the code
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

Import Libraries

# FindSpark simplifies the process of using Apache Spark with Python
import findspark
findspark.init()

from pyspark.sql import SparkSession

#import functions/Classes for sparkml
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# import functions/Classes for metrics
from pyspark.ml.evaluation import RegressionEvaluator

Start Spark Session - Task 1


#Create SparkSession
#Ignore any warnings by SparkSession command

spark = SparkSession.builder.appName("Regressing using SparkML").getOrCreate()

Data - Task 2


Download Data Locally

import wget
wget.download("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/mpg.csv")

Load CSV into SparkDF

# Using the spark.read.csv function we load the data into a DataFrame.
# header=True indicates that there is a header row in our CSV file.
# inferSchema=True tells Spark to automatically infer the data types of the columns.

# Load mpg dataset
mpg_data = spark.read.csv("mpg.csv", header=True, inferSchema=True)

View Schema

mpg_data.printSchema()
root
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Engine Disp: double (nullable = true)
 |-- Horsepower: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Accelerate: double (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Origin: string (nullable = true)

View Data

mpg_data.show(5)
+----+---------+-----------+----------+------+----------+----+--------+
| MPG|Cylinders|Engine Disp|Horsepower|Weight|Accelerate|Year|  Origin|
+----+---------+-----------+----------+------+----------+----+--------+
|15.0|        8|      390.0|       190|  3850|       8.5|  70|American|
|21.0|        6|      199.0|        90|  2648|      15.0|  70|American|
|18.0|        6|      199.0|        97|  2774|      15.5|  70|American|
|16.0|        8|      304.0|       150|  3433|      12.0|  70|American|
|14.0|        8|      455.0|       225|  3086|      10.0|  70|American|
+----+---------+-----------+----------+------+----------+----+--------+
only showing top 5 rows

Identify Label/Input - Task 3


As in the Python example, we first identify the Target and the Features:

Target

In LR models we aim to predict the Target value given the input data.

So, in this example we are trying to predict the MPG, which is the Target column in our table.

Features

The features are the data columns we provide to the model as input, from which we want it to predict the Target value/column.

In the original Python example we provided the model with just these two features and saw how accurately it predicted the MPG:

  • Horsepower
  • Weight

In our case here we will use the VectorAssembler to group several inputCols into a single vector column, which will be the input to our model. All of the combined input columns are output as the outputCol named features.

# Prepare feature vector
assembler = VectorAssembler(inputCols=["Cylinders", "Engine Disp", "Horsepower", "Weight", "Accelerate", "Year"], outputCol="features")
mpg_transformed_data = assembler.transform(mpg_data)
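Conceptually, all the assembler does per row is concatenate the listed columns, in order, into one vector. A minimal pure-Python sketch of that idea, using the values from the first row of the dataset shown earlier:

```python
# Pure-Python sketch of what VectorAssembler does for a single row:
# concatenate the input columns, in order, into one feature vector.
row = {"Cylinders": 8, "Engine Disp": 390.0, "Horsepower": 190,
       "Weight": 3850, "Accelerate": 8.5, "Year": 70}
input_cols = ["Cylinders", "Engine Disp", "Horsepower",
              "Weight", "Accelerate", "Year"]

features = [float(row[c]) for c in input_cols]
print(features)  # [8.0, 390.0, 190.0, 3850.0, 8.5, 70.0]
```

This is why, in the preview below, the first element of each features vector matches the row's Cylinders value, the second its Engine Disp, and so on.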

Preview Columns

Let’s take a look at the assembled features (the collection of all the columns we’ll use as input), and the Target column MPG which will be the value we wish for the model to predict.

  • As you see in the output, the first column is the combination of the columns listed in inputCols above
mpg_transformed_data.select("features","MPG").show()

+--------------------+----+
|            features| MPG|
+--------------------+----+
|[8.0,390.0,190.0,...|15.0|
|[6.0,199.0,90.0,2...|21.0|
|[6.0,199.0,97.0,2...|18.0|
|[8.0,304.0,150.0,...|16.0|
|[8.0,455.0,225.0,...|14.0|
|[8.0,350.0,165.0,...|15.0|
|[8.0,307.0,130.0,...|18.0|
|[8.0,454.0,220.0,...|14.0|
|[8.0,400.0,150.0,...|15.0|
|[8.0,307.0,200.0,...|10.0|
|[8.0,383.0,170.0,...|15.0|
|[8.0,318.0,210.0,...|11.0|
|[8.0,360.0,215.0,...|10.0|
|[8.0,429.0,198.0,...|15.0|
|[6.0,200.0,85.0,2...|21.0|
|[8.0,302.0,140.0,...|17.0|
|[8.0,304.0,193.0,...| 9.0|
|[8.0,340.0,160.0,...|14.0|
|[6.0,198.0,95.0,2...|22.0|
|[8.0,440.0,215.0,...|14.0|
+--------------------+----+
only showing top 20 rows

Split the Data - Task 4


We split the data using a 70/30 ratio: 70% for training and 30% for testing, with a random seed of 42. The seed controls the shuffling applied to the data before the split.

We set it so the same split can be reproduced across multiple runs.

# Split data into training and testing sets
(training_data, testing_data) = mpg_transformed_data.randomSplit([0.7, 0.3], seed=42)
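The effect of the seed can be illustrated with a small pure-Python sketch (plain random.Random standing in for Spark's randomSplit): the same seed always produces the same shuffle, and therefore the same split.

```python
import random

def split(rows, ratio, seed):
    # Shuffle a copy of the rows with a fixed seed, then cut at the ratio.
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(ratio * len(rows))
    return rows[:cut], rows[cut:]

data = list(range(10))
train_a, test_a = split(data, 0.7, seed=42)
train_b, test_b = split(data, 0.7, seed=42)

print(len(train_a), len(test_a))                # 7 3
print(train_a == train_b and test_a == test_b)  # True: same seed, same split
```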

Build/Train LR Model - Task 5


# Train linear regression model
# Ignore any warnings

lr = LinearRegression(featuresCol="features", labelCol="MPG")
model = lr.fit(training_data)
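Under the hood, fitting a linear regression means finding coefficients that minimize the squared error between predictions and labels. For a single feature this has a simple closed form; the sketch below applies it to a tiny Weight-vs-MPG sample taken from the first four rows shown earlier (illustrative only, not the full fit Spark performs over all six features).

```python
# Closed-form simple linear regression on a tiny sample:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
x = [3850.0, 2648.0, 2774.0, 3433.0]   # Weight, from the first four rows above
y = [15.0, 21.0, 18.0, 16.0]           # MPG, from the same rows

n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx

print(slope < 0)  # True: heavier cars get fewer miles per gallon
```

In Spark, the analogous fitted values are exposed on the trained model as model.coefficients and model.intercept.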

Evaluate Model - Task 6


Predict MPG

Our model was trained on 70% of the data, so now we use the remaining test data to make predictions.

# Make predictions on testing data
predictions = model.transform(testing_data)

R-Squared

Higher values indicate better performance.

#R-squared (R2): R2 is a statistical measure that represents the proportion of variance
#in the dependent variable (target) that is explained by the independent variables (features).


evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)
print("R Squared =", r2)

# OUTPUT
R Squared = 0.7685170910359864

Root Mean Squared Error

Lower values indicate better performance

#Root Mean Squared Error (RMSE): RMSE is the square root of the average of the squared differences
#between the predicted and actual values. It measures the average distance between the predicted
#and actual values, and lower values indicate better performance.

evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("RMSE =", rmse)

# OUTPUT
RMSE = 3.7870898850307

Mean Absolute Error

Lower values indicate better performance

#Mean Absolute Error (MAE): MAE is the average of the absolute differences between the predicted and
#actual values. It measures the average absolute distance between the predicted and actual values, and
#lower values indicate better performance.

evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="mae")
mae = evaluator.evaluate(predictions)
print("MAE =", mae)

# OUTPUT
MAE = 2.990444141294136
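What RegressionEvaluator computes for each of these metrics can be written out directly. A pure-Python sketch on a small set of hypothetical actual/predicted pairs (made-up values, for illustration only):

```python
import math

actual    = [15.0, 21.0, 18.0, 16.0]   # hypothetical labels
predicted = [14.2, 20.1, 19.3, 15.5]   # hypothetical model predictions

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]

mae  = sum(abs(e) for e in errors) / n              # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / n)    # root mean squared error

mean_a = sum(actual) / n
ss_res = sum(e * e for e in errors)                 # residual sum of squares
ss_tot = sum((a - mean_a) ** 2 for a in actual)     # total sum of squares
r2 = 1 - ss_res / ss_tot                            # proportion of variance explained

print(round(mae, 3), round(rmse, 3), round(r2, 3))
```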

Observations


What’s noticeable is that in this exercise we used multiple columns as input to predict the MPG, while the original Python example used just TWO features (Horsepower & Weight); yet the metrics of that LR model were slightly better than what we just obtained here.

Stop Spark Session - Task 7


spark.stop()