library(reticulate)
SparkML
We’ll start by installing Spark and importing PySpark, then move on to SparkML.
PySpark Recap
We already know the details of Spark from the Spark section of this site. Here we’ll set up SparkML, Spark’s machine-learning library, which lets us run many common ML workflows.
Objectives
- Use PySpark to connect to a Spark cluster.
- Create a Spark session.
- Read a CSV file into a DataFrame using the Spark session.
- Stop the Spark session.
Data
- Modified version of car mileage dataset. Original dataset available at https://archive.ics.uci.edu/ml/datasets/auto+mpg
- Modified version of diamonds dataset. Original dataset available at https://www.openml.org/search?type=data&sort=runs&id=42225&status=active
Setup
Install Libraries
- The first code block was meant to be used from within Quarto, but I no longer use it because I cannot connect to Spark from within Quarto.
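For reference, the Python libraries imported below can be installed from a terminal. The package names are assumptions based on the imports used in this post (`pyspark`, `findspark`, `wget`):

```shell
pip install pyspark findspark wget
```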
Suppress Warnings
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')
# import sparksession
from pyspark.sql import SparkSession
# This next library is used to edit or create a schema but is not needed this time
#from pyspark import SparkContext, SparkConf
# Findspark simplifies the process of using Spark with Python
import findspark
findspark.init()
Create Spark Session
- Create a SparkSession.
- Name the application: SparkML Setup
# Create the SparkSession
spark = SparkSession \
    .builder \
    .appName("SparkML Setup") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Get the SparkContext from the session (this avoids needing the
# commented-out SparkContext import above)
sc = spark.sparkContext
Import CSV Data
- Import the cars mpg.csv file from the web
import wget
wget.download("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/mpg.csv")
Load Data into a Spark DataFrame
- The file already has a header, so we’ll set header=True
- Ask Spark to automatically infer the data types of the columns from the data by setting inferSchema=True
mpg_data = spark.read.csv("mpg.csv", header=True, inferSchema=True)
Preview Data
Schema
# Preview Schema
mpg_data.printSchema()
root
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Engine Disp: double (nullable = true)
 |-- Horsepower: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Accelerate: double (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Origin: string (nullable = true)
View Data
# Preview Data
mpg_data.show(5)
+----+---------+-----------+----------+------+----------+----+--------+
| MPG|Cylinders|Engine Disp|Horsepower|Weight|Accelerate|Year| Origin|
+----+---------+-----------+----------+------+----------+----+--------+
|15.0| 8| 390.0| 190| 3850| 8.5| 70|American|
|21.0| 6| 199.0| 90| 2648| 15.0| 70|American|
|18.0| 6| 199.0| 97| 2774| 15.5| 70|American|
|16.0| 8| 304.0| 150| 3433| 12.0| 70|American|
|14.0| 8| 455.0| 225| 3086| 10.0| 70|American|
+----+---------+-----------+----------+------+----------+----+--------+
only showing top 5 rows
View Spark UI
To view the Spark UI when running Spark on the local machine, open:
http://yashaya.lan:4040
Stop Spark
spark.stop()
The Jupyter notebook can be found at ML/Spark_setup.