SparkML

We’ll start by installing Spark and importing PySpark, then move on to SparkML.

PySpark Recap


We already covered the details of Spark in the Spark section of this site. Here we’ll set up SparkML, the machine learning library that ships with Spark.

Objectives


  • Use PySpark to connect to a Spark cluster.
  • Create a Spark session.
  • Read a CSV file into a DataFrame using the Spark session.
  • Stop the Spark session.

Data


Setup


Install Libraries

  • This first code block (R, not Python) is meant to be run from within Quarto. I no longer use it because I cannot connect to Spark from within Quarto.
library(reticulate)

Suppress Warnings

# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')
# findspark simplifies the process of using Spark with Python;
# call init() before importing pyspark so the Spark installation can be located
import findspark
findspark.init()

# Import SparkSession
from pyspark.sql import SparkSession

# SparkContext is needed for SparkContext.getOrCreate() below;
# SparkConf (custom configuration) is not needed this time
from pyspark import SparkContext
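Replacing warnings.warn silences every warning for the rest of the session, including ones we might want to see later. As a narrower alternative, here is a minimal sketch using the standard library's warnings.catch_warnings() so that warnings are ignored only inside one block. The noisy() function is a hypothetical stand-in for whatever call produces the warning.

```python
import warnings


def noisy():
    """Hypothetical stand-in for a library call that emits a warning."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


# Warnings are ignored only inside this block; the previous filter
# settings are restored automatically when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy()

print(result)
```

This keeps the global warning behaviour untouched, which helps when a later cell raises a warning that actually matters.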

Create Spark Session


  • Create a SparkSession
  • Name it: SparkML Setup
# Creating a SparkContext object  
sc = SparkContext.getOrCreate()

spark = SparkSession \
    .builder \
    .appName("SparkML Setup") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

Import CSV Data


  • Download the cars dataset, mpg.csv, from the web
import wget

wget.download("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/mpg.csv")
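Re-running the download cell fetches the file again every time. If the wget package is not installed, or we only want to download once, a minimal sketch with the standard library could look like this. The fetch_once helper and the MPG_URL constant are names I made up for illustration; the URL is the one from the wget call above.

```python
import os
import urllib.request

# URL copied from the wget call above.
MPG_URL = (
    "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/"
    "IBM-BD0231EN-SkillsNetwork/datasets/mpg.csv"
)


def fetch_once(url, dest):
    """Download url to dest unless dest already exists; return dest."""
    if not os.path.exists(dest):
        # strip() guards against stray whitespace around the URL
        urllib.request.urlretrieve(url.strip(), dest)
    return dest


# Usage (needs network access):
# fetch_once(MPG_URL, "mpg.csv")
```

The file name returned by fetch_once can then be passed straight to spark.read.csv.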

Load Data into Spark DataFrame

  • The file already has a header, so we’ll set header=True
  • Ask Spark to infer the data types of the columns from the data by setting inferSchema=True
mpg_data = spark.read.csv("mpg.csv", header=True, inferSchema=True)

Preview Data

Schema

# Preview Schema
mpg_data.printSchema()

root
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Engine Disp: double (nullable = true)
 |-- Horsepower: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Accelerate: double (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Origin: string (nullable = true)

View Data

# Preview Data
mpg_data.show(5)
+----+---------+-----------+----------+------+----------+----+--------+
| MPG|Cylinders|Engine Disp|Horsepower|Weight|Accelerate|Year|  Origin|
+----+---------+-----------+----------+------+----------+----+--------+
|15.0|        8|      390.0|       190|  3850|       8.5|  70|American|
|21.0|        6|      199.0|        90|  2648|      15.0|  70|American|
|18.0|        6|      199.0|        97|  2774|      15.5|  70|American|
|16.0|        8|      304.0|       150|  3433|      12.0|  70|American|
|14.0|        8|      455.0|       225|  3086|      10.0|  70|American|
+----+---------+-----------+----------+------+----------+----+--------+
only showing top 5 rows

View Spark UI


To view the UI when running Spark on the local machine, open port 4040 (the default Spark UI port) on the host running the driver. On my machine that is:

http://yashaya.lan:4040
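The hostname above is specific to my machine, and Spark picks another port if 4040 is already taken. Rather than hard-coding the address, we can ask the running session for it through the uiWebUrl attribute of its SparkContext. A small sketch, where spark_ui_url is a helper name I made up:

```python
def spark_ui_url(session):
    """Return the web UI address reported by the session's SparkContext."""
    return session.sparkContext.uiWebUrl


# Usage, while the session is still running (before spark.stop()):
# print(spark_ui_url(spark))
```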

Stop Spark


spark.stop()

The Jupyter notebook can be found at ML/Spark_setup.