library(reticulate)
SparkML
We’ll start by installing Spark and importing PySpark, then move on to SparkML.
PySpark Recap
We already know the details of Spark from the Spark section of this site. Here we’ll set up SparkML, Spark’s machine-learning library, which lets us run many common ML workflows.
Objectives
- Use PySpark to connect to a Spark cluster.
- Create a Spark session.
- Read a CSV file into a DataFrame using the Spark session.
- Stop the Spark session.
Data
- Modified version of car mileage dataset. Original dataset available at https://archive.ics.uci.edu/ml/datasets/auto+mpg
- Modified version of diamonds dataset. Original dataset available at https://www.openml.org/search?type=data&sort=runs&id=42225&status=active
Setup
Install Libraries
- The first code block was meant to be used from within Quarto, but I no longer use it because I cannot connect to Spark from within Quarto.
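For reference, the Python libraries imported below can be installed from a terminal. The package names are assumptions based on the imports used in this post (`pyspark`, `findspark`, `wget`):

```shell
pip install pyspark findspark wget
```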
Suppress Warnings
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')
# import sparksession
from pyspark.sql import SparkSession
# This next library is used to edit or create a schema but is not needed this time
#from pyspark import SparkContext, SparkConf
# Findspark simplifies the process of using Spark with Python
import findspark
findspark.init()
Create Spark Session
- Create a SparkSession.
- Name the application: SparkML Setup
# Create the SparkSession
spark = SparkSession \
    .builder \
    .appName("SparkML Setup") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Get the SparkContext from the session (this avoids needing the
# commented-out SparkContext import above)
sc = spark.sparkContext
Import CSV Data
- Import the cars mpg.csv file from the web
import wget
wget.download("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/mpg.csv")
Load Data into a Spark DataFrame
- The file already has a header, so we’ll set header=True
- Ask Spark to automatically infer the data types of the columns from the data by setting inferSchema=True
mpg_data = spark.read.csv("mpg.csv", header=True, inferSchema=True)
Preview Data
Schema
# Preview Schema
mpg_data.printSchema()
root
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Engine Disp: double (nullable = true)
 |-- Horsepower: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Accelerate: double (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Origin: string (nullable = true)
View Data
# Preview Data
mpg_data.show(5)
+----+---------+-----------+----------+------+----------+----+--------+
| MPG|Cylinders|Engine Disp|Horsepower|Weight|Accelerate|Year| Origin|
+----+---------+-----------+----------+------+----------+----+--------+
|15.0| 8| 390.0| 190| 3850| 8.5| 70|American|
|21.0| 6| 199.0| 90| 2648| 15.0| 70|American|
|18.0| 6| 199.0| 97| 2774| 15.5| 70|American|
|16.0| 8| 304.0| 150| 3433| 12.0| 70|American|
|14.0| 8| 455.0| 225| 3086| 10.0| 70|American|
+----+---------+-----------+----------+------+----------+----+--------+
only showing top 5 rows
View Spark UI
To view the Spark UI when running Spark on the local machine, open:
http://yashaya.lan:4040
Stop Spark
spark.stop()
The Jupyter notebook can be found at ML/Spark_setup.