Classifier for Cancer Detection

Objectives

We will use a modified version of the Breast Cancer Wisconsin (Diagnostic) dataset to build a classifier that predicts whether a tumor is benign or malignant.

Here is a list of Tasks we’ll be doing:

Use Pandas to load data sets.
Identify the target and features.
Use Logistic Regression to build a classifier.
Use metrics to evaluate the model.
Make predictions using a trained model.

Setup

We will be using the following libraries:

pandas for managing the data.
sklearn for machine learning and machine-learning-pipeline related functions.

Install Libraries

# All libraries required for this lab are listed below. They are already installed, so the commands are commented out.
# !pip install pandas==1.3.4
# !pip install scikit-learn==1.0.2

Suppress Warnings

To suppress the warnings generated by our code, we’ll use the following code block:

# To suppress warnings generated by the code
import warnings

def warn(*args, **kwargs):
    pass

warnings.warn = warn
warnings.filterwarnings('ignore')
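The block above silences every warning for the whole session, which can also hide useful messages. As a narrower alternative, the sketch below uses the standard library's warnings.catch_warnings context manager to ignore a single warning category inside a limited scope. The function noisy is a hypothetical stand-in for a library call that emits a warning; it is not part of this lab.

```python
import warnings


def noisy():
    """Hypothetical stand-in for a library call that emits a warning."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


# Without a filter, the warning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy()
n_before = len(caught)

# Inside this block only, DeprecationWarning is ignored.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    result = noisy()
n_after = len(caught)

print(n_before, n_after, result)
```

The filters set inside a catch_warnings block are restored on exit, so warnings raised elsewhere in the notebook stay visible.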

Import Libraries

import pandas as pd
from sklearn.linear_model import LogisticRegression

Data - Task 1

We use a modified version of the Breast Cancer Wisconsin (Diagnostic) dataset. The original dataset is available at https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

Load

# the data set is available at the url below.
URL2 = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/cancer.csv"

# using the read_csv function in the pandas library, we load the data into a dataframe
df2 = pd.read_csv(URL2)
# show 5 random rows from the dataset
df2.sample(5)

	radius_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	symmetry_mean	diagnosis
135	12.77	81.72	506.3	0.09055	0.05761	0.04711	0.1585	Malignant
299	10.51	66.85	334.2	0.10150	0.06797	0.02495	0.1695	Benign
81	13.34	86.49	520.0	0.10780	0.15350	0.11690	0.1942	Benign
425	10.03	63.19	307.3	0.08117	0.03912	0.00247	0.1630	Benign
518	12.88	84.45	493.1	0.12180	0.16610	0.04825	0.1709	Benign

df2.shape

(569, 8)
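Beyond the shape, it usually helps to check column types, missing values, and how many rows fall in each class before training. The sketch below runs those checks on a toy frame, sample_df (a hypothetical name), built from the five sampled rows shown above so that it works without downloading anything. On the real data, the same calls would be made on df2.

```python
import pandas as pd

# Toy stand-in for df2: the five sampled rows shown above.
sample_df = pd.DataFrame(
    {
        "radius_mean": [12.77, 10.51, 13.34, 10.03, 12.88],
        "perimeter_mean": [81.72, 66.85, 86.49, 63.19, 84.45],
        "area_mean": [506.3, 334.2, 520.0, 307.3, 493.1],
        "smoothness_mean": [0.09055, 0.10150, 0.10780, 0.08117, 0.12180],
        "compactness_mean": [0.05761, 0.06797, 0.15350, 0.03912, 0.16610],
        "concavity_mean": [0.04711, 0.02495, 0.11690, 0.00247, 0.04825],
        "symmetry_mean": [0.1585, 0.1695, 0.1942, 0.1630, 0.1709],
        "diagnosis": ["Malignant", "Benign", "Benign", "Benign", "Benign"],
    },
    index=[135, 299, 81, 425, 518],
)

# Column types: the features should be numeric, the target is text.
print(sample_df.dtypes)

# Number of rows per class; a strong imbalance affects how accuracy is read.
class_counts = sample_df["diagnosis"].value_counts()
print(class_counts)

# Number of missing values per column.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```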

Define Targets/Features - Task 2

Target

In classification models, the target is the column whose value the machine learning model learns to predict.

In this example, the target is the diagnosis column, which labels each tumor as Benign or Malignant.

Features

The features are the data columns we provide to the model as input. The model learns from them how to predict the target.

In our example, let’s provide the model with the seven features below and see how accurately it predicts the diagnosis.

target = df2["diagnosis"]
features = df2[['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'symmetry_mean']]

Build & Train Classifier - Task 3

Logistic Regression Model

Create a Logistic Regression model

classifier2 = LogisticRegression()

Train Logistic Regression Model

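Calling fit trains the model and returns the fitted estimator. With scikit-learn 1.0.2 the cell output is simply LogisticRegression(). Older versions print the full parameter list, for example:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False)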

classifier2.fit(features,target)

LogisticRegression()


Evaluate Model - Task 4

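Now that the model has been trained on the features provided above, let’s evaluate it.

Score

The score method returns the mean accuracy, that is, the fraction of samples classified correctly, so higher is better. Here the model is scored on the same data it was trained on, which tends to overestimate how well it will perform on new data.

# Mean accuracy on the given data. The higher the score, the better the model.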
classifier2.score(features,target)

0.8998242530755711
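Because the score above is computed on the training rows, a held-out test set gives a more realistic estimate. The sketch below is not part of the original lab. It assumes scikit-learn's bundled load_breast_cancer dataset as a stand-in for cancer.csv (its column names differ, for example "mean radius" instead of radius_mean, and its target is encoded as 0 = malignant, 1 = benign), and it raises max_iter as a precaution against convergence problems on unscaled features. The resulting numbers will not match the lab's.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data: the copy of the dataset bundled with scikit-learn.
data = load_breast_cancer(as_frame=True)
columns = [
    "mean radius", "mean perimeter", "mean area", "mean smoothness",
    "mean compactness", "mean concavity", "mean symmetry",
]
X = data.data[columns]
y = data.target  # 0 = malignant, 1 = benign in this bundled copy

# Hold out 25% of the rows; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Train only on the training part.
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Evaluate on rows the model has never seen.
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)

print(f"held-out accuracy: {test_accuracy:.3f}")
print(matrix)  # rows are true classes, columns are predicted classes
```

The confusion matrix separates the two kinds of error, which matters here because missing a malignant tumor is usually more costly than flagging a benign one.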

Predict

Let’s make some predictions:

Classify a tumor with the following measurements:

radius_mean = 13.45
perimeter_mean = 86.6
area_mean = 555.1
smoothness_mean = 0.1022
compactness_mean = 0.08165
concavity_mean = 0.03974
symmetry_mean = 0.1638

classifier2.predict([[13.45,86.6,555.1,0.1022,0.08165,0.03974,0.1638]])

array(['Benign'], dtype=object)
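The predict method returns only the label. To see how confident the model is, predict_proba returns one probability per class, in the order given by classes_. The sketch below is a toy: so that it runs on its own, it fits a separate model (hypothetical name toy_classifier) on just the five sampled rows shown earlier, so its probabilities are not meaningful. With the lab's model, the same calls would be made on classifier2. Passing a DataFrame with the training column names also avoids the feature-name warning that scikit-learn 1.0 and later emit when a model fitted on a DataFrame receives a plain list.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

feature_names = [
    "radius_mean", "perimeter_mean", "area_mean", "smoothness_mean",
    "compactness_mean", "concavity_mean", "symmetry_mean",
]

# Toy training data: the five sampled rows shown earlier (illustration only).
train_X = pd.DataFrame(
    [
        [12.77, 81.72, 506.3, 0.09055, 0.05761, 0.04711, 0.1585],
        [10.51, 66.85, 334.2, 0.10150, 0.06797, 0.02495, 0.1695],
        [13.34, 86.49, 520.0, 0.10780, 0.15350, 0.11690, 0.1942],
        [10.03, 63.19, 307.3, 0.08117, 0.03912, 0.00247, 0.1630],
        [12.88, 84.45, 493.1, 0.12180, 0.16610, 0.04825, 0.1709],
    ],
    columns=feature_names,
)
train_y = ["Malignant", "Benign", "Benign", "Benign", "Benign"]

toy_classifier = LogisticRegression(max_iter=1000).fit(train_X, train_y)

# The new tumor as a one-row DataFrame with the same column names.
new_tumor = pd.DataFrame(
    [[13.45, 86.6, 555.1, 0.1022, 0.08165, 0.03974, 0.1638]],
    columns=feature_names,
)

label = toy_classifier.predict(new_tumor)[0]
probabilities = toy_classifier.predict_proba(new_tumor)[0]

print("predicted label:", label)
for class_name, probability in zip(toy_classifier.classes_, probabilities):
    print(f"P({class_name}) = {probability:.3f}")
```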