Classification
Recap
Remember that there are two types of supervised learning techniques: classification and regression.
Classification is the process of predicting a class label or category.
- Classification techniques predict the class or category of a case, for example, whether a cell is benign or malignant, or whether a customer will churn.
- Another example is the question “Will I pass or fail my biology test?”
- In this case, there are only two possible outcomes, and I can fall into only one of them: either I pass the exam or I fail it.
- Classification answers the question “What category does this fall into?”
- On the other hand, regression answers “What will my biology exam score be?”
Classification is the process of predicting an outcome, known as a “class,” based on some given inputs. What are inputs?
Inputs
Let’s use our pass or fail example, “Will I pass or fail my biology exam?”
Input variables are independent variables or features that are used to make predictions.
- It can be one variable, for example, “Average score on previous biology tests,” or
- multiple variables, such as “Percent of classes attended” and “Number of hours studied.”
The arrow points from the input variables to the question, showing that the inputs help answer it and so predict whether I will pass or fail. In other words, the outcome of my biology exam depends on these input variables, and knowing their values gives me information about the outcome of the test.
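To make the idea of inputs and outcomes concrete, here is a minimal sketch of how the exam example might be laid out as data. The variable names (`feature_names`, `X`, `y`) and all of the numbers are assumptions made up for illustration; they do not come from a real data set.

```python
# Hypothetical feature names taken from the inputs described above.
feature_names = ["avg_previous_score", "percent_classes_attended", "hours_studied"]

# Inputs (features): one row of values per student, in the order of feature_names.
# All values are invented for illustration.
X = [
    [82.0, 95.0, 12.0],
    [48.0, 60.0, 3.0],
    [71.0, 88.0, 8.0],
]

# Outcome (class label) for each student: the thing we want to predict.
y = ["pass", "fail", "pass"]

# Show each student's inputs next to the outcome that depends on them.
for features, label in zip(X, y):
    described = dict(zip(feature_names, features))
    print(described, "->", label)
```

Each row of `X` holds the independent variables for one student, and the matching entry in `y` is the dependent outcome.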
Binary Classification
The previous example focused on predicting one of two classes: pass or fail.
Another example is classifying whether an email is spam or not spam.
When you have only two classes, it is called binary classification.
Multiclass Classification
When you have three or more classes, it is called multiclass classification. Examples include predicting the handwritten digits from 1 to 9, or predicting whether a piece of fruit is an apple, an orange, or a mango.
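The difference between the two comes down to how many distinct classes appear in the labels. The helper below is a small sketch of that rule; the function name `classification_type` and the sample labels are hypothetical.

```python
def classification_type(labels):
    """Return 'binary' for exactly two distinct classes, 'multiclass' for three or more."""
    n_classes = len(set(labels))
    if n_classes == 2:
        return "binary"
    if n_classes >= 3:
        return "multiclass"
    raise ValueError("need at least two distinct classes")


# Invented labels for illustration.
exam_labels = ["pass", "fail", "pass", "pass"]
fruit_labels = ["apple", "orange", "mango", "apple"]

print(classification_type(exam_labels))
print(classification_type(fruit_labels))
```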
Before we go any further, let’s go through some very important terms.
Classifier
A classifier is a machine learning algorithm that is used to solve a classification problem.
Feature
A feature is an independent variable that is used as an input to the model.
Evaluation
When you “evaluate” a model, you are measuring how well its predictions match the actual outcomes.
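One common way to evaluate a classifier is accuracy, the fraction of predictions that match the actual class. The notes above do not name a specific metric, so treat accuracy here as an assumption chosen for illustration; the function name and the sample labels are also made up.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that equal the true class label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one label")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Invented results for five students.
actual = ["pass", "fail", "pass", "pass", "fail"]
predicted = ["pass", "fail", "fail", "pass", "fail"]

print(accuracy(actual, predicted))  # 4 of 5 predictions are correct -> 0.8
```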
Algorithms
You can divide classification algorithms into two groups: lazy learners and eager learners.
Lazy Learner
A “lazy learner” doesn’t have a training phase per se, as it simply stores the training data and waits until it is given test data before making predictions.
- This means that it doesn’t build a generalized model ahead of time, so it takes longer to predict.
KNN
A very popular example is the k-nearest neighbors algorithm, also known as KNN. KNN classifies an unknown data point by finding the k training examples that are closest to it. Then, it assigns the class that is most common among those k nearest examples.
Let’s continue with our exam grade example.
- Consider two sets of points on a plane.
- One set is a class of grey circles that represent the students who failed, and the other set is a class of blue circles that represent the students who passed.
- If I appear on the plane and want to predict whether I will pass or fail, KNN calculates the distance from me to each of the other students based on some inputs.
- Then, it picks the k students closest to me, with k being the number of neighbors it checks.
- Let’s assume k is 5.
- Here, four out of five of my “neighbors” are classified as “pass.”
- Using a majority vote, KNN will classify my grade as a “pass.”
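The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `knn_predict`, the choice of Euclidean distance, the two inputs (hours studied, percent of classes attended), and every data point are assumptions made up for this example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point (Euclidean, by assumption).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the labels of the k closest points.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote: the most common label among the k neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Invented students: (hours studied, percent of classes attended).
points = [(10, 90), (9, 85), (8, 95), (11, 80), (7, 75), (2, 40), (3, 50), (1, 30)]
labels = ["pass", "pass", "pass", "pass", "fail", "fail", "fail", "fail"]

me = (9, 88)
# Four of my five nearest neighbors passed, so the vote is "pass".
print(knn_predict(points, labels, me, k=5))
```

Note that all the work happens inside `knn_predict` at prediction time, which is what makes KNN a lazy learner. In practice, inputs measured on very different scales are usually rescaled first so that one of them does not dominate the distance.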
Eager Learner
The second kind of learner is the “eager learner.”
An eager learner spends a lot of time up front training and generalizing the model, so it spends less time making predictions on the test data set.
Logistic Regression
Logistic regression is one example of an eager learner. This model is used to predict the probability that a case belongs to a class.
For example, given the number of classes I attended, what is the probability that I will pass my biology exam?
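As a rough sketch of how this works, logistic regression passes a weighted sum of the inputs through the sigmoid function, which squeezes any number into a probability between 0 and 1. The `weight` and `bias` values below are hypothetical and were not fitted to any data; a real model would learn them during training. The 0.5 cutoff for turning a probability into a class is also an assumption.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def pass_probability(classes_attended, weight=0.25, bias=-5.0):
    """Estimated probability of passing, using made-up (not fitted) parameters."""
    return sigmoid(weight * classes_attended + bias)


for attended in (10, 20, 30):
    p = pass_probability(attended)
    # Assumed decision rule: predict "pass" when the probability is at least 0.5.
    label = "pass" if p >= 0.5 else "fail"
    print(attended, round(p, 3), label)
```

With these made-up parameters, attending more classes raises the estimated probability of passing, and 20 classes sits exactly at the 0.5 cutoff.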
Decision Trees
Finally, you have decision trees, which are tree-like models that apply a series of “if-then” rules.
In this example, a decision tree classifies whether you will pass or fail based on some rules.
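A decision tree of this kind can be read as nested if-then statements. The sketch below hand-writes such rules to show the shape of the idea; the function name, the two inputs, and the thresholds are all hypothetical. A real decision tree algorithm would learn its rules from training data rather than have them typed in.

```python
def predict_exam_outcome(percent_attended, hours_studied):
    """Hand-written if-then rules standing in for a learned decision tree."""
    # First question: did the student attend most classes? (threshold is made up)
    if percent_attended >= 75:
        # Second question for good attenders: did they study a moderate amount?
        if hours_studied >= 5:
            return "pass"
        return "fail"
    # Poor attenders need far more study time to pass under these invented rules.
    if hours_studied >= 15:
        return "pass"
    return "fail"


print(predict_exam_outcome(percent_attended=90, hours_studied=8))
print(predict_exam_outcome(percent_attended=60, hours_studied=4))
```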
Other advanced algorithms are