import numpy as np
import matplotlib.pyplot as plt
from utils import *
%matplotlib inline
Anomaly Computer Servers
Letβs implement an anomaly detection algorithm to detect anomalous behavior in server computers.
The dataset contains two features -
- throughput (mb/s) and
- latency (ms) of response of each server.
While your servers were operating, you collected π=307m=307 examples of how they were behaving, and thus have an unlabeled dataset {π₯(1),β¦,π₯(π)}{x(1),β¦,x(m)}.
- You suspect that the vast majority of these examples are βnormalβ (non-anomalous) examples of the servers operating normally, but there might also be some examples of servers acting anomalously within this dataset.
You will use a Gaussian model to detect anomalous examples in your dataset.
- You will first start on a 2D dataset that will allow you to visualize what the algorithm is doing.
- On that dataset you will fit a Gaussian distribution and then find values that have very low probability and hence can be considered anomalies.
- After that, you will apply the anomaly detection algorithm to a larger dataset with many dimensions.
Data
The load_data()
function shown below loads the data into the variables X_train
, X_val
and y_val
- You will use
X_train
to fit a Gaussian distribution - You will use
X_val
andy_val
as a cross validation set to select a threshold and determine anomalous vs normal examples
# Load the dataset
= load_data() X_train, X_val, y_val
# Display the first five elements of X_train
print("The first 5 elements of X_train are:\n", X_train[:5])
# Display the first five elements of X_val
print("The first 5 elements of X_val are\n", X_val[:5])
# Display the first five elements of y_val
print("The first 5 elements of y_val are\n", y_val[:5])
print ('The shape of X_train is:', X_train.shape)
print ('The shape of X_val is:', X_val.shape)
print ('The shape of y_val is: ', y_val.shape)
The shape of X_train is: (307, 2)
The shape of X_val is: (307, 2)
The shape of y_val is: (307,)
Before starting on any task, it is often useful to understand the data by visualizing it.
- For this dataset, you can use a scatter plot to visualize the data (
X_train
), since it has only two properties to plot (throughput and latency) - Your plot should look similar to the one below
# Create a scatter plot of the data. To change the markers to blue "x",
# we used the 'marker' and 'c' parameters
0], X_train[:, 1], marker='x', c='b')
plt.scatter(X_train[:,
# Set the title
"The first dataset")
plt.title(# Set the y-axis label
'Throughput (mb/s)')
plt.ylabel(# Set the x-axis label
'Latency (ms)')
plt.xlabel(# Set axis range
0, 30, 0, 30])
plt.axis([ plt.show()
Gaussian DIstribution
Fit model to the dataβs distributio
- Given a training set {π₯(1),β¦,π₯(π)}{x(1),β¦,x(m)} you want to estimate the Gaussian distribution for each of the features π₯πxi.
- Recall that the Gaussian distribution is given by
- where πΞΌ is the mean and π2Ο2 is the variance.
- For each feature π=1β¦πi=1β¦n, you need to find parameters ππΞΌi and π2πΟi2 that fit the data in the πi-th dimension {π₯(1)π,β¦,π₯(π)π}{xi(1),β¦,xi(m)} (the πi-th dimension of each example).
Estimate the parameters ( \(\mu\) i, \(\sigma\) i2) of the i-th feature by using the following equations.
Estimate the mean
and for the variance you will use
def estimate_gaussian(X):
"""
Calculates mean and variance of all features
in the dataset
Args:
X (ndarray): (m, n) Data matrix
Returns:
mu (ndarray): (n,) Mean of all features
var (ndarray): (n,) Variance of all features
"""
= X.shape
m, n
= 1 / m * np.sum(X, axis = 0)
mu = 1 / m * np.sum((X - mu) ** 2, axis = 0)
var
return mu, var
# Estimate mean and variance of each feature
= estimate_gaussian(X_train)
mu, var
print("Mean of each feature:", mu)
print("Variance of each feature:", var)
Mean of each feature: [14.11222578 14.99771051]
Variance of each feature: [1.83263141 1.70974533]
Visualize Gaussian Distribution
# Returns the density of the multivariate normal
# at each data point (row) of X_train
= multivariate_gaussian(X_train, mu, var)
p
#Plotting code
visualize_fit(X_train, mu, var)
Select Threshold
Now that you have estimated the Gaussian parameters, you can investigate which examples have a very high probability given this distribution and which examples have a very low probability.
- The low probability examples are more likely to be the anomalies in our dataset.
- One way to determine which examples are anomalies is to select a threshold based on a cross validation set.
In this section, you will complete the code in select_threshold
to select the threshold πΞ΅ using the πΉ1F1 score on a cross validation set.
- For this, we will use a cross validation set {(π₯(1)cv,π¦(1)cv),β¦,(π₯(πcv)cv,π¦(πcv)cv)}{(xcv(1),ycv(1)),β¦,(xcv(mcv),ycv(mcv))}, where the label π¦=1y=1 corresponds to an anomalous example, and π¦=0y=0 corresponds to a normal example.
- For each cross validation example, we will compute π(π₯(π)cv)p(xcv(i)). The vector of all of these probabilities π(π₯(1)cv),β¦,π(π₯(πcv)cv)p(xcv(1)),β¦,p(xcv(mcv)) is passed to
select_threshold
in the vectorp_val
. - The corresponding labels π¦(1)cv,β¦,π¦(πcv)cvycv(1),β¦,ycv(mcv) are passed to the same function in the vector
y_val
.
Function Below
To find the best threshold to use for selecting outliers based on the results from the validation set (p_val
) and the ground truth (y_val
).
- In the provided code
select_threshold
, there is already a loop that will try many different values of πΞ΅ and select the best πΞ΅ based on the πΉ1F1 score. - You need to implement code to calculate the F1 score from choosing
epsilon
as the threshold and place the value inF1
.Recall that if an example π₯x has a low probability π(π₯)<πp(x)<Ξ΅, then it is classified as an anomaly.
Then, you can compute precision and recall by:
where
- π‘πtp is the number of true positives: the ground truth label says itβs an anomaly and our algorithm correctly classified it as an anomaly.
- ππfp is the number of false positives: the ground truth label says itβs not an anomaly, but our algorithm incorrectly classified it as an anomaly.
- ππfn is the number of false negatives: the ground truth label says itβs an anomaly, but our algorithm incorrectly classified it as not being anomalous.
- The πΉ1F1 score is computed using precision (ππππprec) and recall (πππrec) as follows:
NOTE
- If you have several binary values in an πn-dimensional binary vector, you can find out how many values in this vector are 0 by using: `np.sum(v == 0)`
- You can also apply a logical *and* operator to such binary vectors. For instance, `predictions` is a binary vector of the size of your number of cross validation set, where the πi-th element is 1 if your algorithm considers π₯(π)cvxcv(i) an anomaly, and 0 otherwise.
- You can then, for example, compute the number of false positives using:
fp = sum((predictions == 1) & (y_val == 0))
def select_threshold(y_val, p_val):
"""
Finds the best threshold to use for selecting outliers
based on the results from a validation set (p_val)
and the ground truth (y_val)
Args:
y_val (ndarray): Ground truth on validation set
p_val (ndarray): Results on validation set
Returns:
epsilon (float): Threshold chosen
F1 (float): F1 score by choosing epsilon as threshold
"""
= 0
best_epsilon = 0
best_F1 = 0
F1
= (max(p_val) - min(p_val)) / 1000
step_size
for epsilon in np.arange(min(p_val), max(p_val), step_size):
= (p_val < epsilon)
predictions
# calculate number of true positives
= np.sum((predictions == 1) & (y_val == 1))
tp # calculate number of false positives
= sum((predictions == 1) & (y_val == 0))
fp # Your code here to calculate number of false negatives
= np.sum((predictions == 0) & (y_val == 1))
fn
# calculate precision
= tp / (tp + fp)
prec # calculate recall
= tp / (tp+fn)
rec # calculate F1
= 2 * prec * rec / (prec + rec)
F1
if F1 > best_F1:
= F1
best_F1 = epsilon
best_epsilon
return best_epsilon, best_F1
= multivariate_gaussian(X_val, mu, var)
p_val = select_threshold(y_val, p_val)
epsilon, F1
print('Best epsilon found using cross-validation: %e' % epsilon)
print('Best F1 on Cross Validation Set: %f' % F1)
Best epsilon found using cross-validation: 8.990853e-05
Best F1 on Cross Validation Set: 0.875000
Detect Anomaly
Run anomaly detection code and circle the anomalies in the plot
Real Data Test
Now, we will run the anomaly detection algorithm that you implemented on a more realistic and much harder dataset.
In this dataset, each example is described by 11 features, capturing many more properties of your compute servers.
Letβs start by loading the dataset.
- The
load_data()
function shown below loads the data into variablesX_train_high
,X_val_high
andy_val_high
_high
is meant to distinguish these variables from the ones used in the previous part- We will use
X_train_high
to fit Gaussian distribution - We will use
X_val_high
andy_val_high
as a cross validation set to select a threshold and determine anomalous vs normal examples
# load the dataset
= load_data_multi() X_train_high, X_val_high, y_val_high
print ('The shape of X_train_high is:', X_train_high.shape)
print ('The shape of X_val_high is:', X_val_high.shape)
print ('The shape of y_val_high is: ', y_val_high.shape)
The shape of X_train_high is: (1000, 11)
The shape of X_val_high is: (100, 11)
The shape of y_val_high is: (100,)
Detect Anomaly
Letβs run the anomaly algorithm on the new dataset. Here are the steps
- Estimate the Gaussian parameters (ππΞΌi and π2πΟi2)
- Evaluate the probabilities for both the training data
X_train_high
from which you estimated the Gaussian parameters, as well as for the the cross-validation setX_val_high
. - Finally, it will use
select_threshold
to find the best threshold πΞ΅.
# Apply the same steps to the larger dataset
# Estimate the Gaussian parameters
= estimate_gaussian(X_train_high)
mu_high, var_high
# Evaluate the probabilites for the training set
= multivariate_gaussian(X_train_high, mu_high, var_high)
p_high
# Evaluate the probabilites for the cross validation set
= multivariate_gaussian(X_val_high, mu_high, var_high)
p_val_high
# Find the best threshold
= select_threshold(y_val_high, p_val_high)
epsilon_high, F1_high
print('Best epsilon found using cross-validation: %e'% epsilon_high)
print('Best F1 on Cross Validation Set: %f'% F1_high)
print('# Anomalies found: %d'% sum(p_high < epsilon_high))
Best epsilon found using cross-validation: 1.377229e-18
Best F1 on Cross Validation Set: 0.615385
# Anomalies found: 117