import math, copy
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
from lab_utils_uni import plt_house_x, plt_contour_wgrad, plt_divergence, plt_gradients
# Load our data set
x_train = np.array([1.0, 2.0])   #features
y_train = np.array([300.0, 500.0])   #target value
Gradient Descent
My thanks and credit to Stanford University & Andrew Ng for their contributions.
An algorithm for finding the minimum of a cost function, used not only for linear regression models but for deep learning models as well. The immediate purpose is to minimize our cost function, but as it turns out it can be used to minimize almost ANY function.
It gives you the values of w & b that minimize the cost, and it works regardless of how many parameters there are.
Recap
Remember the linear model we worked on in the past two pages:
In linear regression, you utilize input training data to fit the parameters w, b by minimizing a measure of the error between our predictions \(f_{w,b}(x^{(i)})\) and the actual data \(y^{(i)}\). The measure is called the cost, \(J(w,b)\). In training you measure the cost over all of our training samples.
The bow/hammock shape we got earlier is characteristic of a linear regression model, but for neural networks the cost surface looks more like the hilly landscape below. Imagine standing on top of one of the hills (dark red) and wanting to make your way down to the lowest value (dark blue). If you move your starting point by just a step or two in some direction and then always travel down the easiest way, you will most likely land at a completely different lowest point than before.
- So you start by choosing a value for w, and on each subsequent step you move a small amount away from the current w to continue your descent
- Here we are assigning the new value to w after we adjust it by \(-\alpha \frac{\partial}{\partial w}J(w,b)\)
- alpha = the learning rate, a small positive number close to zero; it is the size of the steps we take downhill, so we want to take small steps
- The learning rate is always positive. \(\frac{\partial}{\partial w}J(w,b)\) is the slope of \(J(w,b)\), so as the slope gets closer and closer to zero we know we are getting closer to the minimum value.
- What follows alpha is the derivative of the function J, which gives the direction we want to take from the top of the hill
- We assign a new value to b in the same way
- So just as we took baby steps from w on the descent downhill, we do the same for b: we move a small amount from b on each step, and we keep repeating the descent for both w & b until we reach convergence
- Convergence happens when we reach a local minimum and neither w nor b changes much after the next step; that's when we know we are there
- One very important point: you need to update both w & b simultaneously. Notice that in the correct version, the w used in the first line is the pre-updated w, and it is that same pre-updated value that goes into tmp_b as well.
- In the incorrect version, by contrast, the w used to compute tmp_b has already been updated, which is wrong because the two parameters are no longer being updated simultaneously
- So in summary, gradient descent is described by the code below, where parameters w and b are updated simultaneously
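Here is a minimal sketch of ONE simultaneous update step, assuming a compute_gradient function like the one defined in the code section further down:
# Minimal sketch of one simultaneous update step.
# Both partial derivatives are computed from the current (pre-updated) w and b
# before either parameter is changed.
dj_dw, dj_db = compute_gradient(x, y, w, b)
tmp_w = w - alpha * dj_dw   # uses the pre-updated w
tmp_b = b - alpha * dj_db   # gradient here was also computed with the pre-updated w
w = tmp_w                   # only now do w and b actually change
b = tmp_b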
ALPHA - learning rate
If we choose a very small learning rate, gradient descent still works but takes many tiny steps and is slow. In the same way, if we choose a large number for the learning rate we end up overshooting the minimum value and may never converge.
What if we don't have the squared error cost function (which is a single convex bowl shape) shown in the image above, but instead a surface with multiple peaks and local minima, as at the start of this page?
The question is: what does gradient descent do once we have already reached one of those minima?
What we have to understand is that even if we don't vary \({\alpha}\), gradient descent will naturally slow down as we get closer to the minimum. The reason is that the derivative (the slope of \(J\) at the current point on the curve) decreases as the point gets closer to the bottom/minimum, so as the slope gets smaller, the update term \(\alpha \frac{\partial}{\partial w}J(w,b)\) automatically gets smaller too, even if we don't lower the value of \({\alpha}\).
That's why gradient descent can reach a minimum with a fixed value of \({\alpha}\)
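To see this concretely, here is a toy one-parameter example (picked purely for illustration, not our housing cost function): take \(J(w) = w^2\), whose derivative is \(2w\). With a fixed \({\alpha}\) the steps still shrink on their own, because the slope shrinks:
# Toy illustration: J(w) = w**2, so dJ/dw = 2*w (minimum at w = 0).
# alpha stays fixed, yet the step size alpha * grad shrinks as the slope shrinks.
w, alpha = 10.0, 0.1
for step in range(5):
    grad = 2 * w   # slope of J at the current w
    print(f"step {step}: w = {w:8.4f}, step size = {alpha * grad:6.4f}")
    w = w - alpha * grad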
GD for Linear Regression Model
So now let's put the Gradient Descent & Cost Formula together to train a Linear Regression Model and find the best-fit line.
Let's bring all the pieces together here:
Linear Regression Model
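The model from the previous pages is the straight-line fit:
\[f_{w,b}(x^{(i)}) = wx^{(i)} + b\]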
Cost Function
From earlier: this is the same cost function as on the previous page, and we will use the function compute_cost in the code section below just as we did in the previous section.
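For reference, that cost is:
\[J(w,b) = \frac{1}{2m}\sum_{i=0}^{m-1}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2\]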
Gradient Descent Algorithm
To break down the algorithm: working through the math of the derivatives yields equations 4 & 5 below. In the code section below we will use compute_gradient to calculate them.
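Those two equations, which compute_gradient implements, are:
\[\frac{\partial J(w,b)}{\partial w} = \frac{1}{m}\sum_{i=0}^{m-1}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)} \tag{4}\]
\[\frac{\partial J(w,b)}{\partial b} = \frac{1}{m}\sum_{i=0}^{m-1}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right) \tag{5}\]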
Calculus
If you go through the calculus of the derivatives, you'll see this is how we arrive at the formulas above:
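For the derivative with respect to w (the b case is identical except no trailing \(x^{(i)}\) comes out of the chain rule):
\[\frac{\partial J(w,b)}{\partial w} = \frac{\partial}{\partial w}\frac{1}{2m}\sum_{i=0}^{m-1}\left(wx^{(i)} + b - y^{(i)}\right)^2 = \frac{1}{2m}\sum_{i=0}^{m-1}2\left(wx^{(i)} + b - y^{(i)}\right)x^{(i)} = \frac{1}{m}\sum_{i=0}^{m-1}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}\]
Notice how the 2 from the power rule cancels the \(\frac{1}{2}\) in the cost; this is exactly why the cost function was defined with the \(\frac{1}{2m}\) factor.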
GD Algorithm
Combined together, these give us gradient descent, which is implemented in the function gradient_descent in the code section below (it utilizes both compute_gradient and compute_cost):
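The update rule, equation (3), is repeated until convergence, with both assignments performed simultaneously:
\[\text{repeat until convergence:}\;\left\{\begin{aligned} w &= w - \alpha\frac{\partial J(w,b)}{\partial w} \\ b &= b - \alpha\frac{\partial J(w,b)}{\partial b} \end{aligned}\right. \tag{3}\]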
- The first line shows the derivative of the cost function with respect to w (what follows \({\alpha}\))
- The second line shows the derivative of the cost function with respect to b
- Reminder: we have to update w and b simultaneously
- Remember that with multiple local minima you can end up at a different local minimum depending on where you initialize w & b
- With the squared error cost function, the cost will ALWAYS have exactly ONE minimum; such a function is called a convex function, and its local minimum is also the global minimum
How To Use
- So let's start with the linear regression model from a few pages ago and assign w = -0.1, b = 900
- Then we take one step from the blue line to the orange one, and you can see the fitted line change
- Take more steps toward the middle and you see the straight-line fit improve until we reach the global minimum, shown by the mustard-colored line below
Batch Gradient Descent
So when we use the algorithm above to arrive at the mustard line for the housing prices in the dataset, we are in essence using ALL the training examples on every step, not a subset of the data, and that is why it is called batch gradient descent.
Other methods use subsets of the data, but we will use the entire dataset for linear regression models.
Code
Compute Cost
#Function to calculate the cost
def compute_cost(x, y, w, b):
    m = x.shape[0]
    cost = 0

    for i in range(m):
        f_wb = w * x[i] + b
        cost = cost + (f_wb - y[i])**2
    total_cost = 1 / (2 * m) * cost

    return total_cost
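As a quick sanity check (the particular w, b values are just picked for illustration), you can call it directly on our tiny data set:
# Hypothetical spot-check: w=200, b=100 gives 200*1+100=300 and 200*2+100=500,
# matching both targets exactly, so the cost should be 0.0
print(compute_cost(x_train, y_train, 200, 100))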
Compute Gradient
def compute_gradient(x, y, w, b):
    """
    Computes the gradient for linear regression
    Args:
      x (ndarray (m,)): Data, m examples
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters
    Returns
      dj_dw (scalar): The gradient of the cost w.r.t. (with respect to) the parameter w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b
    """
    # Number of training examples
    m = x.shape[0]
    dj_dw = 0
    dj_db = 0

    for i in range(m):
        f_wb = w * x[i] + b
        dj_dw_i = (f_wb - y[i]) * x[i]
        dj_db_i = f_wb - y[i]
        dj_db += dj_db_i
        dj_dw += dj_dw_i
    dj_dw = dj_dw / m
    dj_db = dj_db / m

    return dj_dw, dj_db
plt_gradients(x_train, y_train, compute_cost, compute_gradient)
plt.show()
Above, the left plot shows \(\frac{\partial J(w,b)}{\partial w}\), the slope of the cost curve relative to w, at three points. On the right side of the plot, the derivative is positive, while on the left it is negative. Due to the 'bowl shape', the derivatives will always lead gradient descent toward the bottom where the gradient is zero.
The left plot has fixed b = 100. Gradient descent will utilize both \(\frac{\partial J(w,b)}{\partial w}\) and \(\frac{\partial J(w,b)}{\partial b}\) to update parameters. The 'quiver plot' on the right provides a means of viewing the gradient of both parameters. The arrow sizes reflect the magnitude of the gradient at that point. The direction and slope of each arrow reflects the ratio of \(\frac{\partial J(w,b)}{\partial w}\) and \(\frac{\partial J(w,b)}{\partial b}\) at that point. Note that the gradient points away from the minimum. Review equation (3) above: the scaled gradient is subtracted from the current value of w or b, which moves the parameter in a direction that reduces cost.
Gradient Descent
def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function):
    """
    Performs gradient descent to fit w,b. Updates w,b by taking
    num_iters gradient steps with learning rate alpha

    Args:
      x (ndarray (m,))  : Data, m examples
      y (ndarray (m,))  : target values
      w_in,b_in (scalar): initial values of model parameters
      alpha (float):     Learning rate
      num_iters (int):   number of iterations to run gradient descent
      cost_function:     function to call to produce cost
      gradient_function: function to call to produce gradient
    Returns:
      w (scalar): Updated value of parameter after running gradient descent
      b (scalar): Updated value of parameter after running gradient descent
      J_history (List): History of cost values
      p_history (list): History of parameters [w,b]
    """
    w = copy.deepcopy(w_in)  # avoid modifying global w_in
    # Arrays to store cost J and parameters at each iteration, primarily for graphing later
    J_history = []
    p_history = []
    b = b_in
    w = w_in

    for i in range(num_iters):
        # Calculate the gradient and update the parameters using gradient_function
        dj_dw, dj_db = gradient_function(x, y, w, b)

        # Update parameters using equation (3) above
        b = b - alpha * dj_db
        w = w - alpha * dj_dw

        # Save cost J at each iteration
        if i < 100000:      # prevent resource exhaustion
            J_history.append(cost_function(x, y, w, b))
            p_history.append([w, b])
        # Print cost at intervals: 10 times total, or every iteration if num_iters < 10
        if i % math.ceil(num_iters/10) == 0:
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e}  ",
                  f"w: {w: 0.3e}, b:{b: 0.5e}")

    return w, b, J_history, p_history  # return w and J,w history for graphing
# initialize parameters
w_init = 0
b_init = 0
# some gradient descent settings
iterations = 10000
tmp_alpha = 1.0e-2
# run gradient descent
w_final, b_final, J_hist, p_hist = gradient_descent(x_train, y_train, w_init, b_init, tmp_alpha,
                                                    iterations, compute_cost, compute_gradient)
print(f"(w,b) found by gradient descent: ({w_final:8.4f},{b_final:8.4f})")
- The cost starts large and rapidly declines as described in the slide from the lecture.
- The partial derivatives, dj_dw and dj_db, also get smaller, rapidly at first and then more slowly. As shown in the diagram from the lecture, as the process nears the 'bottom of the bowl' progress is slower due to the smaller value of the derivative at that point.
- Progress slows even though the learning rate, alpha, remains fixed.
Cost vs Iterations
A plot of cost versus iterations is a useful measure of progress in gradient descent. Cost should always decrease in successful runs. The change in cost is so rapid initially, it is useful to plot the initial descent on a different scale than the final descent. In the plots below, note the scale of cost on the axes and the iteration step.
# plot cost versus iteration
fig, (ax1, ax2) = plt.subplots(1, 2, constrained_layout=True, figsize=(12,4))
ax1.plot(J_hist[:100])
ax2.plot(1000 + np.arange(len(J_hist[1000:])), J_hist[1000:])
ax1.set_title("Cost vs. iteration (start)");  ax2.set_title("Cost vs. iteration (end)")
ax1.set_ylabel('Cost')            ;  ax2.set_ylabel('Cost')
ax1.set_xlabel('iteration step')  ;  ax2.set_xlabel('iteration step')
plt.show()
Now that we have found the optimal values for w and b we can predict the housing values.
print(f"1000 sqft house prediction {w_final*1.0 + b_final:0.1f} Thousand dollars")
print(f"1200 sqft house prediction {w_final*1.2 + b_final:0.1f} Thousand dollars")
print(f"2000 sqft house prediction {w_final*2.0 + b_final:0.1f} Thousand dollars")
1000 sqft house prediction 300.0 Thousand dollars
1200 sqft house prediction 340.0 Thousand dollars
2000 sqft house prediction 500.0 Thousand dollars
Contour Plot
- You can show the progress of gradient descent during its execution by plotting the cost over iterations on a contour plot of the cost(w,b).
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
plt_contour_wgrad(x_train, y_train, p_hist, ax)
- The path makes steady (monotonic) progress toward its goal.
- Initial steps are much larger than the steps near the goal.
- If we zoom in, the distance between steps shrinks and the gradient approaches zero.
fig, ax = plt.subplots(1, 1, figsize=(12, 4))
plt_contour_wgrad(x_train, y_train, p_hist, ax, w_range=[180, 220, 0.5], b_range=[80, 120, 0.5],
                  contours=[1, 5, 10, 20], resolution=0.5)
Learning Rate
As we discussed above, if we start with a learning rate that is too large, there is a possibility that we never converge and the solution may even diverge.
# initialize parameters
w_init = 0
b_init = 0
# set alpha to a large value
iterations = 10
tmp_alpha = 8.0e-1
# run gradient descent
w_final, b_final, J_hist, p_hist = gradient_descent(x_train, y_train, w_init, b_init, tmp_alpha,
                                                    iterations, compute_cost, compute_gradient)
Above, w and b are bouncing back and forth between positive and negative with the absolute value increasing with each iteration. Further, each iteration \(\frac{\partial J(w,b)}{\partial w}\) changes sign and cost is increasing rather than decreasing. This is a clear sign that the learning rate is too large and the solution is diverging. Let's visualize this with a plot.
plt_divergence(p_hist, J_hist, x_train, y_train)
plt.show()
Above, the left graph shows w’s progression over the first few steps of gradient descent. w oscillates from positive to negative and cost grows rapidly. Gradient Descent is operating on both w and b simultaneously, so one needs the 3-D plot on the right for the complete picture.