Gradient Descent - Data Science from Scratch: First Principles with Python (2015)
Data Science from Scratch: First Principles with Python (2015)
Chapter 8.Gradient Descent
Those who boast of their descent, brag on what they owe to others.
Seneca
Frequently when doing data science, we’ll be trying to the find the best model for a certain situation.And usually “best” will mean something like “minimizes the error of the model” or “maximizes the likelihood of the data.”In other words, it will represent the solution to some sort of optimization problem.
This means we’ll need to solve a number of optimization problems. And in particular, we’ll need to solve them from scratch. Our approach will be a technique calledgradient descent, which lends itself pretty well to a from-scratch treatment. You might not find it super exciting in and of itself, but it will enable us to do exciting things throughout the book, so bear with me.
The Idea Behind Gradient Descent
Suppose we have some functionfthat takes as input a vector of real numbers and outputs a single real number. One simple such function is:
defsum_of_squares(v):
"""computes the sum of squared elements in v"""
returnsum(v_i**2forv_iinv)
We’ll frequently need to maximize (or minimize) such functions. That is, we need to find the inputvthat produces the largest (or smallest) possible value.
For functions like ours,thegradient(if you remember your calculus, this is the vector of partial derivatives) gives the input direction in which the function most quickly increases. (If you don’t remember your calculus, take my word for it or look it up on the Internet.)
Accordingly, one approach to maximizing a function is to pick a random starting point, compute the gradient, take a small step in the direction of the gradient (i.e., the direction that causes the function to increase the most), and repeat with the new starting point.Similarly, you can try to minimize a function by taking small steps in theoppositedirection, as shown inFigure 8-1.
Figure 8-1.Finding a minimum using gradient descent
NOTE
If a function has a unique global minimum, this procedure is likely to find it. If a function has multiple (local) minima, this procedure might “find” the wrong one of them, in which case you might re-run the procedure from a variety of starting points. If a function has no minimum, then it’s possible the procedure might go on forever.
Estimating the Gradient
Iffis a function of one variable,its derivative at a pointxmeasures howf(x)changes when we make a very small change tox. It is defined as the limit of the difference quotients:
defdifference_quotient(f,x,h):
return(f(x+h)-f(x))/h
ashapproaches zero.
(Many a would-be calculus student has been stymied by the mathematical definition of limit. Here we’ll cheat and simply say that it means what you think it means.)
Figure 8-2.Approximating a derivative with a difference quotient
The derivative is the slope of the tangent line at, while the difference quotient is the slope of the not-quite-tangent line that runs through. Ashgets smaller and smaller, the not-quite-tangent line gets closer and closer to the tangent line (Figure 8-2).
For many functions it’s easy to exactly calculate derivatives. For example, thesquarefunction:
defsquare(x):
returnx*x
has the derivative:
defderivative(x):
return2*x
which you can check — if you are so inclined — by explicitly computing the difference quotient and taking the limit.
What if you couldn’t (or didn’t want to) find the gradient? Although we can’t take limits in Python, we can estimate derivatives by evaluating the difference quotient for a very smalle.Figure 8-3shows the results of one such estimation:
plt.plot(x,map(derivative,x),'rx',label='Actual')# red x
plt.plot(x,map(derivative_estimate,x),'b+',label='Estimate')# blue +
plt.legend(loc=9)
plt.show()
Figure 8-3.Goodness of difference quotient approximation
Whenfis a function of many variables, ithas multiplepartial derivatives, each indicating howfchanges when we make small changes in just one of the input variables.
We calculate itsith partial derivative by treating it as a function of just itsith variable, holding the other variables fixed:
defpartial_difference_quotient(f,v,i,h):
"""compute the ith partial difference quotient of f at v"""
w=[v_j+(hifj==ielse0)# add h to just the ith element of v
forj,v_jinenumerate(v)]
return(f(w)-f(v))/h
after which we can estimate the gradient the same way:
defestimate_gradient(f,v,h=0.00001):
return[partial_difference_quotient(f,v,i,h)
fori,_inenumerate(v)]
NOTE
A major drawback to this “estimate using difference quotients” approach is that it’s computationally expensive. Ifvhas lengthn,estimate_gradienthas to evaluatefon2ndifferent inputs. If you’re repeatedly estimating gradients, you’re doing a whole lot of extra work.
Using the Gradient
It’s easy to see that thesum_of_squaresfunction is smallest when its inputvis a vector of zeroes. But imagine we didn’t know that. Let’s use gradients to find the minimum among all three-dimensional vectors. We’ll just pick a random starting point and then take tiny steps in the opposite direction of the gradient until we reach a point where the gradient is very small:
defstep(v,direction,step_size):
"""move step_size in the direction from v"""
return[v_i+step_size*direction_i
forv_i,direction_iinzip(v,direction)]
defsum_of_squares_gradient(v):
return[2*v_iforv_iinv]
# pick a random starting point
v=[random.randint(-10,10)foriinrange(3)]
tolerance=0.0000001
whileTrue:
gradient=sum_of_squares_gradient(v)# compute the gradient at v
next_v=step(v,gradient,-0.01)# take a negative gradient step
ifdistance(next_v,v)<tolerance:# stop if we're converging
break
v=next_v# continue if we're not
If you run this, you’ll find that it always ends up with avthat’s very close to[0,0,0]. The smaller you make thetolerance, the closer it will get.
Choosing the Right Step Size
Although the rationale for moving against the gradient is clear, how far to move is not.Indeed, choosing the right step size is more of an art than a science. Popular options include:
§ Using a fixed step size
§ Gradually shrinking the step size over time
§ At each step, choosing the step size that minimizes the value of the objective function
The last sounds optimal but is, in practice, a costly computation. We can approximate it by trying a variety of step sizes and choosing the one that results in the smallest value of the objective function:
It is possible that certain step sizes will result in invalid inputs for our function. So we’ll need to create a “safe apply” function that returns infinity (which should never be the minimum of anything) for invalid inputs:
defsafe(f):
"""return a new function that's the same as f,
except that it outputs infinity whenever f produces an error"""
defsafe_f(*args,**kwargs):
try:
returnf(*args,**kwargs)
except:
returnfloat('inf')# this means "infinity" in Python
returnsafe_f
Putting It All Together
In the general case, we have sometarget_fnthat we want to minimize, and we also have itsgradient_fn.For example, thetarget_fncould represent the errors in a model as a function of its parameters, and we might want to find the parameters that make the errors as small as possible.
Furthermore, let’s say we have (somehow) chosen a starting value for the parameterstheta_0. Then we can implement gradient descent as:
target_fn=safe(target_fn)# safe version of target_fn
value=target_fn(theta)# value we're minimizing
whileTrue:
gradient=gradient_fn(theta)
next_thetas=[step(theta,gradient,-step_size)
forstep_sizeinstep_sizes]
# choose the one that minimizes the error function
next_theta=min(next_thetas,key=target_fn)
next_value=target_fn(next_theta)
# stop if we're "converging"
ifabs(value-next_value)<tolerance:
returntheta
else:
theta,value=next_theta,next_value
We called itminimize_batchbecause, for each gradient step, it looks at the entire data set (becausetarget_fnreturns the error on the whole data set). In the next section, we’ll see an alternative approach that only looks at one data point at a time.
Sometimes we’ll instead want tomaximizea function, which we can do by minimizingits negative (which has a corresponding negative gradient):
defnegate(f):
"""return a function that for any input x returns -f(x)"""
As we mentioned before, often we’ll be using gradientdescent to choose the parameters of a model in a way that minimizes some notion of error. Using the previous batch approach, each gradient step requires us to make a prediction and compute the gradient for the whole data set, which makes each step take a long time.
Now, usually these error functions areadditive, which means that the predictive error on the whole data set is simply the sum of the predictive errors for each data point.
When this is the case, we can instead apply a technique calledstochastic gradient descent, which computes the gradient (and takes a step) for only one point at a time. It cycles over our data repeatedly until it reaches a stopping point.
During each cycle, we’ll want to iterate through our data in a random order:
defin_random_order(data):
"""generator that returns the elements of data in random order"""
indexes=[ifori,_inenumerate(data)]# create a list of indexes
random.shuffle(indexes)# shuffle them
foriinindexes:# return the data in that order
yielddata[i]
And we’ll want to take a gradient step for each data point. This approach leaves the possibility that we might circle around near a minimum forever, so whenever we stop getting improvements we’ll decrease the step size and eventually quit:
§ Keep reading! We’ll be using gradient descent to solve problems throughout the rest of the book.
§ At this point, you’re undoubtedly sick of me recommending that you read textbooks. If it’s any consolation,Active Calculusseems nicer than the calculus textbooks I learned from.
§ scikit-learn has aStochastic Gradient Descent modulethat is not as general as ours in some ways and more general in other ways. Really, though, in most real-world situations you’ll be using libraries in which the optimization is already taken care of behind the scenes, and you won’t have to worry about it yourself (other than when it doesn’t work correctly, which one day, inevitably, it won’t).