In the previous post about SGD with Momentum,
we discussed how Momentum mimics the behavior of a ball rolling down a hill.
This behavior helps reduce oscillations and accelerates the convergence of the model.
In this post, we will discuss a more efficient variant of Stochastic
Gradient Descent with Momentum, called Nesterov Accelerated Gradient,
which does not blindly follow the current gradient but also takes the future gradient into account.
From now on, I am going to refer to Nesterov Accelerated Gradient as NAG.
Mathematics of NAG
The parameter update rule is expressed as
$$
\begin{aligned}
v_{i,t} &= \gamma v_{i,t-1} + \alpha \nabla J(\theta_i - \gamma v_{i,t-1}) \\
\theta_i &= \theta_i - v_{i,t}
\end{aligned}
$$
where

- $v_{i,t}$ is the momentum vector of the $i$-th parameter at time $t$
- $\gamma$ is the momentum coefficient
- $\alpha$ is the learning rate
- $\nabla J(\theta_i - \gamma v_{i,t-1})$ is the gradient of the cost function at the point $\theta_i - \gamma v_{i,t-1}$, i.e. the lookahead point
- $\theta_i$ is the parameter vector
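To make the lookahead concrete, here is a minimal sketch of the update rule in vector form. The function name `nag_step`, the quadratic example cost, and the hyperparameter values below are placeholders for illustration only, not something taken from this series.

```python
import numpy as np

def nag_step(theta, v, grad_fn, gamma=0.9, alpha=0.01):
    """One NAG update: the gradient is evaluated at the lookahead point, not at theta."""
    lookahead = theta - gamma * v                  # peek ahead along the momentum direction
    v = gamma * v + alpha * grad_fn(lookahead)     # accumulate momentum using the lookahead gradient
    return theta - v, v                            # updated parameters and momentum

# Toy example: minimize J(theta) = ||theta||^2 / 2, whose gradient is theta itself
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    theta, v = nag_step(theta, v, grad_fn=lambda t: t)
print(theta)  # close to the minimum at [0, 0]
```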
The gradients of the cost function w.r.t. the intercept $\theta_0$
and the coefficient $\theta_1$ are expressed as the following.
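Here I am assuming the usual simple linear regression setup, with prediction $\hat{y}_j = \theta_0 + \theta_1 x_j$ over $N$ samples and a mean squared error cost scaled by $\frac{1}{2N}$:

$$
\begin{aligned}
\nabla_{\theta_0} J &= \frac{1}{N} \sum_{j=1}^{N} \left(\hat{y}_j - y_j\right) \\
\nabla_{\theta_1} J &= \frac{1}{N} \sum_{j=1}^{N} \left(\hat{y}_j - y_j\right) x_j
\end{aligned}
$$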
Similar to Momentum, NAG also helps reduce oscillations and accelerates the convergence of the model.
However, NAG is more "conscious" and efficient than Momentum because it anticipates the future gradient and therefore converges faster.
Implementation of NAG
First, calculate the lookahead intercept $\theta_0 - \gamma v_{0,t-1}$
and the lookahead coefficient $\theta_1 - \gamma v_{1,t-1}$
so that we can determine the lookahead prediction, as in the sketch below.
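A minimal sketch of this first step, assuming the data lives in NumPy arrays `x` and `y` and reusing the variable names that appear later in this post (`t0_vector` and `t1_vector` for the momentum terms); the toy values are only for illustration:

```python
import numpy as np

# Toy data and state for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0
theta0, theta1 = 0.0, 0.0          # current intercept and coefficient
t0_vector, t1_vector = 0.0, 0.0    # momentum terms for theta0 and theta1
gamma, alpha = 0.9, 0.01           # momentum coefficient and learning rate

# Lookahead intercept and coefficient: step along the accumulated momentum first
lookahead_t0 = theta0 - gamma * t0_vector
lookahead_t1 = theta1 - gamma * t1_vector

# Lookahead prediction of the linear model
lookahead_pred = lookahead_t0 + lookahead_t1 * x
```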
Second, determine the value of the gradient of the cost function at the lookahead point
$\theta_i - \gamma v_{i,t-1}$, that is, $\nabla J(\theta_i - \gamma v_{i,t-1})$.
Basically, it is the same as the gradient of the cost function w.r.t. the parameters $\theta_0$ and $\theta_1$.
The only difference is that we use the lookahead prediction instead of the current prediction.
Notice that the gradient of the cost function w.r.t. the intercept $\theta_0$ is simply
the lookahead prediction error averaged over the samples, so we can reuse it to speed up the computation.
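Continuing the sketch above, and assuming the $\frac{1}{2N}$ mean squared error cost from before, the lookahead gradients boil down to a couple of lines:

```python
# Lookahead prediction error: its mean is also the gradient w.r.t. the intercept
lookahead_error = lookahead_pred - y

grad_t0 = lookahead_error.mean()         # gradient w.r.t. theta0 at the lookahead point
grad_t1 = (lookahead_error * x).mean()   # gradient w.r.t. theta1 at the lookahead point
```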
Third, update the momentum vectors.
For simplicity, we are not going to store the momentum terms of $\theta_0$ and $\theta_1$ in a list.
Instead, we store them in separate variables, t0_vector and t1_vector respectively.
Since the updated t0_vector relies on its previous value, we need to initialize both variables to zero outside the training loop.
Finally, update the parameters by subtracting the updated momentum vectors. Both of these updates are sketched below.
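Continuing the same sketch, the momentum and parameter updates inside the training loop look like this (with `t0_vector` and `t1_vector` initialized to zero beforehand):

```python
# Momentum update: decay the old momentum and add the lookahead gradient step
t0_vector = gamma * t0_vector + alpha * grad_t0
t1_vector = gamma * t1_vector + alpha * grad_t1

# Parameter update: move the actual parameters by the momentum vectors
theta0 = theta0 - t0_vector
theta1 = theta1 - t1_vector
```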
Conclusion
The loss function pathways of SGD, SGD with Momentum, and SGD with Nesterov
From the graph above, we can see that NAG oscillates less than vanilla SGD and SGD with Momentum because it anticipates the future gradient.
The loss pathway of NAG looks more direct and natural, just like a ball rolling down a hill.
Code
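Putting the four steps together, here is a self-contained sketch of NAG for simple linear regression. The function name, default hyperparameters, and toy data are my own placeholders, and for simplicity it uses full-batch gradients rather than per-sample stochastic updates.

```python
import numpy as np

def nag_linear_regression(x, y, alpha=0.01, gamma=0.9, epochs=1000):
    """Fit y ~ theta0 + theta1 * x with Nesterov Accelerated Gradient."""
    theta0, theta1 = 0.0, 0.0
    t0_vector, t1_vector = 0.0, 0.0   # momentum vectors, initialized to zero

    for _ in range(epochs):
        # 1. Lookahead parameters and lookahead prediction
        lookahead_t0 = theta0 - gamma * t0_vector
        lookahead_t1 = theta1 - gamma * t1_vector
        lookahead_pred = lookahead_t0 + lookahead_t1 * x

        # 2. Gradients of the cost at the lookahead point
        error = lookahead_pred - y
        grad_t0 = error.mean()
        grad_t1 = (error * x).mean()

        # 3. Momentum update
        t0_vector = gamma * t0_vector + alpha * grad_t0
        t1_vector = gamma * t1_vector + alpha * grad_t1

        # 4. Parameter update
        theta0 -= t0_vector
        theta1 -= t1_vector

    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0
theta0, theta1 = nag_linear_regression(x, y)
print(theta0, theta1)  # approximately 1.0 and 2.0
```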