In the previous post, we discussed the Stochastic Gradient Descent (SGD) algorithm.
Remember that SGD helps us to minimize the cost function using one random data point at a time.
In some cases, it helps us converge relatively faster, and it reduces RAM usage significantly compared to the Vanilla Gradient Descent algorithm.
However, the SGD algorithm has some limitations.
It oscillates a lot, can still take a long time to converge, and sometimes it may not converge at all because it is unable to escape local minima.
With the help of Momentum, we can overcome these limitations.
The 3D contour of the MSE on the Iris dataset
Imagine rolling a ball down a hill. As it moves, it gains momentum, rolling faster and faster until it reaches
its maximum speed. Clearly, the ball's current speed is determined by its previous speed.
In addition, if the hill's surface is smooth, the ball rolls faster and faster until it reaches the bottom of the hill.
If the surface is rough, the ball oscillates a lot and reaches the bottom of the hill slowly.
The same concept applies to SGD with Momentum.
Mathematics of SGD with Momentum
The parameter update rule is expressed as
v_{i,t} = γ · v_{i,t−1} + α · ∇_{θi} J(θ)
θi = θi − v_{i,t}
where
v_{i,t} is the Momentum vector of the i-th parameter at time step t
γ is the momentum coefficient
α is the learning rate
∇_{θi} J(θ) is the gradient of the cost function J(θ) w.r.t. the parameter θi
θi is the i-th parameter
The gradients of the cost function w.r.t. the intercept θ0 and the coefficient θ1 are expressed as follows.
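As a sketch, assuming the model is a simple linear regression ŷ = θ0 + θ1 · x and the cost is the squared error on the single sampled point (x, y), the per-sample gradients take the familiar form

∇_{θ0} J(θ) = (ŷ − y)
∇_{θ1} J(θ) = (ŷ − y) · x

where ŷ − y is the prediction error for that point.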
With the Momentum coefficient, we smooth out the updates.
If gradients point in the same direction, that is, they share the same sign (positive or negative), the Momentum vector takes larger and larger steps.
If gradients point in different directions, the Momentum coefficient γ will help reduce oscillations.
In most cases, it is set to 0.9.
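As a small illustration (not from the original post), the snippet below accumulates a constant gradient and an alternating gradient with γ = 0.9 and α = 0.1; the values are placeholders chosen only to show the effect.

```python
gamma, alpha = 0.9, 0.1  # illustrative values; 0.9 is the commonly used momentum coefficient

# Same-sign gradients: the velocity keeps growing toward alpha / (1 - gamma) = 1.0
v = 0.0
for _ in range(20):
    v = gamma * v + alpha * 1.0        # gradient is always +1
print(round(v, 3))                     # ~0.878 after 20 steps, still climbing

# Alternating gradients: successive contributions largely cancel each other
v = 0.0
for step in range(20):
    v = gamma * v + alpha * (1.0 if step % 2 == 0 else -1.0)
print(round(v, 3))                     # ~-0.046, so the oscillation is damped
```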
Implementation of SGD with Momentum
First, determine the prediction error and the gradients of the cost function w.r.t. the intercept θ0 and the coefficient θ1.
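A minimal sketch of this step, assuming a simple linear model ŷ = θ0 + θ1 · x, a squared-error cost, and one sampled point, could look like this (all variable names and values below are illustrative, not taken from the original code):

```python
# Current parameter values and one randomly drawn sample (placeholder values)
theta0, theta1 = 0.0, 0.0
x_i, y_i = 5.1, 3.5

# Prediction error for this sample
error = (theta0 + theta1 * x_i) - y_i

# Gradients of the squared-error cost w.r.t. the intercept and the coefficient
grad_theta0 = error
grad_theta1 = error * x_i
```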
Next, we have to update the Momentum vectors.
For simplicity, we are not going to store the vectors of θ0 and θ1 in a list.
Instead, we store them in two separate variables, t0_vector and t1_vector, respectively.
Since the updated t0_vector relies on the previous t0_vector (and likewise for t1_vector), we need to initialize both to zero outside the loop.
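Continuing the sketch above, the momentum-vector update might look like the following; gamma and alpha are assumed hyperparameter names, and the gradient variables come from the previous step:

```python
gamma, alpha = 0.9, 0.01   # assumed momentum coefficient and learning rate

# Initialized to zero once, outside the training loop,
# because each new vector depends on the previous one
t0_vector, t1_vector = 0.0, 0.0

# Inside the loop: scale the previous velocity by gamma and add the current gradient step
t0_vector = gamma * t0_vector + alpha * grad_theta0
t1_vector = gamma * t1_vector + alpha * grad_theta1
```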
Finally, update the parameters.
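The parameters are then moved in the opposite direction of their momentum vectors. Tying the three steps together, a self-contained sketch under the same assumptions (simple linear model, squared-error cost, placeholder hyperparameters) might read:

```python
import numpy as np

def sgd_momentum(x, y, alpha=0.01, gamma=0.9, epochs=50, seed=0):
    """SGD with Momentum for a simple linear model y_hat = theta0 + theta1 * x."""
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    t0_vector, t1_vector = 0.0, 0.0   # momentum vectors, initialized outside the loop

    for _ in range(epochs):
        for i in rng.permutation(len(x)):            # one random sample at a time
            error = (theta0 + theta1 * x[i]) - y[i]  # prediction error
            grad_theta0 = error                      # gradient w.r.t. the intercept
            grad_theta1 = error * x[i]               # gradient w.r.t. the coefficient

            # Momentum update: previous velocity scaled by gamma plus the new gradient step
            t0_vector = gamma * t0_vector + alpha * grad_theta0
            t1_vector = gamma * t1_vector + alpha * grad_theta1

            # Finally, update the parameters
            theta0 -= t0_vector
            theta1 -= t1_vector

    return theta0, theta1
```

Calling sgd_momentum(x, y) on a one-dimensional feature array and target array returns the fitted intercept and slope.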
Conclusion
The loss function pathway of SGD with and without Momentum
From the image above, we can see that the loss function pathway of SGD with Momentum looks much more like a ball rolling down a hill.
On the other hand, Vanilla SGD oscillates a lot while traversing the valley compared to SGD with Momentum.
Code