Recall from the SGD with Nesterov post that we could minimize the cost function efficiently, with fewer oscillations, by taking the future gradient into account.
The path to the local minimum resembled a ball rolling down a hill in real life, much more closely than with SGD with Momentum.
However, the learning rate is still fixed for all parameters.
In this post, we will discuss the Adagrad optimization algorithm,
which adapts the learning rate for each parameter.
In other words, each parameter gets its own learning rate:
parameters that are updated frequently experience smaller updates,
while parameters that are updated infrequently experience larger updates.
Mathematics of Adagrad
The parameter update rule is expressed as
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\alpha}{\sqrt{G_{t,ii} + \epsilon}} \odot g_{t,i}$$
$$g_{t,i} = \nabla_{\theta_t} J(\theta_{t,i})$$
where
$\theta_{t+1,i}$ is the $i$-th parameter at time $t+1$
$\alpha$ is the learning rate
$G_{t,ii}$ is the sum of the squares of the gradients with respect to $\theta_i$ up to time $t$
$\epsilon$ is a smoothing term to avoid division by zero
$\odot$ is the element-wise product
$g_{t,i}$ is the gradient of the cost function with respect to the $i$-th parameter at time $t$
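As a quick illustration of the update rule with made-up numbers (not taken from the post's experiment): suppose $\alpha = 0.01$, $\epsilon = 10^{-8}$, the accumulated squared gradient of the intercept is $G_{t,00} = 4$, and the current gradient is $g_{t,0} = 2$. A single Adagrad step for the intercept would then be

$$\theta_{t+1,0} = \theta_{t,0} - \frac{0.01}{\sqrt{4 + 10^{-8}}} \cdot 2 \approx \theta_{t,0} - 0.01$$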
Since we only have two parameters, we are going to need $g_{t,0}$
to represent the gradient of the cost function with respect to the intercept,
and $g_{t,1}$ to represent the gradient of the cost function with respect to the coefficient.
These two can be expressed as follows:
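(Assuming the simple linear regression setup from the earlier posts in this series, $\hat{y}_t = \theta_0 + \theta_1 x_t$ with the squared-error cost $J = \frac{1}{2}(\hat{y}_t - y_t)^2$; the $\frac{1}{2}$ factor is assumed here so that $g_{t,0}$ works out to exactly the prediction error, matching the implementation section below.)

$$g_{t,0} = \frac{\partial J}{\partial \theta_0} = \hat{y}_t - y_t$$
$$g_{t,1} = \frac{\partial J}{\partial \theta_1} = (\hat{y}_t - y_t)\,x_t$$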
Notice that in the parameter update rule, Adagrad eliminates the need to manually
tune the learning rate for each parameter: the learning rate is divided by the
square root of the sum of the squared gradients up to time $t$.
In most cases, the learning rate $\alpha$ can be set to 0.01.
From the equation above, we can see that the learning rate is divided by $\sqrt{G_{t,ii} + \epsilon}$,
which means the effective learning rate shrinks rapidly as $G_{t,ii}$ grows.
Note that $G_{t,ii}$ only grows as the number of iterations increases.
As a result, parameters that are updated infrequently experience larger updates, and vice versa.
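For example, with $\alpha = 0.01$, $\epsilon = 10^{-8}$, and made-up accumulated sums: a parameter whose accumulated squared gradient has reached $G_{t,ii} = 100$ is updated with an effective step size of roughly $0.01/\sqrt{100} = 0.001$, while a parameter with $G_{t,ii} = 1$ keeps an effective step size of roughly $0.01/\sqrt{1} = 0.01$.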
Implementation of Adagrad
First, calculate the intercept and coefficient gradients.
Notice that $g_{t,0}$ is just the prediction error at time $t$.
Second, square the gradients and accumulate them over time to obtain $G_{t,ii}$, the sum of the squared gradients up to time $t$.
Finally, update the intercept and the coefficient.
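Putting the three steps together, here is a minimal sketch of how this could look in code. It assumes a simple linear regression model $\hat{y} = \theta_0 + \theta_1 x$ fitted one sample at a time on made-up data; the data, the number of epochs, and the loop structure are illustrative, while the accumulator names follow the ones referenced in the conclusion below.

```python
import numpy as np

# Made-up toy data, roughly y = 1 + 2x (illustrative only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 8.8])

alpha = 0.01    # learning rate
epsilon = 1e-8  # smoothing term to avoid division by zero
epochs = 1000

intercept, coefficient = 0.0, 0.0
accumulated_squared_intercept_gradient = 0.0
accumulated_squared_coefficient_gradient = 0.0

for _ in range(epochs):
    for x_t, y_t in zip(x, y):
        # First: the gradients. g_t0 is just the prediction error,
        # g_t1 is the prediction error scaled by the input.
        error = (intercept + coefficient * x_t) - y_t
        intercept_gradient = error
        coefficient_gradient = error * x_t

        # Second: accumulate the squared gradients over time (G_t,ii).
        accumulated_squared_intercept_gradient += intercept_gradient ** 2
        accumulated_squared_coefficient_gradient += coefficient_gradient ** 2

        # Finally: the Adagrad parameter updates.
        intercept -= (
            alpha / np.sqrt(accumulated_squared_intercept_gradient + epsilon)
        ) * intercept_gradient
        coefficient -= (
            alpha / np.sqrt(accumulated_squared_coefficient_gradient + epsilon)
        ) * coefficient_gradient

print(intercept, coefficient)
```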
Conclusion
[Figure: Pathways of SGD and Adagrad along the 2D MSE contour.]
You will notice that SGD reaches the bottom of the valley in fewer than 100 iterations, while Adagrad needs many more iterations to get there.
The reason is Adagrad's aggressive learning rate decay over time: the effective learning rate keeps decreasing as the number of iterations increases.
Let's look at the following part of Adagrad's parameter update equation:
$$\frac{\alpha}{\sqrt{G_{t,ii} + \epsilon}}$$
Remember that $G_{t,ii}$ represents accumulated_squared_intercept_gradient and accumulated_squared_coefficient_gradient up to time $t$.
As the number of iterations increases, these accumulated sums keep growing, so the effective learning rate decreases significantly over time.
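To get a feel for how fast this decay is, here is a small sketch that tracks the effective learning rate while assuming, purely for illustration, a constant gradient of 1.0 at every step:

```python
import numpy as np

alpha, epsilon = 0.01, 1e-8
accumulated_squared_gradient = 0.0

for step in range(1, 1001):
    gradient = 1.0  # constant gradient, purely for illustration
    accumulated_squared_gradient += gradient ** 2
    if step in (1, 10, 100, 1000):
        effective_rate = alpha / np.sqrt(accumulated_squared_gradient + epsilon)
        print(step, effective_rate)

# step 1    -> ~0.01
# step 10   -> ~0.0032
# step 100  -> ~0.001
# step 1000 -> ~0.00032
```

Even with a modest constant gradient, the effective learning rate drops by a full order of magnitude after 100 steps.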
For Adagrad to reach the bottom of the valley as well, the number of epochs would need to be set to around 10,000.
Code
References
Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv:1609.04747 (2016).
Rachel Ward, Xiaoxia Wu, and Leon Bottou. Adagrad with SGD: Efficient Learning of Descent Directions. arXiv:1802.09568 (2018).
John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research (2011).