In the Adadelta post, we discussed the main limitation of the Adagrad algorithm: the rapid decay of the learning rate.
Adadelta addresses this limitation by keeping a decaying average of the squared gradients over time.
RMSprop solves the same limitation of Adagrad in much the same way as Adadelta.
The only difference is that RMSprop keeps an explicit learning rate and divides it by the root mean square of the decayed gradients.
This algorithm was proposed by Geoffrey Hinton independently of, and around the same time as, Adadelta.
However, the algorithm was never formally published; Hinton introduced it in a lecture of his Coursera course.
Mathematics of RMSprop
The parameter update rule is expressed as
$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t$$
where
$\theta_t$ is the parameter at time $t$,
$\alpha$ is the learning rate,
$g_t$ is the gradient of the cost function at time $t$,
$\sqrt{E[g^2]_t + \epsilon}$ is the root mean square of the decayed gradients up to time $t$, with $\epsilon$ a small constant added for numerical stability.
Collectively, $E[g^2]_t$, the decaying average of the squared gradients up to time $t$, can be expressed as follows:

$$E[g^2]_t = \gamma \, E[g^2]_{t-1} + (1 - \gamma) \, g_t^2$$

where $\gamma$ is the decay rate.
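To make the two formulas concrete, here is a minimal sketch of a single RMSprop step in Python. The function name, the default hyperparameter values, and the decay rate of 0.9 are assumptions for illustration, not values taken from the post.

```python
import numpy as np

def rmsprop_step(theta, grad, avg_sq_grad, alpha=0.01, gamma=0.9, eps=1e-8):
    # Decaying average of the squared gradients:
    # E[g^2]_t = gamma * E[g^2]_{t-1} + (1 - gamma) * g_t^2
    avg_sq_grad = gamma * avg_sq_grad + (1.0 - gamma) * grad ** 2
    # Divide the learning rate by the root mean square of the decayed gradients
    theta = theta - alpha / np.sqrt(avg_sq_grad + eps) * grad
    return theta, avg_sq_grad
```

The caller carries one running average $E[g^2]$ per parameter and passes it back in on the next iteration.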
Since there are two parameters, we need to determine two gradients:
$g_{0,t}$ is the gradient of the cost function at time $t$ w.r.t. the intercept $\theta_0$,
and $g_{1,t}$ is the gradient of the cost function at time $t$ w.r.t. the coefficient $\theta_1$.
First, calculate the intercept gradient and the coefficient gradient.
Notice that the intercept gradient $g_{0,t}$ is the prediction error.
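As a sketch of this first step, assuming a mean squared error cost scaled by one half and small synthetic x and y arrays that are not taken from the post:

```python
import numpy as np

# Hypothetical data for a simple linear regression y ≈ theta_0 + theta_1 * x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

theta_0, theta_1 = 0.0, 0.0      # intercept and coefficient

y_hat = theta_0 + theta_1 * x    # predictions
error = y_hat - y                # prediction errors

# Gradients of the (1/2)*MSE cost at time t
g_0 = error.mean()               # intercept gradient g_{0,t}: the mean prediction error
g_1 = (error * x).mean()         # coefficient gradient g_{1,t}
```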
Second, calculate the running average of the squared gradients $E[g^2]_t$.
In code, $E[g^2]_{t-1}$ can be written as np.mean(df['intercept_gradient'].values ** 2).
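A minimal sketch of this step for the intercept follows; the column name mirrors the DataFrame bookkeeping mentioned above, while the decay rate and the stored gradient values are assumptions for illustration:

```python
import numpy as np
import pandas as pd

gamma = 0.9                          # assumed decay rate

# Hypothetical history of intercept gradients from step one
df = pd.DataFrame({'intercept_gradient': [-6.0, -4.8, -3.9]})
g_0 = -3.1                           # hypothetical current intercept gradient g_{0,t}

# E[g^2]_{t-1}: the average of the stored squared gradients, as in the line above
avg_sq_prev = np.mean(df['intercept_gradient'].values ** 2)

# E[g^2]_t = gamma * E[g^2]_{t-1} + (1 - gamma) * g_{0,t}^2
avg_sq_g0 = gamma * avg_sq_prev + (1.0 - gamma) * g_0 ** 2
```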
Lastly, update the parameters using the quantities computed above.
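Putting the three steps together, here is a minimal end-to-end sketch of RMSprop on a two-parameter linear regression. The synthetic data, the learning rate, the decay rate, and the variable names are illustrative assumptions rather than the post's notebook code.

```python
import numpy as np

# Hypothetical data for y ≈ theta_0 + theta_1 * x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

alpha, gamma, eps = 0.1, 0.9, 1e-8   # assumed learning rate, decay rate, epsilon
theta_0, theta_1 = 0.0, 0.0          # intercept and coefficient
avg_sq_g0, avg_sq_g1 = 0.0, 0.0      # E[g^2] for each parameter

for t in range(150):
    # Step one: gradients of the (1/2)*MSE cost
    error = theta_0 + theta_1 * x - y
    g_0 = error.mean()               # intercept gradient (mean prediction error)
    g_1 = (error * x).mean()         # coefficient gradient

    # Step two: decaying average of the squared gradients
    avg_sq_g0 = gamma * avg_sq_g0 + (1.0 - gamma) * g_0 ** 2
    avg_sq_g1 = gamma * avg_sq_g1 + (1.0 - gamma) * g_1 ** 2

    # Step three: divide the learning rate by the root mean square and update
    theta_0 -= alpha / np.sqrt(avg_sq_g0 + eps) * g_0
    theta_1 -= alpha / np.sqrt(avg_sq_g1 + eps) * g_1

print(theta_0, theta_1)              # approaches the least-squares fit of the synthetic data
```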
Conclusion
Figure: Pathways of Adagrad, Adadelta, and RMSprop along the 2D MSE contour
From the graph, we can see that Adagrad struggles to reach the bottom of the cost function within 150 iterations.
Unlike Adagrad, Adadelta and RMSprop reach the bottom of the cost function easily.
However, Adadelta's pathway is noisier than RMSprop's; RMSprop follows the most stable and direct path of the three algorithms.
Code
References
Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv:1609.04747 (2016).