RMSprop

"Reducing the aggresive learning rate decay in Adagrad using the twin sibling of Adadelta"
04 May 2024

Introduction

In the Adadelta post, we discussed the main limitation of the Adagrad algorithm: the rapid decay of the learning rate. Adadelta solves this limitation by keeping a decaying average of the squared gradients over time.

RMSprop solves the same limitation of Adagrad as Adadelta does. The only difference is that RMSprop keeps the learning rate as a hyperparameter and divides it by the root mean square of the decayed gradients, whereas Adadelta removes the learning rate from the update rule entirely. RMSprop was proposed by Geoffrey Hinton in his Coursera lecture, independently of and around the same time as Adadelta, but it was never formally published.

Mathematics of RMSprop

The parameter update rule is expressed as

\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t

where

  • \theta_t is the parameter at time t
  • \alpha is the learning rate
  • g_t is the gradient of the cost function at time t
  • \sqrt{E[g^2]_t + \epsilon} is the root mean square of the decayed gradients up to time t, with \epsilon a small constant that prevents division by zero
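
To make the update rule concrete, below is a minimal NumPy sketch of a single RMSprop step for a generic parameter vector. The function name rmsprop_step and the variables theta, grad, and avg_sq_grad are illustrative, not part of the implementation later in this post.

import numpy as np

def rmsprop_step(theta, grad, avg_sq_grad, learning_rate=0.01, decay=0.9, eps=1e-8):
  # Decay the running average of the squared gradients, E[g^2]_t.
  avg_sq_grad = decay * avg_sq_grad + (1 - decay) * grad ** 2
  # Scale the learning rate by the root mean square of the decayed gradients.
  theta = theta - learning_rate * grad / np.sqrt(avg_sq_grad + eps)
  return theta, avg_sq_grad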

Here, E[g^2]_t, the decayed average of the squared gradients up to time t, is defined recursively with a decay rate of 0.9:

E[g^2]_t = 0.9 E[g^2]_{t-1} + 0.1 g_t^2

where, in this post's implementation, E[g^2]_{t-1} is approximated by the simple average of all previously recorded squared gradients:

E[g^2]_{t-1} \approx \frac{g_{t-1}^2 + \ldots + g_0^2}{t}
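
As a quick numerical illustration of the recursion, here is a tiny sketch; the gradient values are made up for the example.

gradients = [2.0, 1.0, 0.5]          # hypothetical g_0, g_1, g_2
avg_sq = 0.0
for g in gradients:
  avg_sq = 0.9 * avg_sq + 0.1 * g ** 2
  print(avg_sq)                      # 0.4, then 0.46, then 0.439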

Since there are two parameters, we need to determine two gradients: g_{0,t}, the gradient of the cost function at time t w.r.t. the intercept \theta_0, and g_{1,t}, the gradient of the cost function at time t w.r.t. the coefficient \theta_1.

\begin{aligned} g_{0, t} &= \nabla_{\theta_0} J(\theta) \\ &= \frac{\partial}{\partial \theta_0} J(\theta) \\ &= \frac{1}{2} \frac{\partial}{\partial \theta_0} (\hat{y}_i - y_i)^2 \\ &= \hat{y}_i - y_i \end{aligned}
\begin{aligned} g_{1, t} &= \nabla_{\theta_1} J(\theta) \\ &= \frac{\partial}{\partial \theta_1} J(\theta) \\ &= \frac{1}{2} \frac{\partial}{\partial \theta_1} (\hat{y}_i - y_i)^2 \\ &= (\hat{y}_i - y_i)x_i \end{aligned}
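
As a sanity check, the analytic gradients can be compared against finite-difference approximations of the cost. This is just a quick sketch with made-up values for the parameters and the sample:

# Hypothetical parameter and sample values, purely for illustration.
theta_0, theta_1, x_i, y_i = 0.5, -1.0, 2.0, 3.0
cost = lambda t0, t1: 0.5 * ((t0 + t1 * x_i) - y_i) ** 2

y_hat = theta_0 + theta_1 * x_i
g_0 = y_hat - y_i              # analytic gradient w.r.t. the intercept: -4.5
g_1 = (y_hat - y_i) * x_i      # analytic gradient w.r.t. the coefficient: -9.0

h = 1e-6
g_0_numeric = (cost(theta_0 + h, theta_1) - cost(theta_0 - h, theta_1)) / (2 * h)
g_1_numeric = (cost(theta_0, theta_1 + h) - cost(theta_0, theta_1 - h)) / (2 * h)
# g_0_numeric ≈ -4.5 and g_1_numeric ≈ -9.0, matching the analytic forms.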

Implementation of RMSprop

First, calculate the intercept gradient and the coefficient gradient. Notice that the intercept gradient g_{0,t} is simply the prediction error.

error = prediction - y[random_index]
new_intercept_gradient = error
new_coefficient_gradient = error * x[random_index]

Second, calculate the running average of the squared gradients E[g^2]_t. Here, E[g^2]_{t-1} is approximated by the simple mean of all previously stored squared gradients, np.mean(df['intercept_gradient'].values ** 2).

mean_squared_intercept = (0.9 * np.mean(df['intercept_gradient'].values ** 2)) + (0.1 * new_intercept_gradient ** 2)
mean_squared_coefficient = (0.9 * np.mean(df['coefficient_gradient'].values ** 2)) + (0.1 * new_coefficient_gradient ** 2)
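
Note that the more common formulation carries E[g^2]_{t-1} forward as a single running variable instead of re-averaging the whole gradient history on every step. A sketch of that variant, assuming the two running variables are initialized to zero before the loop:

# Carry the running averages forward rather than recomputing them from df.
mean_squared_intercept = 0.9 * mean_squared_intercept + 0.1 * new_intercept_gradient ** 2
mean_squared_coefficient = 0.9 * mean_squared_coefficient + 0.1 * new_coefficient_gradient ** 2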

Lastly, update the parameters using the values calculated above.

intercept -= (learning_rate / np.sqrt(mean_squared_intercept + eps)) * new_intercept_gradient
coefficient -= (learning_rate / np.sqrt(mean_squared_coefficient + eps)) * new_coefficient_gradient

Conclusion

Pathways of Adagrad, Adadelta, and RMSprop along the 2D MSE contour

From the graph, we can see that Adagrad struggles to reach the bottom of the cost function within 150 iterations, while Adadelta and RMSprop both reach it easily. However, Adadelta's pathway is noticeably noisier; RMSprop is the most stable and direct of the three algorithms.

Code

def rmsprop(x, y, df, epoch=150, learning_rate=0.01, eps=1e-8):
  intercept, coefficient = -0.5, -0.75

  # Initial prediction on a random sample to seed the gradient history in df.
  random_index = np.random.randint(len(x))
  prediction = predict(intercept, coefficient, x[random_index])
  error = prediction - y[random_index]
  mse = (error ** 2) / 2
  df.loc[0] = [intercept, coefficient, error, error * x[random_index], mse]

  for t in range(1, epoch + 1):
    # Pick a random sample (stochastic gradient descent).
    random_index = np.random.randint(len(x))
    prediction = predict(intercept, coefficient, x[random_index])
    error = prediction - y[random_index]

    # Gradients of the cost w.r.t. the intercept and the coefficient.
    new_intercept_gradient = error
    new_coefficient_gradient = error * x[random_index]

    # Running average of the squared gradients, E[g^2]_t, where E[g^2]_{t-1}
    # is approximated by the mean of all previously stored squared gradients.
    mean_squared_intercept = (0.9 * np.mean(df['intercept_gradient'].values ** 2)) + (0.1 * new_intercept_gradient ** 2)
    mean_squared_coefficient = (0.9 * np.mean(df['coefficient_gradient'].values ** 2)) + (0.1 * new_coefficient_gradient ** 2)

    # Parameter update: divide the learning rate by the root mean square.
    intercept -= (learning_rate / np.sqrt(mean_squared_intercept + eps)) * new_intercept_gradient
    coefficient -= (learning_rate / np.sqrt(mean_squared_coefficient + eps)) * new_coefficient_gradient

    mse = (error ** 2) / 2
    df.loc[t] = [intercept, coefficient, new_intercept_gradient, new_coefficient_gradient, mse]

  return df
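
A minimal usage sketch, assuming a simple linear predict helper as in the earlier posts of this series and a DataFrame whose column order matches the rows written inside rmsprop; the toy data below is made up for illustration.

import numpy as np
import pandas as pd

# A linear model helper like this is assumed from the earlier posts.
def predict(intercept, coefficient, x):
  return intercept + coefficient * x

# Toy data: y = 2x + 1 with a little noise, purely illustrative.
rng = np.random.default_rng(42)
x = np.linspace(0, 1, 50)
y = 2 * x + 1 + rng.normal(scale=0.1, size=len(x))

# Column order must match the df.loc assignments inside rmsprop.
df = pd.DataFrame(columns=['intercept', 'coefficient',
                           'intercept_gradient', 'coefficient_gradient', 'mse'])

history = rmsprop(x, y, df)
print(history[['intercept', 'coefficient']].tail())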

References

  1. Sebastian Ruder. "An overview of gradient descent optimization algorithms." arXiv:1609.04747 (2016).