In the Batch Gradient Descent and
Mini-Batch Gradient Descent posts,
we learned that those algorithms update the parameters θ0 and θ1 after seeing the entire dataset and a fraction of the dataset, respectively.
Both algorithms therefore need enough RAM to hold the entire dataset, or at least a batch of it, in memory for training to converge.
In this post, we are going to learn about Stochastic Gradient Descent, or SGD for short.
The only difference between SGD and the previous algorithms we have covered is that,
as the name suggests, it updates the parameters after seeing a single, randomly chosen data point.
Since each update requires only one data point, far less RAM is needed, and the algorithm can converge faster while using a minimal amount of memory.
Mathematics of Stochastic Gradient Descent
The parameter update rule is expressed as
θ = θ − α∇θJ(θ)
where
θ is the parameter vector
α is the learning rate
J(θ) is the cost function
∇θJ(θ) is the gradient of the cost function J(θ)
The gradients of the cost function with respect to the intercept θ0
and the coefficient θ1 are expressed as follows.
∇θ0J(θ) = (y^i − yi)
∇θ1J(θ) = (y^i − yi) ⋅ xi
Notice that the gradient with respect to the intercept is simply the prediction error, and the gradient with respect to the coefficient is the prediction error scaled by the input xi.
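To make the update concrete, here is a minimal Python sketch of a single SGD step for one data point, assuming the simple linear model y^i = θ0 + θ1 ⋅ xi; the names sgd_step, alpha, x_i, and y_i are illustrative choices rather than anything from the post.

```python
def sgd_step(theta0, theta1, x_i, y_i, alpha):
    """One SGD update using a single data point (x_i, y_i)."""
    y_hat = theta0 + theta1 * x_i    # prediction for the sampled point
    error = y_hat - y_i              # prediction error (y^i - yi)
    theta0 -= alpha * error          # theta0 = theta0 - alpha * ∇θ0J(θ)
    theta1 -= alpha * error * x_i    # theta1 = theta1 - alpha * ∇θ1J(θ)
    return theta0, theta1
```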
First, initialize the intercept θ0 and the coefficient θ1.
Second, draw a random index between 0 and the
length of the dataset and predict the value for the corresponding xi.
Third, determine the error of the prediction.
Lastly, update the intercept θ0 and the coefficient θ1 using the update rule above, and repeat until the parameters converge. These steps are sketched in code below.
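A minimal sketch of these steps as a training loop, again assuming the linear model y^ = θ0 + θ1 ⋅ x and plain Python lists; the function name train_sgd, the zero initialization, and the default hyperparameters are illustrative assumptions.

```python
import random

def train_sgd(xs, ys, alpha=0.01, epochs=100, seed=42):
    """Fit y ≈ theta0 + theta1 * x with stochastic gradient descent."""
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0                  # step 1: initialize the parameters
    for _ in range(epochs):
        for _ in range(len(xs)):
            i = rng.randrange(len(xs))         # step 2: pick a random index
            y_hat = theta0 + theta1 * xs[i]    #         and predict for that point
            error = y_hat - ys[i]              # step 3: prediction error
            theta0 -= alpha * error            # step 4: update the intercept
            theta1 -= alpha * error * xs[i]    #         and the coefficient
    return theta0, theta1
```

Because each update only touches one (xi, yi) pair, only that pair needs to be in memory at any given time, which is where the memory advantage described earlier comes from.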
Conclusion
SGD vs BGD Pathways
Notice that the pathway of Vanilla Gradient Descent is much smoother,
while the SGD pathway is noticeably thicker than the Vanilla Gradient Descent pathway.
If we zoom in, we can see that the SGD pathway is much noisier.
Zoomed in SGD vs BGD Pathways
There are pros and cons for Vanilla Gradient Descent and SGD.
Vanilla Gradient Descent is slow, but it is guaranteed to converge to the global minimum as long as the cost function is convex, as it is for linear regression.
In addition, it requires more memory and is not suitable for large datasets, since the entire dataset has to be loaded into RAM before training.
On the other hand, SGD is faster and is suitable for large datasets, since only a single data point has to be loaded into RAM at a time.
However, it is not guaranteed to settle exactly at the global minimum, since it updates the parameters θ0 and θ1 after seeing a single random data point, which makes the updates noisy.
Not only that, it can struggle to escape local minima and to navigate the steep regions visible in the 3D plot of the cost function.
Code
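As a stand-in for a full listing, here is how the train_sgd sketch above could be exercised end to end; the synthetic data (true intercept 2, coefficient 3, small Gaussian noise) and the hyperparameters are assumptions chosen purely for demonstration.

```python
import random

# Illustrative synthetic data: y = 2 + 3x plus a little Gaussian noise
# (the true values 2 and 3 are assumptions for this demo).
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 0.1) for x in xs]

# Fit with the train_sgd sketch defined earlier in the post.
theta0, theta1 = train_sgd(xs, ys, alpha=0.01, epochs=100)
print(f"intercept ~ {theta0:.3f}, coefficient ~ {theta1:.3f}")  # should land near 2 and 3
```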
References
Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv:1609.04747 (2016).