In the [Batch Gradient Descent] post, we discussed how the intercept and the coefficient are updated only after the algorithm has seen the entire dataset.
In this post, we will discuss the Mini-Batch Gradient Descent (MBGD) algorithm.
MBGD is quite similar to BGD; the only difference is that the parameters are updated after the algorithm has seen a subset of the dataset rather than the entire dataset.
Mathematics of Mini-Batch Gradient Descent
The parameter update rule is expressed as
$$\theta = \theta - \alpha \, \nabla_\theta J(\theta; x_{i:i+n}; y_{i:i+n})$$
where
$\theta$ is the parameter vector
$\alpha$ is the learning rate
$J(\theta; x_{i:i+n}; y_{i:i+n})$ is the cost function evaluated on a mini-batch
$\nabla_\theta J(\theta; x_{i:i+n}; y_{i:i+n})$ is the gradient of the cost function with respect to $\theta$
$x_{i:i+n}$ is the subset (mini-batch) of the dataset
$y_{i:i+n}$ is the corresponding subset of the target variable
The gradients of the cost function with respect to the intercept $\theta_0$ and the coefficient $\theta_1$ are expressed as follows.
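Assuming the cost function is the mean squared error of a simple linear model $\hat{y}_j = \theta_0 + \theta_1 x_j$ computed over a mini-batch of $n$ examples (presumably the same setup as in the BGD post), the gradients take roughly the following form; the constant factor depends on how the MSE is scaled:

```latex
% Assumed mini-batch cost: J(\theta) = \tfrac{1}{n} \sum_{j=i}^{i+n} (\hat{y}_j - y_j)^2,
% with \hat{y}_j = \theta_0 + \theta_1 x_j.
\frac{\partial J}{\partial \theta_0} = \frac{2}{n} \sum_{j=i}^{i+n} \left( \hat{y}_j - y_j \right),
\qquad
\frac{\partial J}{\partial \theta_1} = \frac{2}{n} \sum_{j=i}^{i+n} \left( \hat{y}_j - y_j \right) x_j
```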
First, define the predict and create_batches functions.
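A minimal sketch of these two helpers, assuming NumPy arrays and the simple linear model above (the function signatures are illustrative, not the original implementation):

```python
import numpy as np

def predict(X, theta0, theta1):
    """Predictions of a simple linear model: y_hat = theta0 + theta1 * x."""
    return theta0 + theta1 * X

def create_batches(X, y, batch_size, shuffle=True):
    """Yield (X_batch, y_batch) mini-batches of at most `batch_size` examples."""
    indices = np.arange(len(X))
    if shuffle:
        np.random.shuffle(indices)
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]
```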
Second, split the dataset into mini-batches.
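Continuing the sketch, with a synthetic dataset and a batch size of 64 (both chosen purely for illustration) the split looks like this:

```python
# Synthetic data for illustration: y = 3 + 2x plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=500)
y = 3.0 + 2.0 * X + rng.normal(0.0, 1.0, size=500)

batch_size = 64
batches = list(create_batches(X, y, batch_size))
print(len(batches))  # 8 mini-batches: seven of size 64 and one of size 52
```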
Third, determine the prediction error of each mini-batch and the gradients of the cost function with respect to the intercept $\theta_0$ and the coefficient $\theta_1$.
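Under the MSE assumption above, the prediction error and the two gradients of a single mini-batch can be computed as follows (the helper name compute_gradients is hypothetical, not from the original post):

```python
def compute_gradients(X_batch, y_batch, theta0, theta1):
    # Prediction error of the current mini-batch.
    error = predict(X_batch, theta0, theta1) - y_batch
    n = len(X_batch)
    # Gradients of the mini-batch MSE with respect to theta0 and theta1.
    grad_theta0 = (2.0 / n) * np.sum(error)
    grad_theta1 = (2.0 / n) * np.sum(error * X_batch)
    return grad_theta0, grad_theta1
```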
Lastly, update the intercept $\theta_0$ and the coefficient $\theta_1$.
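Putting the pieces together, the update rule $\theta = \theta - \alpha \nabla_\theta J$ is applied once per mini-batch. The learning rate and epoch count below are illustrative values, not the ones used for the figures in this post:

```python
theta0, theta1 = 0.0, 0.0   # initial intercept and coefficient
alpha = 0.01                # learning rate
epochs = 200

for epoch in range(epochs):
    for X_batch, y_batch in create_batches(X, y, batch_size):
        grad_theta0, grad_theta1 = compute_gradients(X_batch, y_batch, theta0, theta1)
        # theta = theta - alpha * gradient, applied to each parameter.
        theta0 -= alpha * grad_theta0
        theta1 -= alpha * grad_theta1

print(theta0, theta1)  # should end up close to the true values 3 and 2
```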
Conclusion
The change of the regression line over time with a batch size of 64
The effect of batch size on the cost function
From the graph above, we can see that the cost function curve becomes less noisy, i.e. smoother, as the batch size grows.
In practice, a batch size between 50 and 256 is a good range to start from.
However, the best value ultimately depends on the hardware of the machine and the size of the dataset.
The pathway of the cost function over the 2D MSE contour
Unlike with BGD, the MBGD cost function's path follows a zig-zag pattern as it traverses the valley of the MSE contour.
Code
References
Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv:1609.04747 (2016).