Linear Regression is one of many supervised machine learning algorithms, and it is mostly used to predict the value of a continuous variable, as well as to do forecasting.
In other words, it can be used:
to see if one variable can be used to predict another variable.
to see if one variable is correlated with or dependent on another variable.
However, Linear Regression also comes with some limitations, such as:
It assumes that the relationship between the independent variable and the dependent variable is linear. In reality, that relationship is not always linear.
It assumes that the independent variables are not correlated with each other. In reality, they are not always independent.
It is sensitive to outliers, meaning that the presence of outliers can pull the regression line away from the rest of the data.
From the graph above, we can see that the relationship between x and y is linear since the blue line runs from the bottom left to the top right.
That line is called a regression line, and it can be expressed using the following equation.
$$\hat{y} = \theta_0 + \theta_1 x$$
where θ0 is the intercept and θ1 is the coefficient (the slope).
In high school or college, we are used to seeing the equation above written in the following slope-intercept form of a straight line.
$$\hat{y} = b + ax$$
In this post, we are going to use two features from the Iris dataset in scikit-learn: petal width and sepal length.
Plotting it will give us the following visualization.
Iris Dataset Scatter Plot
You should know that the intercept, θ0, is where the regression line crosses the y-axis, that is, the value of ŷ when x = 0.
Whether the line goes up or down depends on θ1 and the data.
If θ0 = 0, our regression line will cross the y-axis at 0.
Expressing the equation as we did above is quite cryptic for people who don't have a strong mathematical background.
Since we are using the Iris Dataset, we can translate the equation into a more readable form.
$$\text{petal width} = \theta_0 + \theta_1 \times \text{sepal length}$$
The translation above tells us the relationship between those two variables.
Now you might be wondering whether sepal_length and petal_width are correlated or inversely correlated.
First, let's translate what the two graphs below are trying to tell us.
sepal_length and petal_width are said to be correlated when, as sepal_length increases, petal_width also increases.
Conversely, sepal_length and petal_width are said to be inversely correlated when, as sepal_length increases, petal_width decreases.
A regression line helps us predict the y value for a given x value.
However, the predictions made by the regression line are not always accurate,
since its ability to predict depends heavily on θ0 and θ1.
If the values of θ0 and θ1 are not tweaked correctly, the regression line will sit far from most data points.
Estimating the Intercept and Coefficient
Previously, I mentioned that $\hat{y} = \theta_0 + \theta_1 x$ is used to express the regression line.
To be fair, we can't just look at the data and say, "Ah ha! I can tell that θ0 is 0 and θ1 is 0."
Most of the time, it's not feasible to keep guessing those values.
Thus, it's better to use an iterative method such as the Gradient Descent algorithm.
What the Gradient Descent algorithm does is update the values of θ0 and θ1 based on the gradient of the cost function and the learning rate.
Since this example is just a simple linear model, we only need a pair of update equations, one for the intercept and one for the coefficient.
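For a simple linear model with the Mean Squared Error (MSE) cost function used later in this post and a learning rate $\alpha$, the standard updates look like this (the exact scaling varies between implementations; the constant 2 is often folded into the learning rate):

$$\theta_0 := \theta_0 - \alpha \cdot \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$

$$\theta_1 := \theta_1 - \alpha \cdot \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \, x_i$$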
If you want to know more about the Gradient Descent algorithm, you can read the Gradient Descent series here.
The series covers the intuition behind Gradient Descent, the math behind it, and its implementation in Python from scratch.
Example
In this section, there will be two examples of regression lines: one where the regression line is far from most data points, and one where it is close to most data points.
Most data points are far from the regression line
When the regression line, indicated by the green line, sees $x = 0$, it predicts $\hat{y} = -3$.
In reality, $y$ should be 0.5 when $x = 0$, meaning that the predicted value is far from the actual value.
There are many ways to measure the quality of regression lines, such as:
R Squared
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
Mean Absolute Error (MAE)
However, we are going to use Mean Squared Error (MSE) this time.
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
where $n$ is the number of data points, $\hat{y}_i$ is the predicted value, and $y_i$ is the actual value.
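As a quick sketch, this metric only takes a few lines of Python; the numbers below are made up for illustration and are not the data points from the graphs:

```python
import numpy as np

def mse(y_actual, y_predicted):
    """Mean Squared Error between actual and predicted values."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    return np.mean((y_predicted - y_actual) ** 2)

# Hypothetical actual vs. predicted values
print(mse([0.5, 1.0, 1.5, 2.0], [-3.0, 0.8, 1.6, 2.4]))
```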
Let's calculate the MSE for the graph above using its data points.
$$\text{MSE} = \frac{47.59}{5} = 9.518$$
Let's see another example where the data points are close to the regression line.
Most data points are close to the regression line
Let's calculate the MSE for this example to see if the MSE is small when the regression line is close to most data points.
$$\text{MSE} = \frac{1.74}{5} = 0.348$$
From these two examples, we can see that when the regression line is close to most data points, the MSE is small.
Conversely, when the regression line is far from most data points, the MSE is large.
Is having a small MSE enough to say that the regression line is good? Further investigation is needed to answer this question.
However, I am not gonna cover it in this post.
Python Implementation
First, prepare the dataset. We are going to use the Iris dataset from scikit-learn with two features, petal_width and sepal_length.
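A minimal sketch of that preparation, assuming the standard scikit-learn loader (in the bundled dataset, column 0 is sepal length and column 3 is petal width):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
sepal_length = iris.data[:, 0]  # sepal length (cm)
petal_width = iris.data[:, 3]   # petal width (cm)
```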
Note that the regression line is calculated using the following equation.
$$\hat{y} = \theta_0 + \theta_1 x$$
where θ0 is the intercept and θ1 is the coefficient.
In this case, we are going to replace θ0 with b and θ1 with x.
Without any assumptions, we are going to start with b = 0 and x = 0.
Regression line with b=0 and x=0
Looking at the plot above, we can tell that the regression line is far from most data points.
Thus, we need to automatically tweak b and x to get a better regression line using the Gradient Descent algorithm.
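A minimal sketch of such a gradient-descent loop, continuing from the preparation sketch above, could look like the following; the learning rate is an assumption, so the values it converges to may differ slightly from the ones reported below:

```python
learning_rate = 0.01  # assumed value, not necessarily the one used for the figures
iterations = 10_000

b = 0.0  # intercept (theta_0)
x = 0.0  # coefficient (theta_1), following the naming used in this post

mse_history, intercept_history, coefficient_history = [], [], []

for _ in range(iterations):
    y_pred = b + x * sepal_length
    error = y_pred - petal_width

    # Record how the MSE, intercept, and coefficient change over time
    mse_history.append(np.mean(error ** 2))
    intercept_history.append(b)
    coefficient_history.append(x)

    # Gradient of the MSE with respect to the intercept and the coefficient
    # (the constant factor 2 is folded into the learning rate here)
    b -= learning_rate * np.mean(error)
    x -= learning_rate * np.mean(error * sepal_length)
```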
After running the Gradient Descent algorithm above for 10,000 iterations, we get the following regression line, where b = -2.71 and x = 0.67.
Regression line with b=-2.71 and x=0.67
You might notice there are three lists: mse_history, intercept_history, and coefficient_history.
I made them just to see how the MSE, intercept, and coefficient change over time.
MSE, Intercept, and Coefficient over time
I have also plotted the MSE, intercept, and coefficient in 3D space to see how they change over time.
MSE, Intercept, and Coefficient in 3D space
Note that the blue x marks the lowest MSE value calculated with the estimated intercept and coefficient we got from the Gradient Descent algorithm.
To see how I made this graph, you could check out the source code here.
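For a rough idea of how such a figure can be drawn (this is a sketch using matplotlib's 3D axes, not the exact code behind the graph above), the history lists from the loop earlier can be plotted like this:

```python
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

# The path Gradient Descent took through (intercept, coefficient, MSE) space
ax.plot(intercept_history, coefficient_history, mse_history)

# Mark the lowest MSE reached, at the estimated intercept and coefficient
ax.scatter(intercept_history[-1], coefficient_history[-1], mse_history[-1],
           marker="x", color="blue")

ax.set_xlabel("Intercept")
ax.set_ylabel("Coefficient")
ax.set_zlabel("MSE")
plt.show()
```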
Conclusion
Linear Regression is a supervised learning algorithm that is used to predict the value of a continuous variable.
The equation of the regression line is $\hat{y} = \theta_0 + \theta_1 x$.
The regression line is said to be good when it is close to most data points.
The quality of the regression line can be measured using evaluation metrics, one of which is Mean Squared Error (MSE).
The smaller the MSE, the closer the regression line is to most data points. Conversely, the larger the MSE, the farther the regression line is from most data points.
b and x can be estimated using the Gradient Descent algorithm.
Code
Simple Implementation
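Putting the pieces above together, a simple end-to-end version could look like the sketch below; the learning rate is an assumption, so the printed values will only be close to, not identical to, the ones reported here:

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the two features: sepal length (column 0) and petal width (column 3)
iris = load_iris()
sepal_length = iris.data[:, 0]
petal_width = iris.data[:, 3]

b, x = 0.0, 0.0       # intercept and coefficient, both starting at 0
learning_rate = 0.01  # assumed value
iterations = 10_000

for _ in range(iterations):
    y_pred = b + x * sepal_length
    error = y_pred - petal_width

    # Gradient Descent updates for an MSE cost function
    # (the constant factor 2 is folded into the learning rate)
    b -= learning_rate * np.mean(error)
    x -= learning_rate * np.mean(error * sepal_length)

print(b, x)
```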
Printing b and x will give us -2.717366489030271 and 0.6718570469763597.