Previously, in the regression and gradient descent series,
I walked readers through writing linear regression algorithms and optimizers from scratch, both mathematically and programmatically.
The objective of this post is to rewrite Logistic Regression and Gradient
Descent in C++ to speed up the computation, and we are going to compare
the speed of the Python and C++ implementations.
First and foremost, we are going to need Eigen, a C++ linear algebra library that plays a role similar to NumPy.
We can install it by running:
Then, we need a CMakeLists.txt file and a new folder called build in the parent working directory.
Most importantly, the eigen3 location has to be included in the CMakeLists.txt file.
Once we have the CMakeLists.txt file, we can run the following commands:
You should see an executable file called main in the build directory.
CSV Reader
First, we are going to define a type called Iris using struct.
I used struct instead of class because every member of a struct is public by default.
Besides, we only want to hold values, so I think a struct is good enough and avoids overcomplicating the code.
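As a rough illustration, a struct like this could look as follows. The field names and the petal_area() helper are assumptions made for this sketch, not necessarily what the original code uses.

```cpp
#include <string>

// Plain data holder for one row of the Iris dataset.
// Field names are assumed for illustration.
struct Iris {
    double sepal_length;
    double sepal_width;
    double petal_length;
    double petal_width;
    std::string species;
};

// Members of a struct are public by default, so they can be
// read directly without getters or setters.
// petal_area is a hypothetical helper, only here to show member access.
inline double petal_area(const Iris& flower) {
    return flower.petal_length * flower.petal_width;
}
```

With this definition, a row can be created with brace initialization, for example `Iris flower{5.1, 3.5, 1.4, 0.2, "setosa"};`.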
Then, we would need a class called IrisReader.
There are two things we should pay attention to: the constructor and the destructor.
The constructor is a member function that runs when an object of the class is created.
The destructor is a member function that runs when the object goes out of scope,
that is, when it reaches the end of its lifetime.
The IrisReader class takes a CSV file when it is initialized.
When the load() function is called, it reads the CSV file line by line and stores the values in a vector called flowers.
Finally, it closes the file when the object reaches the end of its lifetime.
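The behaviour described above can be sketched roughly like this. It assumes the CSV file has a header row followed by four numeric columns and a species column. The Iris field names and the file_ member are also assumptions of this sketch.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Field names are assumed for illustration.
struct Iris {
    double sepal_length, sepal_width, petal_length, petal_width;
    std::string species;
};

class IrisReader {
public:
    // Constructor: runs when the object is created and opens the file.
    explicit IrisReader(const std::string& path) : file_(path) {
        if (!file_.is_open()) {
            throw std::runtime_error("cannot open " + path);
        }
    }

    // Destructor: runs when the object goes out of scope and closes the file.
    ~IrisReader() {
        if (file_.is_open()) {
            file_.close();
        }
    }

    // Reads the file line by line and appends one Iris per row to flowers.
    void load() {
        std::string line;
        std::getline(file_, line);  // skip the (assumed) header row
        while (std::getline(file_, line)) {
            if (line.empty()) continue;
            std::stringstream row(line);
            std::string cell;
            // Pulls the next comma-separated cell and converts it to double.
            auto next_number = [&row, &cell]() {
                std::getline(row, cell, ',');
                return std::stod(cell);
            };
            Iris flower;
            flower.sepal_length = next_number();
            flower.sepal_width = next_number();
            flower.petal_length = next_number();
            flower.petal_width = next_number();
            std::getline(row, flower.species, ',');
            flowers.push_back(flower);
        }
    }

    std::vector<Iris> flowers;

private:
    std::ifstream file_;
};
```

Because the file is closed in the destructor, the caller never has to remember to close it. It happens automatically when the reader goes out of scope.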
To see if it's working, initialize an IrisReader object and call the load() function.
Since we are working on Logistic Regression, which is good at binary classification, we are going to modify the target values.
If the target value is setosa then it's going to be 0, otherwise 1. This is how you prepare the dataset in Python:
Implementation
C++
As for the Python implementation, please refer to this Logistic Regression implementation post
that I have written previously. In this post, we are going to focus more on the C++ implementation.
First, let's define a class called LogisticRegression with its constructor and destructor.
Remember that the constructor is a member function that runs when an object of the class is created,
while the destructor is a member function that runs when the object goes out of scope.
Now, we should make fit(), predict(), and loss()
so that users can interact with the model object easily.
Let's extend the class above.
After that, we need to create three private functions for the class:
sigmoid(), threshold(), and optimize().
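To make these pieces concrete, here is a minimal sketch of how such a class could fit together. It is not the code from main.cpp: it uses std::vector instead of Eigen so that it compiles on its own. The constructor parameters (learning_rate, iterations), the Row alias, and the probability() helper are assumptions of this sketch, and optimize() is assumed to perform one step of batch gradient descent.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Row = std::vector<double>;  // one sample (assumed alias)

class LogisticRegression {
public:
    LogisticRegression(double learning_rate, int iterations)
        : learning_rate_(learning_rate), iterations_(iterations), bias_(0.0) {}
    ~LogisticRegression() = default;

    // Estimates weights and bias; X must be non-empty.
    void fit(const std::vector<Row>& X, const std::vector<int>& y) {
        weights_.assign(X.front().size(), 0.0);
        bias_ = 0.0;
        for (int i = 0; i < iterations_; ++i) optimize(X, y);
    }

    // Returns a 0/1 label for each row.
    std::vector<int> predict(const std::vector<Row>& X) const {
        std::vector<int> labels;
        for (const Row& x : X) labels.push_back(threshold(probability(x)));
        return labels;
    }

    // Mean binary cross-entropy.
    double loss(const std::vector<Row>& X, const std::vector<int>& y) const {
        const double eps = 1e-12;  // keeps log() away from zero
        double total = 0.0;
        for (std::size_t i = 0; i < X.size(); ++i) {
            const double p = probability(X[i]);
            total -= y[i] * std::log(p + eps) + (1 - y[i]) * std::log(1.0 - p + eps);
        }
        return total / X.size();
    }

private:
    static double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }
    static int threshold(double p) { return p >= 0.5 ? 1 : 0; }

    // Predicted probability of class 1 for one sample.
    double probability(const Row& x) const {
        double z = bias_;
        for (std::size_t j = 0; j < x.size(); ++j) z += weights_[j] * x[j];
        return sigmoid(z);
    }

    // One batch gradient descent step on the cross-entropy loss.
    void optimize(const std::vector<Row>& X, const std::vector<int>& y) {
        Row grad_w(weights_.size(), 0.0);
        double grad_b = 0.0;
        for (std::size_t i = 0; i < X.size(); ++i) {
            const double error = probability(X[i]) - y[i];
            for (std::size_t j = 0; j < grad_w.size(); ++j) grad_w[j] += error * X[i][j];
            grad_b += error;
        }
        const double n = static_cast<double>(X.size());
        for (std::size_t j = 0; j < weights_.size(); ++j) {
            weights_[j] -= learning_rate_ * grad_w[j] / n;
        }
        bias_ -= learning_rate_ * grad_b / n;
    }

    double learning_rate_;
    int iterations_;
    double bias_;
    Row weights_;
};
```

Keeping sigmoid(), threshold(), and optimize() private means users only see fit(), predict(), and loss(), so the internals can change without affecting calling code.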
The C++ implementation consists of only one file: main.cpp.
I could have separated the implementation into multiple files,
but I wanted to keep it simple and easy to understand
for the sake of experimentation.
Python
Here is the structure of the Python implementation:
Let's implement the LogisticRegression class
in logistic_regression.py.
Comparison
Time
To measure the execution time of the Python implementation, I made a Python function decorator called time_it
in logistic_regression.py.
To measure the execution time of the C++ implementation,
we can use chrono from the standard library.
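A small timing helper built on chrono might look like the sketch below. The name time_ms is hypothetical and is not taken from the original code. It plays a role loosely similar to the time_it decorator on the Python side.

```cpp
#include <chrono>
#include <utility>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
// time_ms is a hypothetical helper name used only for this sketch.
template <typename Callable>
double time_ms(Callable&& work) {
    // steady_clock never jumps backwards, which makes it suitable for intervals.
    const auto start = std::chrono::steady_clock::now();
    std::forward<Callable>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

The code to be measured can be wrapped in a lambda, for example `double elapsed = time_ms([&] { model.fit(X, y); });`, where model, X, and y stand for whatever objects are being benchmarked.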
Lines of Code
To determine the number of lines of code (LoC),
we can use the following commands:
# For C++
find . -name '*.cpp' | xargs wc -l
# For Python
find . -name '*.py' | xargs wc -l
C++
Python
It takes 171 LoC to implement Logistic Regression in C++,
while it takes only 59 LoC to implement it in Python.
Conclusion
In this post, we have implemented Logistic Regression both
in C++ and Python. From the execution time graph,
we can see that the C++ execution time is less than one third of the Python execution time up to
10,000,000 iterations. Beyond that, the execution times of the
two implementations differ by only about ±1/3.
Besides the insignificant difference in execution time,
it takes relatively longer to write C++ code since
programmers must declare the data types of all functions and variables.
Not only that, but the C++ code is also more verbose than the Python code,
and a little knowledge of CMake is required to compile it.
With C++, we might end up needing a lot of prerequisite knowledge and
dealing with other overheads.
Due to the simplicity of Python, we can see why most of
the Machine Learning libraries provide a Python interface.
It's easier to make prototypes and estimate model
parameters in Python by giving up a little bit of performance.