BSR

Machine Learning

30 June 2024

Regression Algorithms in C++ and Python

Introduction

Previously in the regression and gradient descent series, I walked readers through writing linear regression algorithms and optimizers from scratch, both mathematically and programmatically.

The objective of this post is to rewrite Logistic Regression and Gradient Descent in C++ to speed up the computation, and then to compare the speed of the Python and C++ implementations.

First and foremost, we are going to need a linear algebra library called Eigen, which plays a role similar to NumPy's. On Debian-based systems, we can install it by running:

sudo apt install libeigen3-dev
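To verify the installation, we can compile a small sanity check. The snippet below is a hypothetical standalone example, not part of the final project; it just multiplies a matrix by itself, roughly the Eigen equivalent of np.dot in NumPy.

#include <iostream>
#include <Eigen/Dense>

using Eigen::MatrixXf;

int main()
{
    // build a 2x2 matrix and multiply it by itself
    MatrixXf m(2, 2);
    m << 1, 2,
         3, 4;

    std::cout << m * m << std::endl;
    return 0;
}

Since Eigen is header-only, this compiles without linking anything: g++ -I /usr/include/eigen3 check.cpp -o check (the file name check.cpp is arbitrary).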

Then, we need a CMakeLists.txt file and a new folder called build in the project directory. Most importantly, the Eigen3 include path has to be added to the CMakeLists.txt file.

cmake_minimum_required(VERSION 3.10)

# the compiler must be chosen before project() for CMake to pick it up
set(CMAKE_CXX_COMPILER "g++")

project(MachineLearning)

# these variables only affect targets defined after them
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_VERBOSE_MAKEFILE ON)

set(EIGEN_INCLUDE_DIR /usr/include/eigen3)
include_directories(${EIGEN_INCLUDE_DIR})

add_executable(main main.cpp)

Once we have the CMakeLists.txt file, we can run the following commands:

cd build
cmake ..
make

You should see a main executable in the build directory. Then run it:

./main

CSV Reader

First, we are going to make a plain data holder called Iris using a struct.

struct Iris
{
    float sepal_length;
    float sepal_width;
    float petal_length;
    float petal_width;
    float target;
};

The reason I used a struct instead of a class is that struct members are public by default, and all we want to do is hold values. I personally think a struct is good enough here without overcomplicating the code.
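For comparison, here is what the same data holder would look like as a class. This snippet only illustrates the default-visibility difference; it is not code we will use, and the name IrisAsClass is made up for the example.

// equivalent to the Iris struct above; a class defaults to
// private members, so the access specifier must be spelled out
class IrisAsClass
{
public:
    float sepal_length;
    float sepal_width;
    float petal_length;
    float petal_width;
    float target;
};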

Then, we would need a class called IrisReader.

#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

using namespace std;

class IrisReader
{
public:
    vector<Iris> flowers;
    ifstream file;

    // constructor: open the CSV file
    IrisReader(const string &filename)
    {
        file.open(filename);
    }

    // destructor: close the file when the object goes out of scope
    ~IrisReader()
    {
        file.close();
    }

    vector<Iris> load()
    {
        string line;
        while (getline(file, line))
        {
            Iris iris;
            int parsed = sscanf(
                line.c_str(),
                "%f,%f,%f,%f,%f",
                &iris.sepal_length,
                &iris.sepal_width,
                &iris.petal_length,
                &iris.petal_width,
                &iris.target
            );

            // skip lines that don't parse into five floats,
            // such as the CSV header row
            if (parsed == 5)
            {
                flowers.push_back(iris);
            }
        }

        return flowers;
    }
};

There are two things we should pay attention to: the constructor and the destructor. The constructor is a member function that runs when the object is initialized. The destructor is a member function that runs when the object goes out of scope, i.e., when it has reached the end of its lifetime.
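To see the order in which these member functions run, here is a tiny standalone sketch (hypothetical, for illustration only):

#include <iostream>

struct Demo
{
    Demo()  { std::cout << "constructor: object initialized\n"; }
    ~Demo() { std::cout << "destructor: object went out of scope\n"; }
};

int main()
{
    {
        Demo d; // constructor message prints here
    }           // d goes out of scope here: destructor message prints
    std::cout << "after the inner scope\n";
    return 0;
}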

What the IrisReader class does is read the CSV file whose name is passed in at initialization. When load() is called, it reads the file line by line, skips any line that does not parse (such as the header row), and stores the values in a vector called flowers. Finally, the file is closed when the object goes out of scope.

To see if it's working, initialize an IrisReader object and call the load() function.

int main() {
    IrisReader reader("iris.csv");
    vector<Iris> iris = reader.load();
 
    for (Iris &i : iris) {
        cout << i.sepal_length << " "
            << i.sepal_width << " "
            << i.petal_length << " "
            << i.petal_width << " "
            << i.target << endl;
    }
 
    return 0;
}
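Assuming iris.csv has already been generated (see the preparation step below), the first line of output should look roughly like this, based on the standard iris dataset; the exact formatting may differ:

5.1 3.5 1.4 0.2 1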

Since we are working on Logistic Regression, which is suited to binary classification, we are going to modify the target values: if the flower is setosa, the target becomes 1, otherwise 0. This is how you prepare the dataset in Python:

import pandas as pd
from sklearn.datasets import load_iris
 
iris = load_iris()
 
X = iris.data
y = iris.target
 
is_setosa = [1.0 if i == 0 else 0.0 for i in y]
 
df = pd.DataFrame(data=X, columns=iris.feature_names)
df['is_setosa'] = is_setosa

df.to_csv("./iris.csv", index=False)

Implementation

C++

As for the Python implementation, please refer to the Logistic Regression implementation post I wrote previously. In this post, we are going to focus on the C++ implementation.

First, let's define a class called LogisticRegression with its constructor and destructor.

Remember, the constructor is a member function that runs when the object is created, while the destructor runs when the object goes out of scope.

class LogisticRegression
{
public:
    float learning_rate;
    int iterations;
    float intercept = 0.0;
    Vector4f coefficients = Vector4f::Zero();

    // constructor
    LogisticRegression(float learning_rate = 0.01, int iterations = 100)
        : learning_rate(learning_rate), iterations(iterations)
    {
    }

    // destructor: print the fitted parameters when the model goes out of scope
    ~LogisticRegression()
    {
        cout << "Intercept: "
            << intercept
            << endl;

        cout << "Coefficients: "
            << "["
            << coefficients(0) << ", "
            << coefficients(1) << ", "
            << coefficients(2) << ", "
            << coefficients(3)
            << "]"
            << endl;
    }
};

Now, we should make fit(), predict(), and loss() so that users can interact with the model object easily. Let's extend the class above.

class LogisticRegression
{
public:
    ...
    void fit(vector<Iris> flowers)
    {
        int rows = flowers.size();
        MatrixXf X(rows, 4);
        VectorXf y(rows);

        for (int i = 0; i < rows; i++)
        {
            X(i, 0) = flowers[i].sepal_length;
            X(i, 1) = flowers[i].sepal_width;
            X(i, 2) = flowers[i].petal_length;
            X(i, 3) = flowers[i].petal_width;
            y(i) = flowers[i].target;
        }

        optimize(X, y);
    }

    RowVectorXf predict(float intercept, VectorXf coefficients, MatrixXf features)
    {
        // intercept + X * w, then squash and threshold each value
        RowVectorXf predictions = RowVectorXf::Constant(features.rows(), intercept)
            + (features * coefficients).transpose();

        for (float &prediction : predictions)
        {
            prediction = threshold(sigmoid(prediction));
        }

        return predictions;
    }

    float loss(RowVectorXf predictions, RowVectorXf targets)
    {
        // mean squared error, matching the Python loss_history
        RowVectorXf error = predictions - targets;
        return error.array().square().sum() / targets.size();
    }
    ...
};

After that, we need to create three private member functions: sigmoid(), threshold(), and optimize(). sigmoid() squashes any real number into the range (0, 1), threshold() turns that probability into a 0 or 1 label, and optimize() performs the same gradient descent update as the Python fit() shown later.

class LogisticRegression
{
public:
    ...
private:
    float sigmoid(float x)
    {
        return 1 / (1 + exp(-x));
    }

    float threshold(float prediction)
    {
        return prediction >= 0.5 ? 1 : 0;
    }

    void optimize(MatrixXf X, VectorXf y)
    {
        float length = X.rows();
        for (int i = 0; i < iterations; i++)
        {
            // same update rule as the Python fit() below
            RowVectorXf predictions = predict(intercept, coefficients, X);
            RowVectorXf error = predictions - y.transpose();
            intercept -= learning_rate * error.sum() / length;
            coefficients -= learning_rate * (X.transpose() * error.transpose()) / length;
        }
    }
};
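Putting the pieces together, a minimal main() to train the model might look like the following sketch. It reuses the IrisReader and LogisticRegression classes from above; the hyperparameter values are arbitrary examples.

int main()
{
    IrisReader reader("iris.csv");
    vector<Iris> flowers = reader.load();

    // learning rate and iteration count are arbitrary example values
    LogisticRegression model(0.01, 1000);
    model.fit(flowers);

    // when model goes out of scope, its destructor prints
    // the fitted intercept and coefficients
    return 0;
}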

The C++ implementation consists of only one file: main.cpp. I could have separated the implementation into multiple files, but I wanted to keep it simple and easy to understand for the sake of experimentation.

Python

Here is the structure of the Python implementation:

.
├── Regressions
│   ├── __init__.py
│   ├── linear_regression.py
│   └── logistic_regression.py
└── main.py

Let's implement the LogisticRegression class in logistic_regression.py.

import numpy as np
 
class LogisticRegression:
    def __init__(self, epochs: int = 100, learning_rate: float = 0.01):
        self.iterations: int = 1_000
        self.intercept: float = 0.0
        self.coefficients: list = list()
        self.optimizer: str = "sgd"
        self.loss_history: list = list()
        self.epochs: int = epochs
        self.learning_rate: float = learning_rate
 
    def intercept_(self):
        return self.intercept
    
    def coef_(self):
        return self.coefficients
    
    def loss_history_(self):
        return self.loss_history
    
    def sigmoid(self, x: float):
        return 1 / (1 + np.exp(-x))

    def linear_function(self, intercept: float, coefficients: np.ndarray, x):
        return intercept + np.dot(coefficients, x.T)

    def threshold(self, x):
        return np.where(x > 0.5, 1, 0)

    def predict(self, intercept: float, coefficients: np.ndarray, x: np.ndarray):
        linear_predictions = self.linear_function(intercept, coefficients, x)
        return self.threshold(self.sigmoid(linear_predictions)).flatten()

    def fit(self, X, y):
        length = len(y)
        self.coefficients = np.zeros((1, X.shape[1]))

        for _ in range(self.epochs):
            predictions = self.predict(self.intercept, self.coefficients, X)
            predictions_error = predictions - y
            self.intercept = self.intercept - self.learning_rate * np.sum(predictions_error) / length
            self.coefficients = self.coefficients - (self.learning_rate * np.dot(predictions_error, X)) / length
            self.loss_history.append(
                np.sum(predictions_error ** 2) / length
            )

Comparison

Time

To measure the execution time of the Python implementation, I made a Python function decorator called time_it in logistic_regression.py.

...
from time import time
from functools import wraps
 
def time_it(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time()
        result = func(*args, **kwargs)
        end = time()
        print(f"Function {func.__name__} took {end - start} seconds")
        return result
    return wrapper
 
class LogisticRegression:
    ...
 
    @time_it
    def fit(self, X, y):
        ...

To measure the execution time of the C++ implementation, we can use chrono.

#include <chrono>
 
int main()
{
    ...
    vector<int> iterations = {
        1, 10, 100, 1000, 10000, 100000, 1000000
    };

    for (int iter : iterations)
    {
        auto start = chrono::high_resolution_clock::now();
        ...
        auto stop = chrono::high_resolution_clock::now();
        chrono::duration<double> elapsed = stop - start;
        cout << iter << " iteration(s) took " << elapsed.count() << "s\n";
    }
 
    ...
}

Lines of Code

To determine the number of lines of code (LoC), we can use the following commands:

# For C++
find . -name '*.cpp' | xargs wc -l

# For Python
find . -name '*.py' | xargs wc -l

C++

...
171 ./main.cpp
...

Python

59 ./Regressions/logistic_regression.py
...
22 ./main.py
...

It takes 171 LoC to implement Logistic Regression in C++, while it takes only 59 LoC in Python.

Conclusion

In this post, we have implemented Logistic Regression in both C++ and Python. From the execution time graph, we can see that the C++ implementation takes less than a third of the Python execution time up to 10,000,000 iterations; beyond that, the two implementations differ by only about a third.

Besides the relatively insignificant difference in execution time, it takes considerably longer to write C++ code, since the programmer has to declare the data types of all functions and variables. The C++ code is also more verbose than the Python code, and a little knowledge of CMake is required to compile it. With C++, we end up with a lot of prerequisite knowledge and other overhead.

Given Python's simplicity, we can see why most Machine Learning libraries are written in Python: it is easier to prototype and estimate model parameters in Python, at the cost of a little performance.

References

  1. Stack Overflow. How can I count all the lines of code in a directory recursively? https://stackoverflow.com/questions/1358540/how-can-i-count-all-the-lines-of-code-in-a-directory-recursively