What is K-Means clustering?
At its core, K-Means is an algorithm designed to solve a specific optimization problem: minimizing the Within-Cluster Sum of Squares (WCSS), also known as Inertia. WCSS quantifies cluster quality by summing the squared distances from each point to its cluster's centroid:

$$J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

where $C_j$ is the set of points assigned to cluster $j$ and $\mu_j$ is its centroid. This equation tells us that a "good" cluster is one where the variance (squared distance from the center) is minimized.
Why is this hard? Finding the global minimum for this function is NP-hard: there are roughly $K^N / K!$ ways to partition $N$ points into $K$ clusters. You cannot check them all.
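To make the objective concrete, here is a minimal C++ sketch that computes the WCSS/Inertia for a given assignment. The names (Point, computeInertia) are illustrative, not taken from my actual implementation.

#include <vector>
#include <cstddef>

using Point = std::vector<double>;

// Sum of squared distances from each point to the centroid of its assigned cluster.
double computeInertia(const std::vector<Point>& data,
                      const std::vector<int>& assignments,
                      const std::vector<Point>& centroids) {
    double j = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        const Point& mu = centroids[assignments[i]];
        for (std::size_t f = 0; f < data[i].size(); ++f) {
            double diff = data[i][f] - mu[f];
            j += diff * diff;   // squared distance accumulates into J
        }
    }
    return j;
}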
K-Means is a specific variant of the Expectation-Maximization (EM) algorithm, which has two key phases:
Phase 1: The Expectation (E-Step) / Assignment:
- “Given the current centroids, what is the most likely cluster for each point?”
- We minimize $J$ with respect to the assignments while holding the centroids $\mu_j$ constant.
- The mathematically optimal assignment for a point is simply the cluster with the closest centroid: $c_i = \arg\min_{j} \lVert x_i - \mu_j \rVert^2$.
Phase 2: The Maximization (M-Step) / Update
- “Given the current assignments, what is the best centroid for this cluster?”
- We minimize $J$ with respect to the centroids $\mu_j$ while holding the assignments constant.
- If you take the derivative of the summation with respect to $\mu_j$ and set it to 0, the result is exactly the arithmetic mean: $\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$.
The E-Step (reassigning points) never increases the total error $J$, because each point moves only to a closer (or equally close) centroid.
The M-Step (moving centroids) never increases $J$ either, because the mean is the unique minimizer of squared error within each cluster.
Since $J$ is bounded below by zero and never increases, the algorithm is guaranteed to converge, though possibly to a local minimum rather than the global one.
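To show how the two phases fit together, here is a minimal C++ sketch of a single Lloyd iteration. The names (Point, squaredDistance, lloydIteration) are illustrative and simplified, not the API of my actual implementation.

#include <vector>
#include <limits>
#include <cstddef>

using Point = std::vector<double>;

double squaredDistance(const Point& a, const Point& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double diff = a[i] - b[i];
        d += diff * diff;
    }
    return d;
}

// One Lloyd iteration: E-step (assign) followed by M-step (update).
// Assumes data is non-empty and all points share the same dimensionality.
void lloydIteration(const std::vector<Point>& data,
                    std::vector<Point>& centroids,
                    std::vector<int>& assignments) {
    const std::size_t k = centroids.size();
    const std::size_t dim = data[0].size();

    // E-step: assign each point to its nearest centroid.
    for (std::size_t i = 0; i < data.size(); ++i) {
        double best = std::numeric_limits<double>::max();
        int bestCluster = 0;
        for (std::size_t j = 0; j < k; ++j) {
            double d = squaredDistance(data[i], centroids[j]);
            if (d < best) { best = d; bestCluster = static_cast<int>(j); }
        }
        assignments[i] = bestCluster;
    }

    // M-step: recompute each centroid as the mean of its assigned points.
    std::vector<Point> sums(k, Point(dim, 0.0));
    std::vector<std::size_t> counts(k, 0);
    for (std::size_t i = 0; i < data.size(); ++i) {
        int c = assignments[i];
        for (std::size_t f = 0; f < dim; ++f) sums[c][f] += data[i][f];
        ++counts[c];
    }
    for (std::size_t j = 0; j < k; ++j) {
        if (counts[j] == 0) continue;   // leave empty clusters where they are
        for (std::size_t f = 0; f < dim; ++f)
            centroids[j][f] = sums[j][f] / counts[j];
    }
}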
Taking a look at the time complexity: each iteration costs $O(N \cdot K \cdot d)$, where $N$ is the number of points, $K$ the number of clusters, and $d$ the dimensionality.
If $N$ is massive, the linear dependence on $N$ is your bottleneck. You might need "Mini-Batch K-Means" later.
Mini-Batch K-Means
In standard (batch) K-Means, updating the centroids requires the entire dataset to fit into memory. If the dataset is large (say, $N \ge 10^6$ points), using mini-batches is more efficient: each update only needs the $K$ centroids and a small mini-batch of size $B$ in memory. It is significantly faster per iteration, but may require more total iterations to converge.
Taking a look at the formula, the centroid update is a weighted average between the old centroid position and the new data point:

$$\mu_j \leftarrow (1 - \eta)\,\mu_j + \eta\, x$$

Where:
• $\mu_j$ is the current position of the centroid.
• $x$ is the new data point (or the average of the batch assigned to this cluster).
• $n_j$ is the total count of points assigned to this cluster throughout the entire training process (history + current batch).
• $\eta$ (eta) is the learning rate, typically defined as $\eta = \frac{1}{n_j}$.
The time complexity is $O(B \cdot K \cdot d)$ per iteration.
Since the batch size is usually much smaller than the dataset size ($B \ll N$), Mini-Batch K-Means is significantly faster per iteration.
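Here is a minimal C++ sketch of that update rule, under the assumption that each point in the mini-batch has already been assigned to its nearest centroid. The names (miniBatchUpdate, counts) are illustrative, not my library's API.

#include <vector>
#include <cstddef>

using Point = std::vector<double>;

// For each point in the batch, nudge its assigned centroid toward the point
// with a per-cluster learning rate eta = 1 / n_j, where n_j counts all points
// ever assigned to cluster j (history + current batch).
void miniBatchUpdate(const std::vector<Point>& batch,
                     const std::vector<int>& batchAssignments,
                     std::vector<Point>& centroids,
                     std::vector<std::size_t>& counts) {
    for (std::size_t i = 0; i < batch.size(); ++i) {
        int j = batchAssignments[i];
        counts[j] += 1;                                    // lifetime count n_j
        double eta = 1.0 / static_cast<double>(counts[j]); // learning rate
        for (std::size_t f = 0; f < centroids[j].size(); ++f) {
            // mu_j <- (1 - eta) * mu_j + eta * x
            centroids[j][f] = (1.0 - eta) * centroids[j][f] + eta * batch[i][f];
        }
    }
}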
The Code
I’ve come up with a naive C++ implementation of K-Means. I think the code itself isn’t too hard to understand, so I will be linking it here so that you can read it. But most people don’t want to use a library written in C++ directly, so I will create Python bindings to make it more accessible.
For the binding part, I made use of pybind11. I created a bindings.cpp file, which tells Python how to talk to the C++ code.
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include "kmeans.hpp"
namespace py = pybind11;
PYBIND11_MODULE(kmeans_cpp, m) {
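// Expose the KMeansMode enum so Python can choose between batch and mini-batch modes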
py::enum_<KMeansMode>(m, "KMeansMode")
.value("BATCH", KMeansMode::BATCH)
.value("MINI_BATCH", KMeansMode::MINI_BATCH)
.export_values();
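// Expose the KMeans class; the keyword arguments mirror the C++ constructor's parameters and defaults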
py::class_<KMeans>(m, "KMeans")
.def(py::init<int, KMeansMode, int, double, int>(),
py::arg("k"),
py::arg("mode") = KMeansMode::BATCH,
py::arg("max_iter") = 300,
py::arg("tol") = 1e-4,
py::arg("b_size") = 1024)
.def("fit", &KMeans::fit)
.def("predict", &KMeans::predict)
.def_property_readonly("centroids", &KMeans::getCentroids);
}
Afterwards, I also created a setup.py file to compile everything into a Python package. You can also use pyproject.toml, but I found setup.py to be more familiar and easier to set up (no pun intended).
from setuptools import setup, Extension
import pybind11
cpp_args = ['-std=c++11']
ext_modules = [
Extension(
'kmeans_cpp',
['bindings.cpp'],
include_dirs=[pybind11.get_include(), '.'],
language='c++',
extra_compile_args=cpp_args,
),
]
setup(
name='kmeans_cpp',
version='0.0.1',
author='Your Name',
description='C++ KMeans bindings',
ext_modules=ext_modules,
)
Once you’ve set up the library (for example, by running pip install . in the project directory), you can start to make use of it.
import kmeans_cpp
data = [
[1.0, 2.0],
[1.1, 2.1],
[10.0, 10.0],
[10.1, 10.1]
]
# Use the enum and class just like normal Python
mode = kmeans_cpp.KMeansMode.MINI_BATCH
model = kmeans_cpp.KMeans(k=2, mode=mode, b_size=2)
model.fit(data)
labels = model.predict(data)
print("Labels:", labels)
print("Centroids:", model.centroids)
In future posts, I’ll cover parallelizing K-means using OpenMP and CUDA for better performance on large datasets.