Kmeans.cpp

What is K-Means clustering?

At its core, K-Means is an algorithm designed to solve a specific optimization problem: minimizing the Within-Cluster Sum of Squares (WCSS), also known as Inertia. WCSS quantifies cluster quality by summing the squared distances from each point to its cluster’s centroid.

$$J = \sum_{j=1}^{K} \sum_{i \in C_j} ||x_i - \mu_j||^2$$

This equation tells us that a “good” clustering is one where the spread within each cluster (the squared distance from each point to its center) is minimized.
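To make the objective concrete, here is a minimal C++ sketch (illustrative only, not the repo’s actual code) that computes $J$ for a given assignment:

#include <vector>

// Illustrative sketch: compute the WCSS / inertia J for a given assignment.
// points and centroids are stored row-major, with D features per row.
double wcss(const std::vector<double>& points,
            const std::vector<double>& centroids,
            const std::vector<int>& labels,
            int N, int D) {
    double J = 0.0;
    for (int i = 0; i < N; ++i) {
        for (int d = 0; d < D; ++d) {
            double diff = points[i * D + d] - centroids[labels[i] * D + d];
            J += diff * diff;  // squared distance to the assigned centroid
        }
    }
    return J;
}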

Why is this hard? Finding the global minimum for this function is NP-Hard, as there are roughly $K^N$ ways to partition $N$ points into $K$ clusters. You cannot check them all.

K-Means is a specific variant of the Expectation-Maximization (EM) algorithm, which alternates between two key phases:

Phase 1: The Expectation (E-Step) / Assignment:

  • “Given the current centroids, what is the most likely cluster for each point?”
  • We minimize $J$ with respect to the assignments $C_j$ while holding the centroids $\mu_j$ constant.
  • The mathematically optimal assignment for a point $x_i$ is simply the cluster $j$ with the closest centroid $\mu_j$ (see the sketch after this list).
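To make the E-Step concrete, here is a minimal C++ sketch of the assignment (illustrative only; assign_points is my own name, not necessarily the repo’s):

#include <vector>
#include <limits>

// E-step sketch: assign each point to its nearest centroid.
// labels must already be sized to N; layout is row-major with D features per row.
void assign_points(const std::vector<double>& points,
                   const std::vector<double>& centroids,
                   std::vector<int>& labels,
                   int N, int K, int D) {
    for (int i = 0; i < N; ++i) {
        double best = std::numeric_limits<double>::max();
        for (int j = 0; j < K; ++j) {
            double dist = 0.0;  // squared Euclidean distance to centroid j
            for (int d = 0; d < D; ++d) {
                double diff = points[i * D + d] - centroids[j * D + d];
                dist += diff * diff;
            }
            if (dist < best) {
                best = dist;
                labels[i] = j;
            }
        }
    }
}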

Phase 2: The Maximization (M-Step) / Update

  • “Given the current assignments, what is the best centroid for this cluster?”
  • We minimize $J$ with respect to the centroids $\mu_j$ while holding the assignments $C_j$ constant.
  • If you take the derivative of the summation $\sum (x_i - \mu)^2$ with respect to $\mu$ and set it to 0, the result is exactly the arithmetic mean.

$$\frac{\partial}{\partial \mu_j} \sum_{i \in C_j} (x_i - \mu_j)^2 = -2 \sum_{i \in C_j} (x_i - \mu_j) = 0 \implies \sum_{i \in C_j} x_i = |C_j|\,\mu_j \implies \mu_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i$$
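And the matching sketch of the M-Step (again illustrative; update_centroids is a name I made up):

#include <vector>

// M-step sketch: recompute each centroid as the mean of its assigned points.
void update_centroids(const std::vector<double>& points,
                      std::vector<double>& centroids,
                      const std::vector<int>& labels,
                      int N, int K, int D) {
    std::vector<double> sums(K * D, 0.0);
    std::vector<int> counts(K, 0);
    for (int i = 0; i < N; ++i) {
        ++counts[labels[i]];
        for (int d = 0; d < D; ++d)
            sums[labels[i] * D + d] += points[i * D + d];
    }
    for (int j = 0; j < K; ++j)
        if (counts[j] > 0)  // leave empty clusters where they are
            for (int d = 0; d < D; ++d)
                centroids[j * D + d] = sums[j * D + d] / counts[j];
}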

The E-Step (reassigning points) never increases the total error $J$, because we explicitly move points to closer centroids.

The M-Step (moving centroids) never increases the total error $J$, because the mean is the unique minimizer of squared error.

Since $J$ is bounded below by zero and never increases, alternating the two steps is guaranteed to converge, although only to a local minimum.
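Putting it together, a fit loop simply alternates the two steps until $J$ stops improving. This sketch assumes the wcss, assign_points, and update_centroids helpers from above:

#include <limits>
#include <vector>

// Fit-loop sketch: alternate E- and M-steps until J stops decreasing.
void fit(const std::vector<double>& points,
         std::vector<double>& centroids,
         std::vector<int>& labels,
         int N, int K, int D, int max_iter, double tol) {
    double prev = std::numeric_limits<double>::max();
    for (int t = 0; t < max_iter; ++t) {
        assign_points(points, centroids, labels, N, K, D);      // E-step
        update_centroids(points, centroids, labels, N, K, D);   // M-step
        double J = wcss(points, centroids, labels, N, D);
        if (prev - J < tol) break;  // converged: J no longer decreasing
        prev = J;
    }
}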

Taking a look at the time complexity:

$$Time\ Complexity: O(T \cdot K \cdot N \cdot D)$$

  • $T$: number of iterations (usually small, e.g., 20-50)
  • $K$: number of clusters
  • $N$: number of data points
  • $D$: dimensions (features)

If $N$ is massive, the linear dependence on $N$ is your bottleneck. You might need “Mini-Batch K-Means” later.

Mini-Batch K-Means

In standard (batch) K-Means, updating the centroids requires the entire dataset to fit in memory. If the dataset is large (say, $N \geq 10^6$ points), mini-batches are more efficient: each iteration only needs the $K$ centroids and one small mini-batch of size $b$ in memory. Each iteration is significantly faster, though the algorithm may need more iterations in total.

Taking a look at the formula, the centroid update is a weighted average between the old centroid position and the new data point.

$$\mathbf{c}_{new} = (1 - \eta)\mathbf{c}_{old} + \eta\mathbf{x}$$

Where:

  • $\mathbf{c}_{old}$ is the current position of the centroid.
  • $\mathbf{x}$ is the new data point (or the average of the batch points assigned to this cluster).
  • $n$ is the total count of points assigned to this cluster throughout the entire training process (history + current batch).
  • $\eta$ (eta) is the learning rate, typically defined as $\eta = \frac{1}{n}$.
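A minimal sketch of this update rule in C++, assuming a per-cluster running count (minibatch_update is my own name, not the repo’s):

#include <vector>

// Mini-batch update sketch: pull the centroid toward one point from the batch.
// counts[j] tracks how many points cluster j has absorbed over all batches.
void minibatch_update(std::vector<double>& centroids,
                      std::vector<long>& counts,
                      const std::vector<double>& point,  // one point, size D
                      int label, int D) {
    ++counts[label];
    double eta = 1.0 / counts[label];           // learning rate decays as 1/n
    for (int d = 0; d < D; ++d) {
        double& c = centroids[label * D + d];
        c = (1.0 - eta) * c + eta * point[d];   // c_new = (1 - eta) * c_old + eta * x
    }
}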

The time complexity is

$$Time\ Complexity: O(T \cdot K \cdot b \cdot D)$$

  • $T$: number of iterations
  • $K$: number of clusters
  • $b$: mini-batch size
  • $D$: dimensions (features)

Since the batch size $b$ is usually much smaller than the dataset size $N$ ($b \ll N$), Mini-Batch K-Means is significantly faster per iteration.

The Code

I’ve written a naive C++ implementation of K-Means. The code itself isn’t too hard to understand, so I’ll link it here for you to read. But most people don’t want to use a library written in C++ directly, so I created Python bindings to make it more accessible.

For the binding part, I made use of pybind11. I created a bindings.cpp file, which tells Python how to talk to the C++ code.

#include <pybind11/pybind11.h>
#include <pybind11/stl.h>  // automatic conversion between std::vector and Python lists
#include "kmeans.hpp"

namespace py = pybind11;

PYBIND11_MODULE(kmeans_cpp, m) {

    // Expose the KMeansMode enum so Python can choose BATCH or MINI_BATCH
    py::enum_<KMeansMode>(m, "KMeansMode")
        .value("BATCH", KMeansMode::BATCH)
        .value("MINI_BATCH", KMeansMode::MINI_BATCH)
        .export_values();

    // Expose the KMeans class with keyword arguments and sensible defaults
    py::class_<KMeans>(m, "KMeans")
        .def(py::init<int, KMeansMode, int, double, int>(),
             py::arg("k"), 
             py::arg("mode") = KMeansMode::BATCH, 
             py::arg("max_iter") = 300, 
             py::arg("tol") = 1e-4, 
             py::arg("b_size") = 1024)
        
        .def("fit", &KMeans::fit)
        .def("predict", &KMeans::predict)
        .def_property_readonly("centroids", &KMeans::getCentroids);
}

Afterwards, I created a setup.py file to compile everything into a Python package. You can also use pyproject.toml, but I found setup.py to be more familiar and easier to set up (no pun intended).

from setuptools import setup, Extension
import pybind11

# pybind11 needs at least C++11
cpp_args = ['-std=c++11']

ext_modules = [
    Extension(
        'kmeans_cpp',    # module name; must match PYBIND11_MODULE above
        ['bindings.cpp'],
        include_dirs=[pybind11.get_include(), '.'],
        language='c++',
        extra_compile_args=cpp_args,
    ),
]

setup(
    name='kmeans_cpp',
    version='0.0.1',
    author='Your Name',
    description='C++ KMeans bindings',
    ext_modules=ext_modules,
)

Once you’ve set up the library (for example, by running pip install . from the project root), you can start to make use of it.

import kmeans_cpp

data = [
    [1.0, 2.0], 
    [1.1, 2.1], 
    [10.0, 10.0], 
    [10.1, 10.1]
]

# Use the enum and class just like normal Python
mode = kmeans_cpp.KMeansMode.MINI_BATCH
model = kmeans_cpp.KMeans(k=2, mode=mode, b_size=2)

model.fit(data)
labels = model.predict(data)

print("Labels:", labels)
print("Centroids:", model.centroids)

In future posts, I’ll cover parallelizing K-means using OpenMP and CUDA for better performance on large datasets.

GitHub - gangula-karthik/kmeans.cpp