What is K-Means clustering?
At its core, K-Means is an algorithm designed to solve a specific optimization problem: minimizing the Within-Cluster Sum of Squares (WCSS), also known as Inertia. WCSS quantifies cluster quality by summing the squared distances from each point to its cluster's centroid:

$$J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

where $C_j$ is the set of points assigned to cluster $j$ and $\mu_j$ is its centroid. This equation tells us that a "good" cluster is one where the variance (squared distance from the center) is minimized.
Why is this hard? Finding the global minimum for this function is NP-hard: there are roughly $K^N / K!$ ways to partition $N$ points into $K$ clusters. You cannot check them all.
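To make the objective concrete, here is a minimal C++ sketch that computes the WCSS/Inertia for a given assignment. The names (Point, computeInertia) are illustrative, not taken from my actual implementation.

#include <vector>
#include <cstddef>

using Point = std::vector<double>;

// Sum of squared distances from each point to the centroid of its assigned cluster.
double computeInertia(const std::vector<Point>& data,
                      const std::vector<int>& assignments,
                      const std::vector<Point>& centroids) {
    double j = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        const Point& mu = centroids[assignments[i]];
        for (std::size_t f = 0; f < data[i].size(); ++f) {
            double diff = data[i][f] - mu[f];
            j += diff * diff;   // squared distance accumulates into J
        }
    }
    return j;
}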
K-Means is a specific variant of the Expectation-Maximization (EM) algorithm, which has two key phases:
Phase 1: The Expectation (E-Step) / Assignment:
- “Given the current centroids, what is the most likely cluster for each point?”
- We minimize $J$ with respect to the assignments while holding the centroids $\mu_j$ constant.
- The mathematically optimal assignment for a point is simply the cluster with the closest centroid: $c_i = \arg\min_{j} \lVert x_i - \mu_j \rVert^2$.
Phase 2: The Maximization (M-Step) / Update
- “Given the current assignments, what is the best centroid for this cluster?”
- We minimize $J$ with respect to the centroids $\mu_j$ while holding the assignments constant.
- If you take the derivative of the summation with respect to $\mu_j$ and set it to 0, the result is exactly the arithmetic mean: $\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$.
The E-Step (reassigning points) never increases the total error $J$, because each point moves only to a closer (or equally close) centroid.
The M-Step (moving centroids) never increases $J$ either, because the mean is the unique minimizer of squared error within each cluster.
Since $J$ is bounded below by zero and never increases, the algorithm is guaranteed to converge, though possibly to a local minimum rather than the global one.
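To show how the two phases fit together, here is a minimal C++ sketch of a single Lloyd iteration. The names (Point, squaredDistance, lloydIteration) are illustrative and simplified, not the API of my actual implementation.

#include <vector>
#include <limits>
#include <cstddef>

using Point = std::vector<double>;

double squaredDistance(const Point& a, const Point& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double diff = a[i] - b[i];
        d += diff * diff;
    }
    return d;
}

// One Lloyd iteration: E-step (assign) followed by M-step (update).
// Assumes data is non-empty and all points share the same dimensionality.
void lloydIteration(const std::vector<Point>& data,
                    std::vector<Point>& centroids,
                    std::vector<int>& assignments) {
    const std::size_t k = centroids.size();
    const std::size_t dim = data[0].size();

    // E-step: assign each point to its nearest centroid.
    for (std::size_t i = 0; i < data.size(); ++i) {
        double best = std::numeric_limits<double>::max();
        int bestCluster = 0;
        for (std::size_t j = 0; j < k; ++j) {
            double d = squaredDistance(data[i], centroids[j]);
            if (d < best) { best = d; bestCluster = static_cast<int>(j); }
        }
        assignments[i] = bestCluster;
    }

    // M-step: recompute each centroid as the mean of its assigned points.
    std::vector<Point> sums(k, Point(dim, 0.0));
    std::vector<std::size_t> counts(k, 0);
    for (std::size_t i = 0; i < data.size(); ++i) {
        int c = assignments[i];
        for (std::size_t f = 0; f < dim; ++f) sums[c][f] += data[i][f];
        ++counts[c];
    }
    for (std::size_t j = 0; j < k; ++j) {
        if (counts[j] == 0) continue;   // leave empty clusters where they are
        for (std::size_t f = 0; f < dim; ++f)
            centroids[j][f] = sums[j][f] / counts[j];
    }
}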
Taking a look at the time complexity: each iteration costs $O(N \cdot K \cdot d)$, where $N$ is the number of points, $K$ the number of clusters, and $d$ the dimensionality.
If $N$ is massive, the linear dependence on $N$ is your bottleneck. You might need "Mini-Batch K-Means" later.
Mini-Batch K-Means
In standard (batch) K-Means, updating the centroids requires the entire dataset to fit into memory. If the dataset is large (say, $N \ge 10^6$ points), using mini-batches is more efficient: each update only needs the $K$ centroids and a small mini-batch of size $B$ in memory. It is significantly faster per iteration, but may require more total iterations to converge.
Taking a look at the formula, the centroid update is a weighted average between the old centroid position and the new data point:

$$\mu_j \leftarrow (1 - \eta)\,\mu_j + \eta\, x$$

Where:
• $\mu_j$ is the current position of the centroid.
• $x$ is the new data point (or the average of the batch assigned to this cluster).
• $n_j$ is the total count of points assigned to this cluster throughout the entire training process (history + current batch).
• $\eta$ (eta) is the learning rate, typically defined as $\eta = \frac{1}{n_j}$.
The time complexity is $O(B \cdot K \cdot d)$ per iteration.
Since the batch size is usually much smaller than the dataset size ($B \ll N$), Mini-Batch K-Means is significantly faster per iteration.
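Here is a minimal C++ sketch of that update rule, under the assumption that each point in the mini-batch has already been assigned to its nearest centroid. The names (miniBatchUpdate, counts) are illustrative, not my library's API.

#include <vector>
#include <cstddef>

using Point = std::vector<double>;

// For each point in the batch, nudge its assigned centroid toward the point
// with a per-cluster learning rate eta = 1 / n_j, where n_j counts all points
// ever assigned to cluster j (history + current batch).
void miniBatchUpdate(const std::vector<Point>& batch,
                     const std::vector<int>& batchAssignments,
                     std::vector<Point>& centroids,
                     std::vector<std::size_t>& counts) {
    for (std::size_t i = 0; i < batch.size(); ++i) {
        int j = batchAssignments[i];
        counts[j] += 1;                                    // lifetime count n_j
        double eta = 1.0 / static_cast<double>(counts[j]); // learning rate
        for (std::size_t f = 0; f < centroids[j].size(); ++f) {
            // mu_j <- (1 - eta) * mu_j + eta * x
            centroids[j][f] = (1.0 - eta) * centroids[j][f] + eta * batch[i][f];
        }
    }
}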
The Code
I’ve come up with a naive C++ implementation of K-Means. I think the code itself isn’t too hard to understand, so I will be linking it here so that you can read it. But most people don’t want to use a library written in C++ directly, so I will create Python bindings to make it more accessible.
For the binding part, I made use of pybind11. I created a bindings.cpp file, which tells Python how to talk to the C++ code.
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include "kmeans.hpp"
namespace py = pybind11;
PYBIND11_MODULE(kmeans_cpp, m) {
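// Expose the KMeansMode enum so Python can choose between batch and mini-batch modes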
py::enum_<KMeansMode>(m, "KMeansMode")
.value("BATCH", KMeansMode::BATCH)
.value("MINI_BATCH", KMeansMode::MINI_BATCH)
.export_values();
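// Expose the KMeans class; the keyword arguments mirror the C++ constructor's parameters and defaults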
py::class_<KMeans>(m, "KMeans")
.def(py::init<int, KMeansMode, int, double, int>(),
py::arg("k"),
py::arg("mode") = KMeansMode::BATCH,
py::arg("max_iter") = 300,
py::arg("tol") = 1e-4,
py::arg("b_size") = 1024)
.def("fit", &KMeans::fit)
.def("predict", &KMeans::predict)
.def_property_readonly("centroids", &KMeans::getCentroids);
}
Afterwards, I also created a setup.py file to compile everything into a Python package. You can also use pyproject.toml, but I found setup.py to be more familiar and easier to set up (no pun intended).
from setuptools import setup, Extension
import pybind11
cpp_args = ['-std=c++11']
ext_modules = [
Extension(
'kmeans_cpp',
['bindings.cpp'],
include_dirs=[pybind11.get_include(), '.'],
language='c++',
extra_compile_args=cpp_args,
),
]
setup(
name='kmeans_cpp',
version='0.0.1',
author='Your Name',
description='C++ KMeans bindings',
ext_modules=ext_modules,
)
Once you’ve set up the library (for example, by running pip install . in the project directory), you can start to make use of it.
import kmeans_cpp
data = [
[1.0, 2.0],
[1.1, 2.1],
[10.0, 10.0],
[10.1, 10.1]
]
# Use the enum and class just like normal Python
mode = kmeans_cpp.KMeansMode.MINI_BATCH
model = kmeans_cpp.KMeans(k=2, mode=mode, b_size=2)
model.fit(data)
labels = model.predict(data)
print("Labels:", labels)
print("Centroids:", model.centroids)
In future posts, I’ll cover parallelizing K-means using OpenMP and CUDA for better performance on large datasets.