Parallelizing K-Means with OpenMP

Intro

Right now, everything in our current implementation of k-means clustering runs serially, one step after another. But many of these steps are independent and can happen at the same time, in parallel. In this post I’ll walk through what I did with OpenMP to parallelize the model.

OpenMP

So what is OpenMP? It’s an API for writing shared-memory parallel programs in C, C++, and Fortran. What makes it easy for developers is that you work through #pragma directives and it handles all the thread creation and synchronization for you, unlike something like the pthreads library, which is a much lower-level way of working with threads.

So how does OpenMP work? It uses the fork-join model: a single master thread executes the program sequentially. When this thread encounters a parallel directive (something like #pragma omp parallel), it forks (spawns) a team of worker threads, and those threads execute the code block inside the directive simultaneously, with the work distributed among them. At the end of the parallel block there is an implicit barrier: the master thread waits for all the workers to finish, the workers are put back to sleep in a pool, and the master resumes sequential execution.
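
Here is a minimal, self-contained sketch of that fork-join behaviour (not code from the project, just an illustration):

    #include <cstdio>
    #include <omp.h>

    int main() {
        printf("Master thread running alone\n");      // sequential part

        #pragma omp parallel                          // fork: a team of threads is spawned
        {
            int id = omp_get_thread_num();
            printf("Hello from thread %d of %d\n", id, omp_get_num_threads());
        }                                             // implicit barrier, then join

        printf("Master thread resumes alone\n");      // sequential again
        return 0;
    }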

Parallelizing KMeans Algo

Now, in order to parallelize the algorithm, we need to identify these parallel regions and add the directives. K-Means is considered “embarrassingly parallel” because the vast majority of its computational work consists of independent calculations that require zero communication between threads.

In the previous blog I talked about the E-step, which calculates the distance between each point and every centroid. It’s basically the predict function (and the first half of the fit function too). This part is embarrassingly parallel. All threads read from centroids (which are constant during this step) and data; they never modify them. Every thread writes to a unique index in the output vector, assignments[i], so no collision is physically possible. Because the threads don’t need to talk to each other, wait for each other, or lock anything, you get near-perfect linear speedup (4 cores = ~4x speed).
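
As a rough sketch (the container and variable names here are assumptions, not the exact code from the project), with data and centroids as std::vector<std::vector<double>> and assignments as std::vector<int>, the parallel E-step looks like this:

    // E-step: assign each point to its nearest centroid.
    // centroids and data are only read; assignments[i] is a unique slot per point.
    #pragma omp parallel for
    for (size_t i = 0; i < data.size(); i++) {
        double best_dist = std::numeric_limits<double>::max();  // needs <limits>
        int best_cluster = 0;
        for (size_t c = 0; c < centroids.size(); c++) {
            double dist = 0.0;
            for (size_t d = 0; d < data[i].size(); d++) {
                double diff = data[i][d] - centroids[c][d];
                dist += diff * diff;                  // squared Euclidean distance
            }
            if (dist < best_dist) {
                best_dist = dist;
                best_cluster = static_cast<int>(c);
            }
        }
        assignments[i] = best_cluster;                // unique index: no collisions
    }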

The second half of the fit function is the M-step. Here the question is: given the current assignments, what is the best centroid for each cluster? To answer it, we need to sum up the coordinates of all the points belonging to cluster 0, cluster 1, and so on. This step is still parallelizable, but the threads need to write to the same destination, and whenever that happens there is a risk of a race condition (explained in the next section). We need some kind of synchronization to prevent races, and that overhead makes the M-step less efficient to parallelize than the E-step.
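
In code, the M-step accumulation looks roughly like this (same assumed containers as above, plus new_centroids_sum and counts); the two marked lines are the shared writes that are unsafe without the synchronization discussed below:

    // M-step: accumulate coordinates and counts per cluster, then divide.
    #pragma omp parallel for
    for (size_t i = 0; i < data.size(); i++) {
        int c = assignments[i];
        for (size_t d = 0; d < data[i].size(); d++) {
            new_centroids_sum[c][d] += data[i][d];    // shared write: race without synchronization
        }
        counts[c]++;                                  // shared write: race without synchronization
    }
    for (size_t c = 0; c < centroids.size(); c++) {
        if (counts[c] == 0) continue;                 // leave empty clusters unchanged
        for (size_t d = 0; d < centroids[c].size(); d++) {
            centroids[c][d] = new_centroids_sum[c][d] / counts[c];
        }
    }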

At the end of each iteration (every pass of the fit loop), all threads must stop and wait; we can’t leave iteration 1 midway and move on to the second one. If one thread is slow, all the others must wait for it, and when that straggler causes a significant delay it is called the tail latency problem.

Challenges in Conversion

A race condition occurs when multiple threads try to modify the same memory location at the same time. The final result then depends on the luck of the timing, leading to random errors. Here are a few cases where I faced this and how I prevented each one:

  1. In the predict function I was using push_back to push the results into a vector. But push_back is not thread safe: two threads can race on the vector’s size and internal storage and corrupt it. The fix is to preallocate the vector to its final size and then have each thread write directly to exactly the index it owns.

    std::vector<int> labels(data.size());
    
    #pragma omp parallel for
    for (size_t i=0; i<data.size(); i++) {
        // FIX: Threads write to unique, non-overlapping indices.
        labels[i] = getClosestClusterIndex(data[i]);
    }
  2. Random number generation was also not thread safe, since multiple threads sharing one generator could corrupt its internal state. To fix this I gave each thread its own generator and offset its seed with the thread id.

    // Each thread owns its own generator, seeded with an offset based on its thread id
    std::mt19937 local_gen(rd() + omp_get_thread_num());
  3. In the standard K-Means update step, multiple threads needed to add their point’s coordinates to a shared new_centroids_sum vector. If Thread A reads a value (e.g., 5.0) and Thread B reads the same value before A writes back, one of the additions is lost. For this I used #pragma omp atomic, which is lightweight and efficient for simple increment or addition operations: it tells the CPU to perform the read-modify-write on that specific memory address as a single, indivisible operation.

    // Threads trying to add to the same centroid coordinate
    #pragma omp atomic
    new_centroids_sum[cluster_id][d] += data[i][d];
  4. In the Mini-Batch update, I had a block of logic where steps depended on each other:

    • Increment the count.
    • Calculate the learning rate (eta) based on that exact count.
    • Update the centroid.

    If I used atomic on each line individually, a context switch could happen between the lines. Thread B could increment the count while Thread A was calculating eta, causing Thread A to use the wrong learning rate.

    I used #pragma omp critical to create a “Critical Section.” This locks the entire block of code, ensuring that once a thread enters, it finishes all three steps before anyone else can touch those variables.

    #pragma omp critical
    {
        // These 3 lines must happen atomically together
        global_counts[c]++;
        double eta = 1.0 / global_counts[c];
        centroids[c][d] = ...
    }
  5. I needed to track global state, such as the changed flag (did any point move?) or the max_change across the entire dataset. Creating a shared variable and updating it with atomic or critical inside the loop is very slow, because it creates a bottleneck where threads wait in line. Instead I used OpenMP Reductions: every thread gets its own private copy of the variable (initialized to false or 0.0), and at the end of the loop OpenMP efficiently combines the copies (logical OR for the boolean, MAX for the double) without explicit locking. The boolean case is shown below, and a sketch of the MAX case follows right after it.

    bool changed = false;
    // Safely combines all thread results using Logical OR (||)
    #pragma omp parallel for reduction(|| : changed)
    for (int i=0; i<N; i++) {
        if (...) changed = true;
    }
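
And a minimal sketch of the max_change case, assuming k clusters, the old/new centroid arrays, and a hypothetical centroid_distance helper:

    double max_change = 0.0;
    // Every thread keeps a private maximum; OpenMP combines them with MAX at the end.
    #pragma omp parallel for reduction(max : max_change)
    for (int c = 0; c < k; c++) {
        double shift = centroid_distance(old_centroids[c], centroids[c]);  // assumed helper
        if (shift > max_change) max_change = shift;
    }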

Future Improvements

1. Eliminating Atomic Overhead with “Array Reduction” (Privatization)

Currently, we use #pragma omp atomic to safely update the global centroid sums. While correct, this creates contention. If 4 threads try to add to “Cluster 0” at the same time, 3 of them must wait.

  • The Fix: We can give every thread its own private new_centroids_sum array. Threads write to their private copies without any locks (zero waiting). Once the loop finishes, we merge all the private arrays into the global one. This trades higher memory usage for maximum speed.
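
A sketch of that privatization pattern (assuming k clusters, dims dimensions, and the same containers as before):

    #pragma omp parallel
    {
        // Private per-thread buffers: no locks or atomics while accumulating.
        std::vector<std::vector<double>> local_sum(k, std::vector<double>(dims, 0.0));
        std::vector<long> local_count(k, 0);

        #pragma omp for nowait
        for (size_t i = 0; i < data.size(); i++) {
            int c = assignments[i];
            for (int d = 0; d < dims; d++) local_sum[c][d] += data[i][d];
            local_count[c]++;
        }

        // Each thread merges its private buffers into the shared arrays exactly once.
        #pragma omp critical
        {
            for (int c = 0; c < k; c++) {
                for (int d = 0; d < dims; d++) new_centroids_sum[c][d] += local_sum[c][d];
                counts[c] += local_count[c];
            }
        }
    }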

2. Solving “False Sharing”

Our counts vector stores integers consecutively in memory. Modern CPUs read memory in 64-byte chunks called Cache Lines.

  • The Problem: counts[0] and counts[1] likely sit on the same cache line. If Core 1 updates counts[0] and Core 2 updates counts[1], the CPU cores fight over ownership of that specific cache line, causing massive traffic on the interconnect bus.
  • The Fix: We can implement Padding: force each counter to start on its own cache line (by aligning the data to 64 bytes), ensuring that threads on different cores never touch the same physical cache line, as sketched below.
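
One way to sketch this in C++17 is to wrap each counter so it occupies its own cache line:

    // Each counter is aligned to (and padded out to) a 64-byte cache line,
    // so updates to neighbouring clusters never contend for the same line.
    struct alignas(64) PaddedCount {
        long value = 0;
    };

    std::vector<PaddedCount> padded_counts(k);
    // Inside the update loop: padded_counts[c].value++ instead of counts[c]++.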

3. Vectorization (SIMD)

OpenMP handles Thread-Level Parallelism (using multiple cores). However, we are not fully utilizing Instruction-Level Parallelism (doing more math per cycle on a single core).

  • The Fix: We can use #pragma omp simd inside our distance calculation loops. This instructs the compiler to use AVX/AVX2 vector registers, allowing the CPU to process the math for 4 or 8 dimensions per instruction rather than one element at a time.
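
For example, the inner distance loop could be annotated like this (a sketch; compiler flags such as -O3 or -march=native still matter):

    double dist = 0.0;
    // Ask the compiler to vectorize the loop and combine the partial sums.
    #pragma omp simd reduction(+ : dist)
    for (size_t d = 0; d < dims; d++) {
        double diff = point[d] - centroid[d];
        dist += diff * diff;
    }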

Is 8 threads faster than 4?

You might assume that if your CPU has 8 logical threads (due to Hyper-Threading/SMT), running K-Means with 8 threads is automatically faster than running it with 4.

The Answer: Usually No.

Hyper-Threading works by duplicating the architectural state (registers) of a core, but not the actual Execution Units (the ALU/FPU calculators). It is designed to let Thread B work while Thread A is waiting for slow memory (RAM).

However, K-Means is a math-heavy algorithm. It doesn’t wait for memory much; it constantly hammers the Floating Point Unit (FPU) to calculate distances.

  • If you put 2 threads on 1 physical core, they both fight for the same FPU.
  • Thread B ends up stalling until Thread A finishes its math.

For compute-bound tasks like this, it is often faster to pin your thread count to the number of Physical Cores (e.g., OMP_NUM_THREADS=4) rather than the number of Logical Cores, to avoid the overhead of context switching without gaining any extra computation power.
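
For example (assuming a 4-core / 8-thread CPU), the thread count can be pinned either from the environment or in code:

    #include <omp.h>

    int main() {
        // Same effect as launching the program with OMP_NUM_THREADS=4 set:
        omp_set_num_threads(4);   // pin to the number of physical cores
        // run fit() / predict() as usual from here
        return 0;
    }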