A research team has released Flash-KMeans, an open-source library that runs exact K-means clustering more than 200 times faster than FAISS on GPUs, without changing the math. K-means is one of the oldest and most widely used clustering algorithms, the workhorse behind everything from vector-database indexing to image compression, and Flash-KMeans does not approximate it or alter Lloyd's classic procedure. Instead, it rethinks how the algorithm moves data through a modern GPU, the same insight that made FlashAttention fast: the bottleneck is not the arithmetic, it is the memory traffic.
The speedup comes from two kernel-level ideas. The first, FlashAssign, fuses the distance computation with an online argmin so the assignment step never has to write a giant intermediate distance matrix out to memory and read it back; the work stays in fast on-chip memory. The second, called sort-inverse update, attacks the other expensive step, recomputing each cluster's center. Done naively, that step has thousands of GPU threads fighting to add into the same few accumulators, a high-contention pattern that throttles the hardware. Flash-KMeans builds an explicit inverse mapping that turns those scattered atomic writes into orderly, high-bandwidth, segment-level reductions instead.
On an NVIDIA H200, the team reports up to a 17.9 times end-to-end speedup over the best prior baseline, 33 times faster than NVIDIA's own cuML, and more than 200 times faster than FAISS, the library most people reach for when they need K-means at scale. And it is exact, producing the same clusters Lloyd's algorithm always would, not a faster approximation that quietly trades away accuracy. The library is written in Triton, the Python-like language for GPU kernels, ships under the permissive Apache 2.0 license, and installs with a single pip command. The paper is on arXiv and the code is on GitHub.
The result is a small, sharp example of a pattern that keeps paying off in machine learning systems: the biggest wins often come not from a new algorithm but from re-implementing an old one to respect the memory hierarchy of the hardware it runs on. FlashAttention did this for the transformer's attention, and Flash-KMeans does it for clustering, a primitive that sits underneath retrieval, deduplication, and the vector search powering a lot of today's AI applications. A 200 times speedup on an exact, unchanged algorithm is not a marginal optimization, it is the difference between clustering a dataset in seconds and waiting minutes. The honest caveat is that headline multipliers depend on the hardware, the dataset, and how the baseline was configured, so real-world gains will vary, but the direction is clear, and the code is right there to check.
