Flash-KMeans Accelerates K-Means Algorithm Over 200x on GPUs
Researchers from UC Berkeley and UT Austin have developed Flash-KMeans, an open-source library that significantly accelerates the standard Lloyd's k-means algorithm on GPUs. This IO-aware implementation, built with Triton GPU kernels, focuses on restructuring data movement rather than altering the core mathematics or approximating results. Benchmarking on an NVIDIA H200 GPU indicates Flash-KMeans achieves an end-to-end speedup of up to 17.9 times over leading baselines, 33 times over NVIDIA cuML, and more than 200 times faster than FAISS.
Flash-KMeans is designed to address the increasing demand for k-means computations within modern AI training and inference loops, where call latency is critical. The library ships under an Apache 2.0 license and is installable via pip.
The performance gains stem from Flash-KMeans' approach to two primary bottlenecks in GPU-based k-means: the assignment stage and the centroid update stage. The assignment stage, which calculates each point's distance to every centroid, traditionally involves materializing a large distance matrix in high-bandwidth memory (HBM). Flash-KMeans replaces this with FlashAssign, an approach that streams tiles of points and centroids from HBM into on-chip SRAM, fusing distance computation with an online argmin. This eliminates the need to materialize the full N×K matrix, reducing IO complexity.
For the centroid update stage, standard methods often rely on scatter-style atomic additions, leading to contention when multiple threads target the same 'hot' cluster. Flash-KMeans introduces Sort-Inverse Update. This method sorts the 1D assignment vector by cluster ID, creating contiguous segments for identical cluster IDs. Each thread block then reduces a segment on-chip before issuing a single atomic add per segment, thereby reducing atomic contention.
Specific kernel-level optimizations show FlashAssign achieving up to 21.2 times speedup in assignment operations, while Sort-Inverse Update reaches up to 6.3 times speedup in centroid updates. For large-scale out-of-core operations, Flash-KMeans reports up to 10.5 times faster performance compared to fastkmeans. These optimizations allow Flash-KMeans to operate efficiently in scenarios where other libraries, such as FAISS, may encounter memory limitations or slower processing for large datasets.
According to Marktechpost, Flash-KMeans offers a mathematically identical output to standard Lloyd's k-means, with speed improvements derived purely from kernel-level dataflow and optimized data movement.

