scikit-learn KMeans Now Supports Fused Types
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters). This common technique is used in many fields, including image analysis and unsupervised document classification. In scikit-learn, clustering of unlabeled data can be performed with the module sklearn.cluster. However, in the current implementation of scikit-learn, one of the most popular clustering algorithms, KMeans, only supports float64 input data and will therefore implicitly convert other input data types, e.g. float32, into float64, which may cause serious memory waste.
Below, I’ll briefly introduce the KMeans algorithm and go through the work I’ve done during GSoC to make it memory-efficient.
KMeans
KMeans is probably one of the most well-known clustering algorithms, since it is both effective and easy to implement.
To understand the KMeans algorithm, I think it is good to start from the figures below, which clearly illustrate how KMeans works.
Training examples are shown as dots, and cluster centroids are shown as crosses.
- (a) Original dataset.
- (b) Random initial cluster centroids.
- (c-f) Illustration of running two iterations of k-means. In each iteration, we:
  - assign each training example to the closest cluster centroid (shown by “painting” the training examples the same color as the cluster centroid to which it is assigned);
  - move each cluster centroid to the mean of the points assigned to it.
For more details, Andrew Ng’s course notes are a good reference.
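Translated into code, one iteration is just those two steps. Here is a minimal NumPy sketch of a single iteration (illustrative only; the function and variable names are mine, not scikit-learn’s):

import numpy as np

def kmeans_iteration(X, centroids):
    # Step 1: assign each sample to the closest centroid.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 2: move each centroid to the mean of its assigned points.
    # (Empty clusters are not handled in this sketch.)
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids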
In scikit-learn, KMeans implements the algorithm described above, and MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function.
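Both estimators share the same basic API, so switching between them is a one-line change. A quick usage sketch (the cluster count and random data here are arbitrary examples):

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

X = np.random.rand(10000, 20)
km = KMeans(n_clusters=8).fit(X)            # full-batch k-means
mbk = MiniBatchKMeans(n_clusters=8).fit(X)  # mini-batch variant
print(km.cluster_centers_.shape, mbk.cluster_centers_.shape)  # (8, 20) (8, 20)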
Memory-Wasting Issues
However, the original implementations of these two algorithms in scikit-learn are not memory-efficient: they convert input data into np.float64, since the Cython implementation only supports double input data.
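The conversion is easy to observe: on a scikit-learn version without this enhancement, fitting on np.float32 data returns float64 cluster centers, which reveals the internal upcast (and the copy that comes with it):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(1000, 20).astype(np.float32)
km = KMeans(n_clusters=8).fit(X)
# Before the fix, the input is upcast internally to float64:
print(km.cluster_centers_.dtype)  # float64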
Here’s a simple test script which can help us identify the memory-wasting issue:
import numpy as np
from scipy import sparse as sp
from sklearn.cluster import KMeans

@profile  # the decorator is injected by `mprof run`; no import needed
def fit_est():
    estimator.fit(X)

np.random.seed(5)
X = np.random.rand(200000, 20)
# Toggle the following comment to test `np.float32` data
# X = np.float32(X)
# Toggle the following comment to test sparse data
# X = sp.csr_matrix(X)

estimator = KMeans()
fit_est()
You can run
mprof run <script>
mprof plot
(both commands ship with the memory_profiler package) to see the memory profiling results.
To save you the effort, below is the result of memory profiling on my own computer:
[Figure: memory profile of dense np.float32 data vs. dense np.float64 data]
No surprise, these two kinds of input data have the same memory usage, which means that there is a huge waste when we pass np.float32 data into the original KMeans of scikit-learn, because it requires the same memory space as np.float64 data. For the 200000 x 20 array above, that is roughly 32 MB held as float64 where 16 MB of float32 would suffice.
To solve this problem, we can introduce Cython fused types to avoid data copying.
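Fused types are Cython’s version of generics: one definition compiles into a specialized version for each listed type, so a single code path can accept both float and double without upcasting. Here is a minimal sketch of the idea (illustrative only, not the actual scikit-learn source):

cimport numpy as np

ctypedef fused floating:
    np.float32_t
    np.float64_t

def squared_norm(floating[:] x):
    # Cython generates a float32 and a float64 specialization
    # of this function from the single definition below.
    cdef Py_ssize_t i
    cdef floating total = 0
    for i in range(x.shape[0]):
        total += x[i] * x[i]
    return total

Because the fused type covers both widths, np.float32 arrays can flow through the same routine without first being converted (and copied) to np.float64.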
Enhanced Results
After PR #6846, both KMeans and MiniBatchKMeans now support fused types and can therefore use np.float32 data as input directly.
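A quick way to check this is to look at the dtype of the fitted centers, which now matches the input instead of being upcast:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(1000, 20).astype(np.float32)
km = KMeans(n_clusters=8).fit(X)
print(km.cluster_centers_.dtype)  # float32, no more silent upcast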
Below is the memory profiling comparison:
[Figure: dense np.float32 data, before the enhancement]
[Figure: dense np.float32 data, after the enhancement]
[Figure: sparse np.float32 data, before the enhancement]
[Figure: sparse np.float32 data, after the enhancement]
As one can see, introducing Cython fused types drastically reduces the memory usage of KMeans in scikit-learn, which helps us avoid unexpected memory waste.
Note that in the sparse case, I directly convert the dense array used above into a sparse array, which results in higher memory usage, since the sparse format uses more space per nonzero value.
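To see why, one can count the bytes of a CSR matrix built from fully dense data, mirroring the test script above (the figures assume float32 values and int32 indices):

import numpy as np
from scipy import sparse as sp

X = np.random.rand(200000, 20).astype(np.float32)  # ~16 MB dense
X_sp = sp.csr_matrix(X)  # every entry is nonzero, so nothing is saved
sparse_bytes = X_sp.data.nbytes + X_sp.indices.nbytes + X_sp.indptr.nbytes
print(X.nbytes, sparse_bytes)  # CSR also stores a column index per value,
                               # roughly doubling the footprint here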
Summary
Both KMeans and MiniBatchKMeans now support np.float32 input data in a memory-efficient way, so go grab your huge dataset and feed it into scikit-learn’s KMeans now!