Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters). This common technique is used in many fields, including image analysis and unsupervised document classification. In scikit-learn, clustering of unlabeled data can be performed with the module sklearn.cluster. However, in the current implementation of scikit-learn, one of the most popular clustering algorithm, KMeans, only support float64 input data and will therefore implicitly convert other input data types, e.g., float32, into float64, which may cause seriously memory waste.

Below, I’ll briefly introduce KMeans algorithms and go through the work I’ve done to make it become memory-efficient during GSoC.

KMeans

KMeans is probably one of the most well-knowned clustering algorithm since it is both effective and easy to be implemented.

To understand KMeans algorithm, I think it is good to start from these figures which clearly illustrate how KMeans works.

Training examples are shown as dots, and cluster centroids are shown as crosses.

  • (a) Original dataset.
  • (b) Random initial cluster centroids.
  • (c-f) Illustration of running two iterations of k-means. In each iteration, we

    1. Assign each training example to the closest cluster centroid (shown by “painting” the training examples the same color as the cluster centroid to which is assigned)
    2. Move each cluster centroid to the mean of the points assigned to it.

For more details, Andrew Ng’s course note is a good reference.

In scikit-learn, KMeans implements the algorithm described above, and MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function.

Memory Wasting Issues

However, original implementation of these two algorithms in scikit-learn are not memory-efficient, they will convert input data into np.float64 since Cython implementation only supports double input data.

Here’s a simple test script which can help us identify the memory wasting issues:

import numpy as np
from scipy import sparse as sp
from sklearn.cluster import KMeans

@profile
def fit_est():
    estimator.fit(X)

np.random.seed(5)
X = np.random.rand(200000, 20)

# Toggle the following comment to test `np.float32` data
# X = np.float32(X)

# Toggle the following comment to test sprase data
# X = sp.csr_matrix(X)

estimator = KMeans()
fit_est()

You can run

mprof run <script>
mprof plot

to see the memory profiling results.

To save your effort, below is the result of memory profiling on my own computer:

  • Dense np.float32 data & Dense np.float64 data

No surprise, these two kinds of input data have the same memory usage, which means that there is a huge waste when we pass np.float32 data into original KMeans of scikit-learn because it requires same memory space as np.float64 data.

To solve this problem, we can introduce Cython fused types to avoid data copying.

Enhanced Results

After PR #6846, now both KMeans and MiniBatchKMeans support fused types and can therefore use np.float32 data as input directly.

Below are the memory profiling results comparison:

  • Dense np.float32 data
    • Before enhancement:
  • Dense np.float32 data
    • After enhancement:
  • Sparse np.float32 data
    • Before enhancement:
  • Sparse np.float32 data
    • After enhancement:

As one can see, introducing Cython fused types drastically reduces the memory usage of KMeans in scikit-learn, which can help us avoid unexpected memory waste.

Note that in the sparse case, I directly transform the dense array used before into sparse array, which will result in higher memory usage since sparse format uses more space per nonzero value.

Summary

Both KMeans and MiniBatchKMeans now support np.float32 input data in a memory-efficient way, go grab your huge dataset and feed them into KMeans of scikit-learn now!