scikitlearn Sparse Utils Now Support Fused Types
Dealing with sparse data is fairly common when we are anlyzing large datasets. However, sparse function utilities in scikitlearn only support float64
currently and will therefore implicitly convert other input data types, e.g., float32
, into float64
, which may cause unexpected memory error. Since Cython fused types allow us to have one type definition that can refer to multiple types, we can solve potential memory wasting issue by substituting float64
with Cython fused types.
Below, I’ll briefly introduce sparse function utilities in scikitlearn and describe the work I’ve done to enhance it during GSoC.
Sparse Function Utilities
In scikitlearn, sparse data is often represented as scipy.sparse.csr_matrix
or scipy.sparse.csc_matrix
. However, these two matrices do not provide builtin methods to calculate important statistics such as L2 norm and variance, which are useful when we are playing with data. Therefore, scikitlearn leverages on sparsefuncs_fast.pyx
which defines some helper methods for sparse matrices to handle sparse data more conveniently throughout the project.
Memory Wasting Issue
However, original implementation of sparse function utilities in scikitlearn is not memoryefficient.
Let’s take a simple function which do not use Cython fused types and will calculate L2 norm of a CSR matrix X
as example:
def csr_row_norms(X):
"""L2 norm of each row in CSR matrix X."""
# 1
cdef:
unsigned int n_samples = X.shape[0]
unsigned int n_features = X.shape[1]
np.ndarray[np.float64_t, ndim=1, mode="c"] norms
np.ndarray[np.float64_t, ndim=1, mode="c"] data
np.ndarray[int, ndim=1, mode="c"] indices = X.indices
np.ndarray[int, ndim=1, mode="c"] indptr = X.indptr
np.npy_intp i, j
double sum_
# 2
norms = np.zeros(n_samples, dtype=np.float64)
# 3
data = np.asarray(X.data, dtype=np.float64) # Warning: might copy!
# 4
for i in range(n_samples):
sum_ = 0.0
for j in range(indptr[i], indptr[i + 1]):
sum_ += data[j] * data[j]
norms[i] = sum_
np.sqrt(norms)
return norms

Declare Cython’s statictyped (in contrast to Python’s dynamictyped) variables to store attributes of the input CSR matrix X since statictyped variables can accelerate the computation a lot.

Initialize
norms
with 0s. 
Since we’ve already used
cdef
to declaredata
as anp.ndarray
which containsnp.float64_t
element in step 1, data of X need to be converted into typenp.float64
if it belongs to other data types. 
Calculate the squared sum of each row and then take squared root of it to get L2 norm.
As illustrated above, we can see that STEP 3 IS DANGEROUS because converting type of data may implicitly copy the the data and then cause memory error unexpectedly. To see how it will affect the memory space, we can use memory_profiler to monitor memory usage.
Here is the result of memory profiling if we pass a scipy.sparse.csr_matrix
with np.float32
element into our example function:
It is abvious that memory usage increase dramatically because step 3 copies the data so as to convert it from np.float32
to np.float64
.
To solve this problem, we can introduce Cython fused types to avoid data copying. But firstly, let’s take a brief look at Cython fused types.
Cython Fused Types
Here is official page’s clear introduction for fused types:
Fused types allow you to have one type definition that can refer to multiple types. This allows you to write a single statictyped cython algorithm that can operate on values of multiple types. Thus fused types allow generic programming and are akin to templates in C++ or generics in languages like Java / C#.
Note that Cython fused types are specialized at compile time, and are constant for a particular function invocation.
By adopting Cython fused types, our function can accept multiple types and therefore doesn’t need to do datatype conversion.
Common Pitfalls
Intuitively, in order to integrate Cython fused types to solve the memory issue described above, we will delete step 3 and change step 1 in our function as follows:
# 1
cdef:
unsigned int n_samples = X.shape[0]
unsigned int n_features = X.shape[1]
# Change type from np.float64_t to floating
np.ndarray[floating, ndim=1, mode="c"] norms
np.ndarray[floating, ndim=1, mode="c"] data = X.data
np.ndarray[int, ndim=1, mode="c"] indices = X.indices
np.ndarray[int, ndim=1, mode="c"] indptr = X.indptr
np.npy_intp i, j
double sum_
However, above changes will cause Cython compile error Invalid use of fused types, type cannot be specialized
.
It seems that Cython doesn’t allow us to declare fused types variable and then assign value to it within a function if this function doesn’t accept any argument that has type involves the same fused types. Hence, we need to introduce a implementation trick here.
Enhanced Implementation
The trick I used here is to define a wrapper function and make its underlying implementation function accept fused types arguments. The reason behind this is mentioned above:
If a function accepts some argument that has a particular fused type, it can use cdef
to declare and init variable with that particular fused type within its scope.
Code of enhanced implementation is showed below:
# Wrapper function
def csr_row_norms(X):
"""L2 norm of each row in CSR matrix X."""
return _csr_row_norms(X.data, X.shape, X.indices, X.indptr)
# Underlying implementation function
def _csr_row_norms(np.ndarray[floating, ndim=1, mode="c"] X_data,
shape,
np.ndarray[int, ndim=1, mode="c"] X_indices,
np.ndarray[int, ndim=1, mode="c"] X_indptr):
cdef:
unsigned int n_samples = shape[0]
unsigned int n_features = shape[1]
np.ndarray[DOUBLE, ndim=1, mode="c"] norms
np.npy_intp i, j
double sum_
norms = np.zeros(n_samples, dtype=np.float64)
for i in range(n_samples):
sum_ = 0.0
for j in range(X_indptr[i], X_indptr[i + 1]):
sum_ += X_data[j] * X_data[j]
norms[i] = sum_
return norms
Finally, to verify our enhancement, here is the result of memory profiling if we pass a scipy.sparse.csr_matrix
with np.float32
element into our enhamced function:
Cool! As what figure shows, our function no longer copy the data anymore.
Summary
All of the functions in sparsefuncs_fast.pyx
now support Cython fused types! Great thanks to all of the reviewers and their useful opinions.
In the next few weeks, my goal is to work on clustering algorithms such as KMeans in scikitlearn so as to make it also support Cython fused types.