Super Vectorizer transforms data to sparse matrices

Actual behavior

The SuperVectorizer's transform and fit_transform methods follow this rule: “If any result is a sparse matrix, everything will be converted to sparse matrices.” The output is then of type scipy.sparse.csr.csr_matrix.

However, this type is often not accepted for further analysis. For instance, before calling cross_val_score() we first have to convert the result to an array. This also makes it impossible to pass a pipeline directly to cross_val_score(), as an error is raised.

Expected behavior

Sparse matrices are produced when the encoded variable has many categories. Maybe introduce a sparse=True parameter, as in sklearn's OneHotEncoder, that returns a sparse matrix if set to True and an array if set to False.

Easy code to reproduce bug

import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.experimental import enable_hist_gradient_boosting
# now you can import the HGBR from ensemble
from sklearn.ensemble import HistGradientBoostingRegressor
from dirty_cat import SuperVectorizer

np.random.seed(444) 
col1 = np.random.choice(  
     a=[0, 1, 2, 3],  
     size=50,  
     p=[0.4, 0.3, 0.2, 0.1])

col2 = np.random.choice(  
     a=['a', 'b', 'c'],  
     size=50,  
     p=[0.4, 0.4, 0.2])

results = np.random.uniform( 
     size=50)

df = pd.DataFrame(np.array([col1, col2, results])).transpose()

X = df.drop(columns=[2])
y = df[2]

sup_vec = SuperVectorizer()

pipeline = make_pipeline(
    SuperVectorizer(auto_cast=True, sparse_threshold=0.3),
    HistGradientBoostingRegressor()
)

cross_val_score(pipeline, X, y)

Issue Analytics

Created a year ago
Reactions: 1
Comments: 9 (9 by maintainers)

Top GitHub Comments

GaelVaroquaux commented, Mar 31, 2022

Wouldn't it be redundant with sparse_threshold?

I agree. The question is whether to keep one option or the other.

Let’s be consistent with upstream.

LilianBoulard commented, Mar 31, 2022

However, with this default value of sparse_threshold, the implementation of the pipeline will very often lead to an error. This is because the creation of sparse matrices seems very likely.

I see, you’re right. This is the default value used in sklearn’s ColumnTransformer, which we inherit. It might be interesting to know why they settled on this one!

Another alternative would be to have a sparse= option such as here.

Wouldn’t it be redundant with sparse_threshold?

There is no mention of this option on this page, so maybe we should add it?

Indeed, that’s a problem! Its description is missing from the SuperVectorizer’s docstring. We could either copy-paste the section from the ColumnTransformer’s docstring, or redirect the user to it. I guess our top priority is user experience, so I’d go for the former (copy-paste the missing doc from ColumnTransformer).
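
The sparse_threshold behaviour discussed in these comments can be sketched with scikit-learn's ColumnTransformer directly, since the thread notes that SuperVectorizer inherits from it. This is a minimal illustration on hypothetical toy data, not SuperVectorizer itself: when any transformer returns a sparse matrix, the stacked output stays sparse if its overall density is below sparse_threshold, and a threshold of 0 always produces a dense array.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumption): one categorical column with four levels,
# so the one-hot encoding has a density of 1/4 = 0.25.
df = pd.DataFrame({"cat": ["a", "b", "c", "d"] * 2})


def encode(sparse_threshold):
    """Encode df with a ColumnTransformer using the given sparse_threshold."""
    ct = ColumnTransformer(
        [("onehot", OneHotEncoder(), ["cat"])],
        sparse_threshold=sparse_threshold,
    )
    return ct.fit_transform(df)


default_out = encode(0.3)  # density 0.25 < 0.3, so the output stays sparse
dense_out = encode(0.0)    # a threshold of 0 always yields a dense array

print(type(default_out).__name__, type(dense_out).__name__)
```

If SuperVectorizer forwards sparse_threshold unchanged, as the reproduction code's constructor call suggests, passing sparse_threshold=0 there may be enough to get arrays without a separate sparse= option.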
