Super Vectorizer transforms data to sparse matrices

Actual behavior

The SuperVectorizer's transform and fit_transform methods follow this rule: “If any result is a sparse matrix, everything will be converted to sparse matrices.” The returned object is then of type scipy.sparse.csr.csr_matrix.

However, this type is not accepted by every downstream estimator. For instance, in the example below, the result must first be converted to a dense array before cross_val_score() can be applied. This also makes it impossible to pass a pipeline containing the SuperVectorizer directly to cross_val_score(), because an error is raised.

Expected behavior

Sparse matrices appear when the encoded variable has many categories. One option would be to introduce a sparse=True parameter, as in scikit-learn's OneHotEncoder, that returns a sparse matrix when set to True and a dense array when set to False.

Minimal code to reproduce the bug

import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
# Enables the import of HistGradientBoostingRegressor from sklearn.ensemble
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
from dirty_cat import SuperVectorizer

np.random.seed(444)

# Low-cardinality integer column
col1 = np.random.choice(
    a=[0, 1, 2, 3],
    size=50,
    p=[0.4, 0.3, 0.2, 0.1])

# Low-cardinality string column
col2 = np.random.choice(
    a=['a', 'b', 'c'],
    size=50,
    p=[0.4, 0.4, 0.2])

# Target values
results = np.random.uniform(
    size=50)

# Stacking numbers and strings in one array turns every value into a string
df = pd.DataFrame(np.array([col1, col2, results])).transpose()

X = df.drop(columns=[2])
y = df[2]

pipeline = make_pipeline(
    SuperVectorizer(auto_cast=True, sparse_threshold=0.3),
    HistGradientBoostingRegressor()
)

# Raises the error described above when the vectorizer returns a sparse matrix
cross_val_score(pipeline, X, y)

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
GaelVaroquaux commented, Mar 31, 2022
> Wouldn't it be redundant with sparse_threshold?

I agree. The question is whether to keep one option or the other.

Let’s be consistent with upstream.

1 reaction
LilianBoulard commented, Mar 31, 2022

> However, with this default value of sparse_threshold, the implementation of the pipeline will very often lead to an error. This is because the creation of sparse matrices seems very likely.

I see, you’re right. This is the default value used in sklearn’s ColumnTransformer, which we inherit. It might be interesting to know why they settled on this one!

> Another alternative would be to have a sparse= option such as here.

Wouldn’t it be redundant with sparse_threshold?

> there is no mention of this option on this page, so we should maybe add it?

Indeed, that’s a problem! Its description is missing from the SuperVectorizer’s docstring. We could either copy-paste the section from the ColumnTransformer’s docstring, or redirect the user there. I guess our top priority is user experience, so I’d go for the former (copy-paste the missing documentation from the ColumnTransformer).
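
To illustrate the sparse_threshold option discussed in these comments, here is a small sketch that uses scikit-learn's ColumnTransformer directly, the class the SuperVectorizer is said above to inherit from. The toy data and the encode helper are assumptions made for the example, and the SuperVectorizer itself is not used.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumption): 20 categories, each appearing 10 times, plus a numeric column.
df = pd.DataFrame({
    "cat": [f"c{i % 20}" for i in range(200)],
    "num": np.linspace(0.1, 1.0, 200),
})


def encode(sparse_threshold):
    """Hypothetical helper: one-hot encode "cat" and pass "num" through."""
    ct = ColumnTransformer(
        [("onehot", OneHotEncoder(), ["cat"])],
        remainder="passthrough",
        sparse_threshold=sparse_threshold,
    )
    return ct.fit_transform(df)


# Each row has 2 non-zero values out of 21 columns, so the overall density
# (about 0.095) is below the default threshold and the output stays sparse.
default_out = encode(0.3)

# A threshold of 0 makes the ColumnTransformer always return a dense array.
dense_out = encode(0.0)

print(sparse.issparse(default_out), isinstance(dense_out, np.ndarray))
```

Since the stacked output is sparse only when a transformer returns sparse data and the overall density is below the threshold, setting the threshold to 0 is one way to obtain dense output without a separate sparse= flag.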
