Super Vectorizer transforms data to sparse matrices
Actual behavior
The SuperVectorizer’s transform and fit_transform methods follow this rule: “If any result is a sparse matrix, everything will be converted to sparse matrices.” The output is then of type scipy.sparse.csr.csr_matrix.
However, this type is not always accepted for further analysis. For instance, before calling cross_val_score() we first have to convert the result to an array. It also makes it impossible to pass a pipeline directly to cross_val_score(), because an error is raised.
Expected behavior
Sparse matrices appear when the encoded variable has many categories. Maybe introduce a sparse=True parameter, as in sklearn’s OneHotEncoder, that returns a sparse matrix when set to True and an array when set to False.
Easy code to reproduce bug
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.experimental import enable_hist_gradient_boosting
# now you can import the HGBR from ensemble
from sklearn.ensemble import HistGradientBoostingRegressor
from dirty_cat import SuperVectorizer
np.random.seed(444)
col1 = np.random.choice(
    a=[0, 1, 2, 3],
    size=50,
    p=[0.4, 0.3, 0.2, 0.1])
col2 = np.random.choice(
    a=['a', 'b', 'c'],
    size=50,
    p=[0.4, 0.4, 0.2])
results = np.random.uniform(
    size=50)
df = pd.DataFrame(np.array([col1, col2, results])).transpose()
X = df.drop(columns=[2])
y = df[2]
sup_vec = SuperVectorizer()
pipeline = make_pipeline(
    SuperVectorizer(auto_cast=True, sparse_threshold=0.3),
    HistGradientBoostingRegressor()
)
cross_val_score(pipeline, X, y)
Issue Analytics
- State:
- Created a year ago
- Reactions:1
- Comments:9 (9 by maintainers)
Top GitHub Comments
Let’s be consistent with upstream.
I see, you’re right. This is the default value used in sklearn’s ColumnTransformer, which we inherit. It might be interesting to know why they settled on this one!
Wouldn’t it be redundant with sparse_threshold?
Indeed, that’s a problem! Its description is missing from the SuperVectorizer’s docstring. We could either copy-paste the section from the ColumnTransformer’s docstring, or redirect the user to it. I guess our top priority is user experience, so I’d go for the former (copy-paste the missing doc from CT).
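To illustrate what sparse_threshold does upstream, here is a minimal sketch using scikit-learn's ColumnTransformer directly. The toy column and the encode helper are made up for the example. Whether the SuperVectorizer behaves identically is an assumption based on the inheritance mentioned above, not something checked here.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Five categories, four rows each: the one-hot output has density 1/5 = 0.2.
X = pd.DataFrame({"cat": ["a", "b", "c", "d", "e"] * 4})


def encode(sparse_threshold):
    """Hypothetical helper: one-hot encode X with the given sparse_threshold."""
    ct = ColumnTransformer(
        [("ohe", OneHotEncoder(handle_unknown="ignore"), ["cat"])],
        sparse_threshold=sparse_threshold,
    )
    return ct.fit_transform(X)


default_out = encode(0.3)  # density 0.2 is below 0.3, so the output stays sparse
dense_out = encode(0)      # a threshold of 0 always yields a dense array

print(type(default_out).__name__, type(dense_out).__name__)
```

If the same holds for the SuperVectorizer, setting sparse_threshold=0 would cover the use case of the proposed sparse parameter, which is the redundancy raised in the comment.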