Add Sparse Matrix Support For HistGradientBoostingClassifier
Description
Hi!
I’m receiving the error below when attempting to pass a sparse matrix to HistGradientBoostingClassifier. The matrix is the result of using CountVectorizer and TfidfTransformer on input text.
In my case, the size of the text prohibits converting the sparse matrix to a dense one (I run out of memory).
Steps/Code to Reproduce
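import pandas as pd
# On scikit-learn 0.21 to 0.23 the estimator is experimental and must be enabled first.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import HistGradientBoostingClassifier
df = pd.read_csv(...)
vectorizer = CountVectorizer()
tfidf = TfidfTransformer()
clf = HistGradientBoostingClassifier()
# Both transformers return scipy sparse matrices.
vecs = vectorizer.fit_transform(df.loc[:, "very_large_text"])
vecs = tfidf.fit_transform(vecs)
# Fails here: fit() rejects sparse input.
clf.fit(vecs, df.loc[:, "label"])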
Expected Results
No error is thrown.
Actual Results
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
Versions
System:
    python: 3.7.3 (default, Oct 1 2019, 18:28:53) [GCC 5.4.0 20160609]
    executable: /local_disk0/pythonVirtualEnvDirs/virtualEnv-3631eab5-084b-4139-952e-5aff594ac1bb/bin/python
    machine: Linux-4.15.0-1050-azure-x86_64-with-debian-stretch-sid
Python deps:
    pip: 19.0.3
    setuptools: 40.8.0
    sklearn: 0.21.3
    numpy: 1.16.2
    scipy: 1.2.1
    Cython: 0.29.6
    pandas: 0.24.2
Issue Analytics
- Created 4 years ago
- Reactions: 4
- Comments: 5 (4 by maintainers)
For reference, I had noted some implementation suggestions in https://github.com/scikit-learn/scikit-learn/issues/16885
I believe @StealthyKamereon wants to give it a shot.
Regarding the semantics of zeros: we can have a boolean parameter zero_as_missing, as in LightGBM. For a first version this is not necessary, though, and we should treat zeros as literal zeros so that the PR is as small as possible.
Following what you said regarding the semantics of zeros, I think that in addition to the zero_as_missing parameter there should be a categorical_missing_values parameter, which would set the missing values for categorical features. Or maybe something like zero_as: str or list of ndarray of shape (n_cats,), default="missing"
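To make the two readings of zeros concrete, here is a small sketch of what "literal zero" versus "missing" could mean for the entries a sparse matrix does not store. The densify helper and its zero_as_missing flag are hypothetical and only illustrate the semantics on a tiny matrix; this is not how scikit-learn or LightGBM implement it, and densifying is of course not a solution for data that does not fit in memory.

```python
import numpy as np
from scipy import sparse


def densify(X, zero_as_missing=False):
    """Hypothetical helper: expand a sparse matrix into a dense array.

    Unstored entries become 0.0 by default, or NaN (missing) when
    zero_as_missing is True.
    """
    X = sparse.csr_matrix(X, dtype=np.float64)
    if not zero_as_missing:
        # Literal-zero semantics: unstored entries are real zeros.
        return X.toarray()
    # Missing-value semantics: start from NaN and fill in stored entries only.
    dense = np.full(X.shape, np.nan)
    coo = X.tocoo()
    dense[coo.row, coo.col] = coo.data
    return dense


X = sparse.csr_matrix(np.array([[0.0, 2.0], [3.0, 0.0]]))
as_zeros = densify(X)
as_missing = densify(X, zero_as_missing=True)
print(as_zeros)
print(as_missing)
```

Under the first reading the estimator would see 0.0 as an ordinary feature value; under the second, the same entries would go through whatever missing-value handling the estimator has.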