Add Sparse Matrix Support For HistGradientBoostingClassifier

Description

Hi!

I’m receiving the error below when attempting to pass a sparse matrix to HistGradientBoostingClassifier. The matrix is the result of using CountVectorizer and TfidfTransformer on input text.

In my case, the size of the text prohibits converting the sparse matrix to a dense one (I run out of memory).

Steps/Code to Reproduce

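import pandas as pd

# On scikit-learn < 1.0 (0.21.3 is used here) the estimator is experimental
# and has to be enabled explicitly before it can be imported.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

df = pd.read_csv(...)

vectorizer = CountVectorizer()
tfidf = TfidfTransformer()
clf = HistGradientBoostingClassifier()

# Both transformers return scipy sparse matrices.
vecs = vectorizer.fit_transform(df.loc[:, "very_large_text"])
vecs = tfidf.fit_transform(vecs)

# Raises the TypeError shown below because the input is sparse.
clf.fit(vecs, df.loc[:, "label"])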

Expected Results

No error is thrown.

Actual Results

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Versions

System:
    python: 3.7.3 (default, Oct 1 2019, 18:28:53) [GCC 5.4.0 20160609]
    executable: /local_disk0/pythonVirtualEnvDirs/virtualEnv-3631eab5-084b-4139-952e-5aff594ac1bb/bin/python
    machine: Linux-4.15.0-1050-azure-x86_64-with-debian-stretch-sid

Python deps:
    pip: 19.0.3
    setuptools: 40.8.0
    sklearn: 0.21.3
    numpy: 1.16.2
    scipy: 1.2.1
    Cython: 0.29.6
    pandas: 0.24.2

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 4
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

NicolasHug commented, Dec 23, 2020 (1 reaction)

For reference, I had noted some implementation suggestions in https://github.com/scikit-learn/scikit-learn/issues/16885

I believe @StealthyKamereon wants to give it a shot.

Regarding the semantics of zeros: we could have a boolean parameter zero_as_missing, as LightGBM does. This is not necessary for a first version, though, and we should treat zeros as literal zeros to keep the PR as small as possible.
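
The two readings of a zero mentioned above can be illustrated on a tiny scipy sparse matrix. This is only a sketch of the semantics, not of any scikit-learn implementation: zero_as_missing is the name proposed in the comment and is not an existing HistGradientBoostingClassifier parameter, and the arrays are densified here purely so the difference is visible.

```python
import numpy as np
from scipy import sparse

# A 2x3 matrix with only two stored (non-zero) entries.
X = sparse.csr_matrix(np.array([[0.0, 2.0, 0.0],
                                [1.0, 0.0, 0.0]]))
stored_entries = X.nnz

# Reading 1: unstored entries are literal zeros (the proposed first version).
as_zeros = X.toarray()

# Reading 2: unstored entries are missing values (the hypothetical
# zero_as_missing=True behaviour), represented here as NaN.
as_missing = as_zeros.copy()
as_missing[as_missing == 0] = np.nan
```

Under the first reading a tree can split on "feature equals 0" like any other value; under the second, those entries would follow whatever missing-value handling the estimator applies.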

StealthyKamereon commented, Dec 23, 2020 (0 reactions)

Following what you said about the semantics of zeros, I think that in addition to the zero_as_missing parameter there should be a categorical_missing_values parameter that sets the missing values for categorical features. Or maybe something like zero_as: str or list of ndarray of shape (n_cats,), default="missing"
