KBinsDiscretizer creates wrong bins
Describe the bug
When binning data that contains many identical values, KBinsDiscretizer fails to create the requested number of bins and instead warns:
UserWarning: Bins whose width are too small (i.e., <= 1e-8) in feature 0 are removed.
Steps/Code to Reproduce
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
X = np.array([0,0,0,0,0,1,1,1,2,3]).reshape((-1,1))
bd = KBinsDiscretizer(n_bins=3, strategy="quantile", encode="ordinal")
bd.fit_transform(X).T
Expected Results
- No errors or warnings.
- The last expression should evaluate to
np.array([[0., 0., 0., 0., 0., 1., 1., 1., 2., 2.]])
Actual Results
- Warning:
sklearn/preprocessing/_discretization.py:220: UserWarning: Bins whose width are too small (i.e., <= 1e-8) in feature 0 are removed. Consider decreasing the number of bins.
- The last expression evaluates to
np.array([[0., 0., 0., 0., 0., 1., 1., 1., 1., 1.]])
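The warning can be traced to how the quantile strategy computes bin edges. Roughly (a sketch, not the exact library code in _discretization.py), it places edges at evenly spaced quantiles of the column; with this many ties, two consecutive edges coincide, producing a zero-width bin that gets removed:

```python
import numpy as np

X = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 3])

# quantile strategy (sketch): n_bins=3 -> 4 edges at quantiles 0, 1/3, 2/3, 1
edges = np.quantile(X, np.linspace(0, 1, 4))
print(edges)           # [0. 0. 1. 3.] -- the first two edges coincide
print(np.diff(edges))  # [0. 1. 2.]    -- a zero-width bin, hence the warning
```

After the zero-width bin is dropped, only the edges [0, 1, 3] remain, which is why everything >= 1 lands in the last surviving bin in the actual output above.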
Versions
System:
python: 3.9.1 (default, Feb 3 2021, 07:04:15) [Clang 12.0.0 (clang-1200.0.32.29)]
executable: /.../.virtualenvs/.../bin/python3
machine: macOS-10.15.7-x86_64-i386-64bit
Python dependencies:
pip: 21.0.1
setuptools: 53.0.0
sklearn: 0.24.1
numpy: 1.20.1
scipy: 1.6.0
Cython: None
pandas: 1.2.1
matplotlib: 3.3.4
joblib: 1.0.0
threadpoolctl: 2.1.0
Built with OpenMP: True
P.S. This could be related to #18638.
Issue Analytics
- State:
- Created 3 years ago
- Comments: 13 (13 by maintainers)
Top GitHub Comments
It seems like the problem is because of the strategy argument. The quantile strategy tries to have an equal number of samples in each bin. If you change the strategy to uniform instead, it matches the output you have shown.

I am chagrined that you disagree that this is a bug, but I appreciate your candidness. I guess I will have to work around it.
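To illustrate the suggestion above, here is a sketch of the same reproduction with strategy="uniform". Equal-width bins over [min, max] give edges [0, 1, 2, 3], and the output matches the reporter's expectation:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 3]).reshape((-1, 1))

# uniform strategy: equal-width bins over [min, max] -> edges [0, 1, 2, 3]
bd = KBinsDiscretizer(n_bins=3, strategy="uniform", encode="ordinal")
print(bd.fit_transform(X).T)  # [[0. 0. 0. 0. 0. 1. 1. 1. 2. 2.]]
```

Note this only happens to agree with the expected output for this particular X; uniform and quantile binning answer different questions in general.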
I understand that it is hard to implement, I tried it myself 😭
This works well enough for my cases (where the data is either the result of value_counts or a continuous sample).
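The commenter's actual workaround isn't shown above. One possible approach (an assumption on my part, not the commenter's code) is to compute the quantile edges manually, drop the duplicates, and digitize against the surviving edges, which avoids the warning while matching scikit-learn's binning convention:

```python
import numpy as np

# Hypothetical workaround (not the commenter's actual code): compute quantile
# edges manually, drop duplicate edges, then digitize against the survivors.
X = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 3])

edges = np.quantile(X, np.linspace(0, 1, 4))  # [0., 0., 1., 3.]
edges = np.unique(edges)                      # [0., 1., 3.] -- duplicates dropped
binned = np.searchsorted(edges[1:-1], X, side="right")
print(binned)  # [0 0 0 0 0 1 1 1 1 1] -- 2 effective bins, no warning
```

This yields fewer bins than requested when the data has heavy ties, which is the same trade-off KBinsDiscretizer makes, just without the warning.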