
KBinsDiscretizer creates wrong bins

See original GitHub issue

Describe the bug

When binning many identical values, KBinsDiscretizer fails to create the appropriate number of bins and complains:

UserWarning: Bins whose width are too small (i.e., <= 1e-8) in feature 0 are removed.

Steps/Code to Reproduce

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
X = np.array([0,0,0,0,0,1,1,1,2,3]).reshape((-1,1))
bd = KBinsDiscretizer(n_bins=3, strategy="quantile", encode="ordinal")
bd.fit_transform(X).T

Expected Results

  1. No errors or warnings.
  2. The last expression should evaluate to np.array([[0., 0., 0., 0., 0., 1., 1., 1., 2., 2.]])

Actual Results

  1. Warning:

sklearn/preprocessing/_discretization.py:220: UserWarning: Bins whose width are too small (i.e., <= 1e-8) in feature 0 are removed. Consider decreasing the number of bins.

  2. The last expression evaluates to np.array([[0., 0., 0., 0., 0., 1., 1., 1., 1., 1.]])

Versions

System:
    python: 3.9.1 (default, Feb  3 2021, 07:04:15)  [Clang 12.0.0 (clang-1200.0.32.29)]
executable: /.../.virtualenvs/.../bin/python3
   machine: macOS-10.15.7-x86_64-i386-64bit

Python dependencies:
          pip: 21.0.1
   setuptools: 53.0.0
      sklearn: 0.24.1
        numpy: 1.20.1
        scipy: 1.6.0
       Cython: None
       pandas: 1.2.1
   matplotlib: 3.3.4
       joblib: 1.0.0
threadpoolctl: 2.1.0

Built with OpenMP: True

P.S. This could be related to #18638.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

4 reactions
azihna commented, Feb 10, 2021

It seems the problem comes from the strategy argument: the quantile strategy tries to put an equal number of samples in each bin. If you change the strategy to uniform instead, the output matches the one you have shown.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
X = np.array([0,0,0,0,0,1,1,1,2,3]).reshape((-1,1))
bd = KBinsDiscretizer(n_bins=3, 
                      strategy="uniform", 
                      encode="ordinal")
bd.fit_transform(X).T
>>> array([[0., 0., 0., 0., 0., 1., 1., 1., 2., 2.]])
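
A short sketch of where the collapse comes from, computing the quantile edges for the reproduction data directly and inspecting bin_edges_ on fitted discretizers for both strategies; the values noted in the comments are what this data should produce:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 3]).reshape((-1, 1))

# Raw quantile edges for 3 bins: the 0th, 33rd, 67th and 100th percentiles.
# For this data the first two coincide, which is the zero-width bin the warning refers to.
np.percentile(X, [0, 100 / 3, 200 / 3, 100])
# expected: array([0., 0., 1., 3.])

# After fitting, the degenerate edge has been dropped, leaving only two usable bins.
bd_q = KBinsDiscretizer(n_bins=3, strategy="quantile", encode="ordinal").fit(X)
bd_q.bin_edges_   # expected: [array([0., 1., 3.])]

# With the uniform strategy the edges are evenly spaced over [min, max] and stay distinct.
bd_u = KBinsDiscretizer(n_bins=3, strategy="uniform", encode="ordinal").fit(X)
bd_u.bin_edges_   # expected: [array([0., 1., 2., 3.])]
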
1 reaction
sam-s commented, Feb 17, 2021

I am chagrined that you disagree that this is a bug, but I appreciate your candidness. I guess I will have to work around it.

I understand that it is hard to implement; I tried it myself 😭

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

def only_column(df):
    "Return the only column of df or raise an exception."
    if df.shape[1] != 1:
        raise ValueError("only_column", df.shape)
    try:
        return df[df.columns[0]]
    except AttributeError:      # numpy
        return df[:,0]

class PercentileDiscretizer(TransformerMixin, BaseEstimator):
    "Discretize numbers by percentiles."
    def as_dict(self):
        "Convert to a dict for saving to file."
        try:
            return dict(kind=self.__class__.__name__, edges=self.get_edges())
        except AttributeError:  # before fit()
            return dict(kind=self.__class__.__name__)

    def get_edges(self):
        "Extract percentile edges."
        raise NotImplementedError("PercentileDiscretizer.get_edges")

class PercentileDiscretizerNP(PercentileDiscretizer):
    """Discretize numbers by percentiles.
    My implementation using numpy to avoid
    https://github.com/scikit-learn/scikit-learn/issues/19416"""
    def __init__(self, n_bins=10):
        self.n_bins = n_bins
        self.edges = None
        self.right = False

    def __str__(self):
        if self.edges is None:  # before fit()
            return "%s(n=%s)" % (self.__class__.__name__, self.n_bins)
        return "%s(n=%d, e=%d)" % (
            self.__class__.__name__, self.n_bins, len(self.edges))

    def fit(self, X, _y=None):
        "Learn how to discretize the data."
        # https://numpy.org/doc/stable/reference/generated/numpy.percentile.html
        col = only_column(X)
        self.edges = np.unique(np.percentile(
            a=col, q=np.arange(1,self.n_bins)*100/self.n_bins, interpolation="midpoint"))
        if self.edges[0] == col.min():  # => no 0 bin
            self.right = True
        if len(self.edges) < 2 and not self.right:
            raise ValueError("PercentileDiscretizerNP.fit: too few edges", {
                "edges": list(self.edges), "x_shape": X.shape,
                "value_counts": dict(zip(*np.unique(col, return_counts=True)))})
        return self

    def transform(self, X):
        "Apply the discretizer to data."
        # https://numpy.org/doc/stable/reference/generated/numpy.digitize.html
        return np.digitize(x=X, bins=self.edges, right=self.right)

    def get_edges(self):
        if self.edges is None:  # before fit()
            raise AttributeError("PercentileDiscretizerNP.get_edges: call fit first")
        return list(self.edges)

This works well enough for my cases (where the data is either the result of value_counts or a continuous sample).
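
For reference, a quick usage sketch of the workaround against the reproduction data from the report; assuming the class is defined as above, it should yield the three bins the original report expected (as integer labels from np.digitize rather than floats):

X = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 3]).reshape((-1, 1))
disc = PercentileDiscretizerNP(n_bins=3)
disc.fit(X)
disc.transform(X).T
# expected: array([[0, 0, 0, 0, 0, 1, 1, 1, 2, 2]])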

Read more comments on GitHub >

Top Results From Across the Web

KBinsDiscretizer bin edges - python - Stack Overflow
The edges are defined using np.linspace but the assignment is done using np.digitize followed by a np.clip to rein in the right most...
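
A rough sketch of the assignment logic that answer describes for the uniform strategy (illustrative only, not the library's actual implementation): edges from np.linspace, labels from np.digitize, clipped back into range with np.clip.

import numpy as np

X = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 3], dtype=float)
n_bins = 3
edges = np.linspace(X.min(), X.max(), n_bins + 1)            # array([0., 1., 2., 3.])
labels = np.clip(np.digitize(X, edges[1:]), 0, n_bins - 1)
# expected: array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2]), matching the uniform-strategy output above
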
sklearn.preprocessing.KBinsDiscretizer
KBinsDiscretizer might produce constant features (e.g., when encode = 'onehot' and certain bins do not contain any data). These features can be removed...
Intuition for Binning, KBinsDiscretizer - 16: Scikit-learn 13
The video discusses the intuition behind binning and KBinsDiscretizer in Scikit-learn in Python. Timeline (Python 3.8): 00:00 - Outline of ...
Full-on D.S. Approach to Titanic: VotingClassifier | Kaggle
After having created the helper functions, it is quite straight-forward to model ... else: bins = KBinsDiscretizer() dummy = OneHotEncoder(sparse = False, ...
Discretization discretization also known as - Course Hero
Import the KBinsDiscretizer class and create a new instance with three bins, ... let's create an extra binary feature with True for positive values and False for ...
