
KBinsDiscretizer creates wrong bins

See original GitHub issue

Describe the bug

When binning many identical values, KBinsDiscretizer fails to create the appropriate number of bins and complains:

UserWarning: Bins whose width are too small (i.e., <= 1e-8) in feature 0 are removed.

Steps/Code to Reproduce

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
X = np.array([0,0,0,0,0,1,1,1,2,3]).reshape((-1,1))
bd = KBinsDiscretizer(n_bins=3, strategy="quantile", encode="ordinal")
bd.fit_transform(X).T

Expected Results

  1. No errors or warnings.
  2. The last expression should evaluate to np.array([[0., 0., 0., 0., 0., 1., 1., 1., 2., 2.]])

Actual Results

  1. Warning:

sklearn/preprocessing/_discretization.py:220: UserWarning: Bins whose width are too small (i.e., <= 1e-8) in feature 0 are removed. Consider decreasing the number of bins.

  2. The last expression evaluates to np.array([[0., 0., 0., 0., 0., 1., 1., 1., 1., 1.]])

Versions

System:
    python: 3.9.1 (default, Feb  3 2021, 07:04:15)  [Clang 12.0.0 (clang-1200.0.32.29)]
executable: /.../.virtualenvs/.../bin/python3
   machine: macOS-10.15.7-x86_64-i386-64bit

Python dependencies:
          pip: 21.0.1
   setuptools: 53.0.0
      sklearn: 0.24.1
        numpy: 1.20.1
        scipy: 1.6.0
       Cython: None
       pandas: 1.2.1
   matplotlib: 3.3.4
       joblib: 1.0.0
threadpoolctl: 2.1.0

Built with OpenMP: True

P.S. This could be related to #18638.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

4 reactions
azihna commented, Feb 10, 2021

It seems the problem comes from the strategy argument: the quantile strategy tries to put an equal number of samples in each bin. If you change the strategy to uniform instead, the output matches the one you have shown.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
X = np.array([0,0,0,0,0,1,1,1,2,3]).reshape((-1,1))
bd = KBinsDiscretizer(n_bins=3, 
                      strategy="uniform", 
                      encode="ordinal")
bd.fit_transform(X).T
>>> array([[0., 0., 0., 0., 0., 1., 1., 1., 2., 2.]])
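
A short sketch of where the collapse comes from, computing the quantile edges for the reproduction data directly and inspecting bin_edges_ on fitted discretizers for both strategies; the values noted in the comments are what this data should produce:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 3]).reshape((-1, 1))

# Raw quantile edges for 3 bins: the 0th, 33rd, 67th and 100th percentiles.
# For this data the first two coincide, which is the zero-width bin the warning refers to.
np.percentile(X, [0, 100 / 3, 200 / 3, 100])
# expected: array([0., 0., 1., 3.])

# After fitting, the degenerate edge has been dropped, leaving only two usable bins.
bd_q = KBinsDiscretizer(n_bins=3, strategy="quantile", encode="ordinal").fit(X)
bd_q.bin_edges_   # expected: [array([0., 1., 3.])]

# With the uniform strategy the edges are evenly spaced over [min, max] and stay distinct.
bd_u = KBinsDiscretizer(n_bins=3, strategy="uniform", encode="ordinal").fit(X)
bd_u.bin_edges_   # expected: [array([0., 1., 2., 3.])]
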
1 reaction
sam-s commented, Feb 17, 2021

I am chagrined that you disagree that this is a bug, but I appreciate your candidness. I guess I will have to work around it.

I understand that it is hard to implement; I tried it myself 😭

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

def only_column(df):
    "Return the only column of df or raise an exception."
    if df.shape[1] != 1:
        raise ValueError("only_column", df.shape)
    try:
        return df[df.columns[0]]
    except AttributeError:      # numpy
        return df[:,0]

class PercentileDiscretizer(TransformerMixin, BaseEstimator):
    "Discretize numbers by percentiles."
    def as_dict(self):
        "Convert to a dict for saving to file."
        try:
            return dict(kind=self.__class__.__name__, edges=self.get_edges())
        except AttributeError:  # before fit()
            return dict(kind=self.__class__.__name__)

    def get_edges(self):
        "Extract percentile edges."
        raise NotImplementedError("PercentileDiscretizer.get_edges")

class PercentileDiscretizerNP(PercentileDiscretizer):
    """Discretize numbers by percentiles.
    My implementation using numpy to avoid
    https://github.com/scikit-learn/scikit-learn/issues/19416"""
    def __init__(self, n_bins=10):
        self.n_bins = n_bins
        self.edges = None
        self.right = False

    def __str__(self):
        if self.edges is None:  # before fit()
            return "%s(n=%s)" % (self.__class__.__name__, self.n_bins)
        return "%s(n=%d, e=%d)" % (
            self.__class__.__name__, self.n_bins, len(self.edges))

    def fit(self, X, _y=None):
        "Learn how to discretize the data."
        # https://numpy.org/doc/stable/reference/generated/numpy.percentile.html
        col = only_column(X)
        self.edges = np.unique(np.percentile(
            a=col, q=np.arange(1,self.n_bins)*100/self.n_bins, interpolation="midpoint"))
        if self.edges[0] == col.min():  # => no 0 bin
            self.right = True
        if len(self.edges) < 2 and not self.right:
            raise ValueError("PercentileDiscretizerNP.fit: too few edges", {
                "edges": list(self.edges), "x_shape": X.shape,
                "value_counts": dict(zip(*np.unique(col, return_counts=True)))})
        return self

    def transform(self, X):
        "Apply the discretizer to data."
        # https://numpy.org/doc/stable/reference/generated/numpy.digitize.html
        return np.digitize(x=X, bins=self.edges, right=self.right)

    def get_edges(self):
        if self.edges is None:  # before fit()
            raise AttributeError("PercentileDiscretizerNP.get_edges: call fit first")
        return list(self.edges)

This works well enough for my cases (where the data is either the result of value_counts or a continuous sample).
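
For reference, a quick usage sketch of the workaround against the reproduction data from the report; assuming the class is defined as above, it should yield the three bins the original report expected (as integer labels from np.digitize rather than floats):

X = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 3]).reshape((-1, 1))
disc = PercentileDiscretizerNP(n_bins=3)
disc.fit(X)
disc.transform(X).T
# expected: array([[0, 0, 0, 0, 0, 1, 1, 1, 2, 2]])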

Read more comments on GitHub >

Top Results From Across the Web

KBinsDiscretizer bin edges - python - Stack Overflow
The edges are defined using np.linspace but the assignment is done using np.digitize followed by a np.clip to rein in the right most...
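
A rough sketch of the assignment logic that answer describes for the uniform strategy (illustrative only, not the library's actual implementation): edges from np.linspace, labels from np.digitize, clipped back into range with np.clip.

import numpy as np

X = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 3], dtype=float)
n_bins = 3
edges = np.linspace(X.min(), X.max(), n_bins + 1)            # array([0., 1., 2., 3.])
labels = np.clip(np.digitize(X, edges[1:]), 0, n_bins - 1)
# expected: array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2]), matching the uniform-strategy output above
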
sklearn.preprocessing.KBinsDiscretizer
KBinsDiscretizer might produce constant features (e.g., when encode = 'onehot' and certain bins do not contain any data). These features can be removed...
Intuition for Binning, KBinsDiscretizer - 16: Scikit-learn 13
The video discusses the intuition behind binning and KBinsDiscretizer in Scikit-learn in Python. Timeline (Python 3.8): 00:00 - Outline of ...
Full-on D.S. Approach to Titanic: VotingClassifier | Kaggle
After having created the helper functions, it is quite straight-forward to model ... else: bins = KBinsDiscretizer() dummy = OneHotEncoder(sparse = False, ...
Discretization discretization also known as - Course Hero
Import the KBinsDiscretizer class and create a new instance with three bins, ... let's create an extra binary feature with True for positive values and False for ...
