
DOC Improve doc of n_quantiles in QuantileTransformer

See original GitHub issue

Description

The QuantileTransformer uses numpy.percentile(X_train, .) as the estimator of the quantile function of the training data. To know this function perfectly, it suffices to take n_quantiles=n_samples; the rest is just the linear interpolation that is done in the code afterwards. Therefore I don’t think users should be able to choose n_quantiles > n_samples, and we should prevent them from thinking that the higher n_quantiles, the better the transformation. As mentioned by @GaelVaroquaux in person, it can however be relevant to choose n_quantiles < n_samples when n_samples is very large.
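
To see why n_quantiles=n_samples is enough: with the default linear interpolation, the percentiles computed at n_samples evenly spaced probabilities are exactly the sorted training samples, so the estimator is known at every breakpoint. A small sanity check (variable names are mine, not from the codebase):

import numpy as np

n_samples = 5
rng = np.random.RandomState(0)
x = rng.randn(n_samples)

# With one landmark per sample, the landmarks are the sorted samples:
# the percentile at 100*k/(n_samples - 1) is exactly the k-th order statistic.
references = np.linspace(0, 100, n_samples)
landmarks = np.percentile(x, references)
np.testing.assert_allclose(landmarks, np.sort(x))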

I suggest adding more information on the impact of n_quantiles to the doc, which currently reads:

Number of quantiles to be computed. It corresponds to the number of
landmarks used to discretize the cumulative distribution function.

For example, using 100 times more landmarks results in the same transformation:

import numpy as np
from sklearn.preprocessing import QuantileTransformer

n_samples = 100
rng = np.random.RandomState(0)  # seed for reproducibility
X_train = rng.randn(n_samples, 2)
X_test = rng.randn(1000, 2)

# One landmark per training sample: the exact empirical quantile function.
qf_1 = QuantileTransformer(n_quantiles=n_samples)
qf_1.fit(X_train)
X_trans_1 = qf_1.transform(X_test)

# 100 times more landmarks, yet exactly the same transformation.
qf_2 = QuantileTransformer(n_quantiles=10000)
qf_2.fit(X_train)
X_trans_2 = qf_2.transform(X_test)

np.testing.assert_allclose(X_trans_1, X_trans_2)
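
The reason this works is a grid alignment, not a general rule: numpy.percentile(X_train, .) is piecewise linear in the probability with breakpoints at k/(n_samples - 1) = k/99, and because 9999 = 101 * 99 the reference grid np.linspace(0, 1, 10000) still contains every breakpoint, so interpolating through the finer landmarks reproduces the same curve. A quick check of the two grids (my own variable names):

import numpy as np

grid_100 = np.linspace(0, 1, 100)      # breakpoints of the empirical percentile curve
grid_10000 = np.linspace(0, 1, 10000)  # landmarks used with n_quantiles=10000

# 9999 = 101 * 99, so every k/99 equals (101*k)/9999 and lies on the fine grid.
assert 9999 % 99 == 0
dist = np.min(np.abs(grid_100[:, None] - grid_10000[None, :]), axis=1)
assert (dist < 1e-12).all()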

Interestingly, if n_quantiles > n_samples is not chosen so that the finer grid of landmarks still contains the breakpoints of the empirical percentile curve, the linear interpolation done afterwards does not correspond to the numpy.percentile(X_train, .) estimator. This is not “wrong”, as these are all just estimators of the true quantile function/CDF, but I think it is confusing and it would be better to stick with the original estimator. For instance, the following raises an AssertionError.

import numpy as np
from sklearn.preprocessing import QuantileTransformer

n_samples = 100
rng = np.random.RandomState(0)  # seed for reproducibility
X_train = rng.randn(n_samples, 2)
X_test = rng.randn(1000, 2)

# n_quantiles = n_samples: landmarks sit exactly on the sorted training samples.
qf_1 = QuantileTransformer(n_quantiles=n_samples)
qf_1.fit(X_train)
X_trans_1 = qf_1.transform(X_test)

# 200 landmarks miss the original percentile breakpoints, so the
# interpolated transform differs from the numpy.percentile estimator.
qf_2 = QuantileTransformer(n_quantiles=200)
qf_2.fit(X_train)
X_trans_2 = qf_2.transform(X_test)

# Fails with the scikit-learn of the time; newer releases cap
# n_quantiles at n_samples, which makes the two transforms agree.
np.testing.assert_allclose(X_trans_1, X_trans_2)
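
And conversely for n_quantiles=200: since gcd(199, 99) = 1, none of the interior breakpoints k/99 fall on np.linspace(0, 1, 200), so the interpolant cuts the corners of the true percentile curve and the transforms disagree. The same kind of check (again my own variable names):

import numpy as np

grid_100 = np.linspace(0, 1, 100)  # breakpoints of the empirical percentile curve
grid_200 = np.linspace(0, 1, 200)  # landmarks used with n_quantiles=200

# gcd(199, 99) == 1, so apart from the endpoints 0 and 1 no breakpoint
# k/99 falls on the n_quantiles=200 grid.
dist = np.min(np.abs(grid_100[:, None] - grid_200[None, :]), axis=1)
assert (dist < 1e-12).sum() == 2  # only the two endpoints are shared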

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
GaelVaroquaux commented, Feb 28, 2019

Therefore I don’t think users should be able to choose n_quantiles > n_samples, and we should prevent them from thinking that the higher n_quantiles, the better the transformation.

+1 for dynamically downgrading n_quantiles to “self.n_quantiles_ = min(n_quantiles, n_samples)”, maybe with a warning.

However, -1 for raising an error: people might not know in advance how large the sample is.
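
For reference, a minimal sketch of what such dynamic downgrading could look like, assuming a hypothetical helper inside fit (names and warning message are mine; the eventual change may differ):

import warnings

def _effective_n_quantiles(n_quantiles, n_samples):
    # Hypothetical helper: cap n_quantiles at n_samples and warn
    # instead of raising an error.
    if n_quantiles > n_samples:
        warnings.warn(
            "n_quantiles (%d) is greater than the number of samples (%d); "
            "setting n_quantiles to n_samples." % (n_quantiles, n_samples)
        )
    return max(1, min(n_quantiles, n_samples))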

0 reactions
albertcthomas commented, Feb 28, 2019

Sounds good! I will open a PR.

Read more comments on GitHub >

Top Results From Across the Web

sklearn.preprocessing.QuantileTransformer
Number of quantiles to be computed. It corresponds to the number of landmarks used to discretize the cumulative distribution function. If n_quantiles is...

pyts.preprocessing.QuantileTransformer
The cumulative distribution function of a feature is used to project the original values. Note that this transform is non-linear. Parameters: n_quantiles :...

dask_ml.preprocessing.QuantileTransformer - Dask-ML
This implementation differs from the scikit-learn implementation by using approximate quantiles. The scikit-learn docstring follows. This method transforms the ...

How to Use Quantile Transforms for Machine Learning
Quantile transforms are a technique for transforming numerical input or output variables to have a Gaussian or uniform probability distribution.

Data Preparation for Machine Learning - EBIN.PUB
Some of these names may better fit as sub-tasks for the broader data ... API. https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html
