
DOC Improve doc of n_quantiles in QuantileTransformer

See original GitHub issue

Description

The QuantileTransformer uses numpy.percentile(X_train, .) as the estimator of the quantile function of the training data. To know this function perfectly, it suffices to take n_quantiles=n_samples; the rest is just the linear interpolation that is done in the code afterwards. Therefore I don’t think users should be able to choose n_quantiles > n_samples, and we should prevent them from thinking that the higher n_quantiles, the better the transformation. As mentioned by @GaelVaroquaux in person, it can however be relevant to choose n_quantiles < n_samples when n_samples is very large.
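
To see why n_quantiles=n_samples is enough: with the default linear interpolation, the percentiles computed at n_samples evenly spaced probabilities are exactly the sorted training samples, so the estimator is known at every breakpoint. A small sanity check (variable names are mine, not from the codebase):

import numpy as np

n_samples = 5
rng = np.random.RandomState(0)
x = rng.randn(n_samples)

# With one landmark per sample, the landmarks are the sorted samples:
# the percentile at 100*k/(n_samples - 1) is exactly the k-th order statistic.
references = np.linspace(0, 100, n_samples)
landmarks = np.percentile(x, references)
np.testing.assert_allclose(landmarks, np.sort(x))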

I suggest adding more information on the impact of n_quantiles to the doc, which currently reads:

Number of quantiles to be computed. It corresponds to the number of
landmarks used to discretize the cumulative distribution function.

For example, using 100 times more landmarks results in the same transformation:

import numpy as np
from sklearn.preprocessing import QuantileTransformer

n_samples = 100
rng = np.random.RandomState(0)  # seed for reproducibility
X_train = rng.randn(n_samples, 2)
X_test = rng.randn(1000, 2)

# One landmark per training sample: the exact empirical quantile function.
qf_1 = QuantileTransformer(n_quantiles=n_samples)
qf_1.fit(X_train)
X_trans_1 = qf_1.transform(X_test)

# 100 times more landmarks, yet exactly the same transformation.
qf_2 = QuantileTransformer(n_quantiles=10000)
qf_2.fit(X_train)
X_trans_2 = qf_2.transform(X_test)

np.testing.assert_allclose(X_trans_1, X_trans_2)
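
The reason this works is a grid alignment, not a general rule: numpy.percentile(X_train, .) is piecewise linear in the probability with breakpoints at k/(n_samples - 1) = k/99, and because 9999 = 101 * 99 the reference grid np.linspace(0, 1, 10000) still contains every breakpoint, so interpolating through the finer landmarks reproduces the same curve. A quick check of the two grids (my own variable names):

import numpy as np

grid_100 = np.linspace(0, 1, 100)      # breakpoints of the empirical percentile curve
grid_10000 = np.linspace(0, 1, 10000)  # landmarks used with n_quantiles=10000

# 9999 = 101 * 99, so every k/99 equals (101*k)/9999 and lies on the fine grid.
assert 9999 % 99 == 0
dist = np.min(np.abs(grid_100[:, None] - grid_10000[None, :]), axis=1)
assert (dist < 1e-12).all()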

Interestingly, if n_quantiles > n_samples is not chosen so that the finer grid of landmarks still contains the breakpoints of the empirical percentile curve, the linear interpolation done afterwards does not correspond to the numpy.percentile(X_train, .) estimator. This is not “wrong”, as these are all just estimators of the true quantile function/CDF, but I think it is confusing and it would be better to stick with the original estimator. For instance, the following raises an AssertionError.

import numpy as np
from sklearn.preprocessing import QuantileTransformer

n_samples = 100
rng = np.random.RandomState(0)  # seed for reproducibility
X_train = rng.randn(n_samples, 2)
X_test = rng.randn(1000, 2)

# n_quantiles = n_samples: landmarks sit exactly on the sorted training samples.
qf_1 = QuantileTransformer(n_quantiles=n_samples)
qf_1.fit(X_train)
X_trans_1 = qf_1.transform(X_test)

# 200 landmarks miss the original percentile breakpoints, so the
# interpolated transform differs from the numpy.percentile estimator.
qf_2 = QuantileTransformer(n_quantiles=200)
qf_2.fit(X_train)
X_trans_2 = qf_2.transform(X_test)

# Fails with the scikit-learn of the time; newer releases cap
# n_quantiles at n_samples, which makes the two transforms agree.
np.testing.assert_allclose(X_trans_1, X_trans_2)
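
And conversely for n_quantiles=200: since gcd(199, 99) = 1, none of the interior breakpoints k/99 fall on np.linspace(0, 1, 200), so the interpolant cuts the corners of the true percentile curve and the transforms disagree. The same kind of check (again my own variable names):

import numpy as np

grid_100 = np.linspace(0, 1, 100)  # breakpoints of the empirical percentile curve
grid_200 = np.linspace(0, 1, 200)  # landmarks used with n_quantiles=200

# gcd(199, 99) == 1, so apart from the endpoints 0 and 1 no breakpoint
# k/99 falls on the n_quantiles=200 grid.
dist = np.min(np.abs(grid_100[:, None] - grid_200[None, :]), axis=1)
assert (dist < 1e-12).sum() == 2  # only the two endpoints are shared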

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
GaelVaroquaux commented, Feb 28, 2019

Therefore I don’t think users should be able to choose n_quantiles > n_samples, and we should prevent them from thinking that the higher n_quantiles, the better the transformation.

+1 for dynamically downgrading n_quantiles to “self.n_quantiles_ = min(n_quantiles, n_samples)”, maybe with a warning.

However, -1 for raising an error: people might not know in advance how large the sample is.
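
For reference, a minimal sketch of what such dynamic downgrading could look like, assuming a hypothetical helper inside fit (names and warning message are mine; the eventual change may differ):

import warnings

def _effective_n_quantiles(n_quantiles, n_samples):
    # Hypothetical helper: cap n_quantiles at n_samples and warn
    # instead of raising an error.
    if n_quantiles > n_samples:
        warnings.warn(
            "n_quantiles (%d) is greater than the number of samples (%d); "
            "setting n_quantiles to n_samples." % (n_quantiles, n_samples)
        )
    return max(1, min(n_quantiles, n_samples))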

0 reactions
albertcthomas commented, Feb 28, 2019

Sounds good! I will open a PR.

Read more comments on GitHub >

Top Results From Across the Web

sklearn.preprocessing.QuantileTransformer
Number of quantiles to be computed. It corresponds to the number of landmarks used to discretize the cumulative distribution function. If n_quantiles is...

pyts.preprocessing.QuantileTransformer
The cumulative distribution function of a feature is used to project the original values. Note that this transform is non-linear. Parameters: n_quantiles :...

dask_ml.preprocessing.QuantileTransformer - Dask-ML
This implementation differs from the scikit-learn implementation by using approximate quantiles. The scikit-learn docstring follows. This method transforms the ...

How to Use Quantile Transforms for Machine Learning
Quantile transforms are a technique for transforming numerical input or output variables to have a Gaussian or uniform probability distribution.

Data Preparation for Machine Learning - EBIN.PUB
Some of these names may better fit as sub-tasks for the broader data ... API. https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html
