DOC Improve doc of n_quantiles in QuantileTransformer
See original GitHub issueDescription
The QuantileTransformer
uses numpy.percentile(X_train, .) as the estimator of the quantile function of the training data. To know this function perfectly we just need to take n_quantiles=n_samples
. Then it is just a linear interpolation (which is done in the code afterwards). Therefore I don’t think we should be able to choose n_quantiles > n_samples
and we should prevent users from thinking that the higher n_quantiles
the better the transformation. As mentioned by @GaelVaroquaux IRL it is however true that it can be relevant to choose n_quantiles < n_samples
when n_samples
is very large.
I suggest to add more information on the impact of n_quantiles
in the doc which currently reads:
Number of quantiles to be computed. It corresponds to the number of
landmarks used to discretize the cumulative distribution function.
For example using 100 times more landmarks result in the same transformation
import numpy as np
from sklearn.preprocessing import QuantileTransformer
from sklearn.utils.testing import assert_allclose
n_samples = 100
X_train = np.random.randn(n_samples, 2)
X_test = np.random.randn(1000, 2)
qf_1 = QuantileTransformer(n_quantiles=n_samples)
qf_1.fit(X_train)
X_trans_1 = qf_1.transform(X_test)
qf_2 = QuantileTransformer(n_quantiles=10000)
qf_2.fit(X_train)
X_trans_2 = qf_2.transform(X_test)
assert_allclose(X_trans_1, X_trans_2)
Interestingly if you do not choose n_quantiles > n_samples
correctly, the linear interpolation done afterwards does not correspond to the numpy.percentile(X_train, .) estimator. This is not “wrong” as these are only estimators of the true quantile function/cdf but I think it is confusing and would be better to stick with the original estimator. For instance, the following raises an AssertionError.
import numpy as np
from sklearn.preprocessing import QuantileTransformer
from sklearn.utils.testing import assert_allclose
n_samples = 100
X_train = np.random.randn(n_samples, 2)
X_test = np.random.randn(1000, 2)
qf_1 = QuantileTransformer(n_quantiles=n_samples)
qf_1.fit(X_train)
X_trans_1 = qf_1.transform(X_test)
qf_2 = QuantileTransformer(n_quantiles=200)
qf_2.fit(X_train)
X_trans_2 = qf_2.transform(X_test)
assert_allclose(X_trans_1, X_trans_2)
Issue Analytics
- State:
- Created 5 years ago
- Comments:5 (5 by maintainers)
Top GitHub Comments
+1 for dynamically downgrading n_quantiles to “self.n_quantiles_ = min(n_quantiles, n_samples)” maybe with a warning.
However, -1 for raising an error: people might not know in advance what the sample is.
Sounds good! I will open a PR.