QuantileTransformer quantiles can be unordered because of rounding errors which cause np.interp to return nonsense results

Description

At inference time, QuantileTransformer uses np.interp. The documentation of this function states: "Does not check that the x-coordinate sequence xp is increasing. If xp is not increasing, the results are nonsense." Within QuantileTransformer, xp holds the quantiles. For np.interp to behave correctly, the quantiles stored in self.quantiles_ must be ordered, i.e. np.all(np.diff(self.quantiles_, axis=0) >= 0) must hold.
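As a rough illustration of the ordering condition above, the check can be wrapped in a small helper. The helper name and the toy arrays are assumptions made for this sketch and do not come from scikit-learn or from the gist; the tiny 1e-12 offset only mimics the kind of rounding error described here.

```python
import numpy as np


def quantiles_are_ordered(quantiles):
    """Return True if every column of `quantiles` is non-decreasing.

    Hypothetical helper: it applies the same check as the issue,
    np.all(np.diff(quantiles, axis=0) >= 0), to a 2-D array shaped
    like QuantileTransformer.quantiles_ (n_quantiles, n_features).
    """
    return bool(np.all(np.diff(quantiles, axis=0) >= 0))


# Toy single-feature quantiles (made up for illustration).
# In xp_bad the third entry is smaller than the second by 1e-12.
xp_bad = np.array([[0.0], [1.0], [1.0 - 1e-12], [2.0]])
xp_good = np.array([[0.0], [1.0], [1.0], [2.0]])

ordered_bad = quantiles_are_ordered(xp_bad)
ordered_good = quantiles_are_ordered(xp_good)
print(ordered_bad, ordered_good)  # False True
```

Running this kind of check on a fitted transformer's quantiles_ right after fit is one way to detect the problem before calling transform.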

I’ve found that, because of rounding errors, this sometimes does not hold. This is a serious issue because it makes inference behave erratically (for instance, the same sample can be transformed differently depending on its position within the input), which is confusing and very hard to debug.

Steps/Code to Reproduce

Finding a minimal example is really hard. I will provide an example I’ve managed to isolate that reproduces the issue every time on my machine. However, since the issue comes from a very tiny rounding error and this feature makes use of randomness (for sampling), I can only hope that it does not depend on hardware.

Here is a gist that defines an array of shape (300, 2). I can reproduce the bug with the following code:

import numpy as np
from sklearn.preprocessing import QuantileTransformer
X = np.loadtxt('gistfile1.txt', delimiter=',')
quantile_transformer = QuantileTransformer(n_quantiles=150).fit(X)
print(np.all(np.diff(quantile_transformer.quantiles_, axis=0) >= 0))

Expected Results

The previous code should print True

Actual Results

It prints False

Versions

I have taken note of the fixes of QuantileTransformer in 21.3 (ensuring that n_quantiles <= n_samples) and I have already checked that it is unrelated. It can be seen in the minimal example that the input has 300 samples and the parameter n_quantiles is set to 150 anyway.

[GCC 5.4.0 20160609]
NumPy 1.15.4
SciPy 1.3.3
Scikit-Learn 0.19.2

Quickfix

I haven’t investigated deeply enough to understand the cause of the rounding error. Here is a quick and dirty fix for anyone who runs into the same issue: if the quantiles are unordered, replace them with something like np.minimum.accumulate(quantile_transformer.quantiles_[::-1])[::-1] (I think it’s better than forcing a sort).
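A minimal sketch of that quick fix on a toy array, assuming the quantiles are stored with shape (n_quantiles, n_features) as in quantiles_. The function name and the sample values are hypothetical; only np.minimum.accumulate and the reversed slicing come from the suggestion above.

```python
import numpy as np


def enforce_monotonic(quantiles):
    """Make each column non-decreasing with a reversed running minimum.

    Hypothetical helper. Walking from the last quantile to the first,
    each entry is clipped to the smallest value seen so far, so no
    entry can exceed the ones that follow it.
    """
    return np.minimum.accumulate(quantiles[::-1], axis=0)[::-1]


# Toy quantiles for two features (made up for illustration).
# Column 0 has a 1e-12 ordering violation; column 1 is already ordered.
q = np.array([
    [0.0, 5.0],
    [1.0, 6.0],
    [1.0 - 1e-12, 6.0],
    [2.0, 7.0],
])

fixed = enforce_monotonic(q)
print(np.all(np.diff(fixed, axis=0) >= 0))  # True
```

Unlike a sort, this keeps already ordered columns unchanged and only lowers the entries that sit above a later quantile, so the adjustment stays on the scale of the rounding error.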

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

1 reaction
glemaitre commented, Dec 12, 2019

A potential fix was submitted there: https://github.com/numpy/numpy/pull/15098

0 reactions
fcharras commented, Dec 3, 2019

I’ll keep an eye on the issue but I am indeed short on time, sorry. It’s awesome if you can loop.
