BUG: Regression with StandardScaler due to #19527
Describe the bug
#19527 introduced a regression with StandardScaler when dealing with data with small magnitudes.
Steps/Code to Reproduce
In MNE-Python some of our data channels have magnitudes in the ~1e-13 range. On 638b7689bbbfae4bcc4592c6f8a43ce86b571f0b or before, this code (which uses random data of different scales) returns all True, which seems correct:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

for scale in (1e15, 1e10, 1e5, 1, 1e-5, 1e-10, 1e-15):
    data = np.random.RandomState(0).rand(1000, 4) - 0.5
    data *= scale
    scaler = StandardScaler(with_mean=True, with_std=True)
    X = scaler.fit_transform(data)
    stds = np.std(data, axis=0)
    means = np.mean(data, axis=0)
    print(np.allclose(X, (data - means) / stds, rtol=1e-7, atol=1e-7 * scale))
```
But on c748e465c76c43a173ad5ab2fd82639210f8e895 / after #19527, anything “too small” starts to fail: I get True for the first five scale factors and False for the last two (1e-10, 1e-15). Hence StandardScaler no longer standardizes the data.
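A direct way to see the failure mode is to inspect the fitted `scale_` attribute (a documented attribute of `StandardScaler`, and the quantity the transform divides by) against the true standard deviations. This is a hedged sketch; the clipped-to-one behavior in the trailing comment is what the issue describes for affected versions, not a guaranteed output on every release:

```python
# Sketch: for tiny-magnitude data, compare what StandardScaler will divide by
# (scaler.scale_) with the true per-feature standard deviations.
import numpy as np
from sklearn.preprocessing import StandardScaler

data = (np.random.RandomState(0).rand(1000, 4) - 0.5) * 1e-15
scaler = StandardScaler(with_mean=True, with_std=True).fit(data)

true_stds = np.std(data, axis=0)
print("fitted scale_:", scaler.scale_)
print("true stds:   ", true_stds)
# On affected versions the tiny variances are clipped to zero, so scale_ falls
# back to 1.0 and the transform leaves the small values effectively unscaled.
```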
cc @ogrisel since this came from your PR and @maikia @rth @agramfort since you approved the PR
Issue Analytics
- Created: 3 years ago
- Comments: 14 (14 by maintainers)
Top GitHub Comments
The computed variance is accurate up to some precision. When the variance is very small and the computed value lies within the error bounds of the algorithm used to compute it, the returned value can’t be trusted. In that case, the relative error on the computed variance is > 100% and thus is indistinguishable from a 0 variance. I think having a threshold below which the variance is considered 0 is correct (provided that the threshold corresponds to the error bounds).
I think it will only break code that was not correct in the first place, e.g. code relying on a computed variance smaller than the error bounds of the algorithm.
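The precision argument can be made concrete with a small self-contained illustration (this is not scikit-learn's implementation; it only shows that a computed variance below the algorithm's error bound carries no information):

```python
# Illustration: with a large mean and a tiny spread, the one-pass
# "E[x^2] - E[x]^2" variance formula loses every significant digit to
# cancellation, so the result is rounding noise rather than the true variance.
import numpy as np

rng = np.random.RandomState(0)
x = 1.0 + 1e-12 * rng.rand(1000)           # true variance is ~1e-25

naive_var = np.mean(x**2) - np.mean(x)**2  # catastrophic cancellation
two_pass_var = np.var(x)                   # NumPy subtracts the mean first

print("naive:   ", naive_var)
print("two-pass:", two_pass_var)
# The naive result is eps-sized noise, orders of magnitude off (it can even
# come out negative); its relative error exceeds 100%, which is exactly the
# regime where a computed value is indistinguishable from a zero variance.
```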
However the bound set in #19527 is far too big. For the current implementation of the variance, it should be `n_samples**3 * mean(x)**2 * eps**2`. But I found that we can improve the accuracy of our implementation with a small change to have an upper bound of `n_samples**4 * mean(x)**3 * eps**3`. I’m working on this.

Very nice! Thanks so much @jeremiedbb for taking the time to understand the root cause of the problem and design this fix.
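For a sense of scale, here is a hedged back-of-the-envelope evaluation of the two bounds quoted above on the 1e-15-scale data from the reproduction script. Interpreting `mean(x)` per feature (and taking its absolute value) is an assumption of this sketch, not something the comment specifies:

```python
# Sketch: plug the reproduction's smallest-scale data into the two quoted
# error bounds and compare them with the true per-feature variances.
import numpy as np

eps = np.finfo(np.float64).eps
n = 1000
data = (np.random.RandomState(0).rand(n, 4) - 0.5) * 1e-15

abs_mean = np.abs(np.mean(data, axis=0))
current_bound = n**3 * abs_mean**2 * eps**2   # bound for the current algorithm
improved_bound = n**4 * abs_mean**3 * eps**3  # bound after the proposed change
true_var = np.var(data, axis=0)

print("true variance:  ", true_var)
print("current bound:  ", current_bound)
print("improved bound: ", improved_bound)
# The true variances (~1e-31) sit far above either bound, so a correctly
# sized threshold would not treat these features as constant.
```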