BUG: Regression with StandardScaler due to #19527

Describe the bug

#19527 introduced a regression in StandardScaler when dealing with data of very small magnitude.

Steps/Code to Reproduce

In MNE-Python, some of our data channels have magnitudes in the ~1e-13 range. On 638b7689bbbfae4bcc4592c6f8a43ce86b571f0b or earlier, this code (which uses random data at several different scales) prints all True, which seems correct:

import numpy as np
from sklearn.preprocessing import StandardScaler

for scale in (1e15, 1e10, 1e5, 1, 1e-5, 1e-10, 1e-15):
    # Centered uniform data, rescaled to the target magnitude
    data = np.random.RandomState(0).rand(1000, 4) - 0.5
    data *= scale
    scaler = StandardScaler(with_mean=True, with_std=True)
    X = scaler.fit_transform(data)
    # Compare against a manual standardization of the same data
    stds = np.std(data, axis=0)
    means = np.mean(data, axis=0)
    print(np.allclose(X, (data - means) / stds, rtol=1e-7, atol=1e-7 * scale))

But on c748e465c76c43a173ad5ab2fd82639210f8e895 (i.e., after #19527), anything “too small” starts to fail: I get True for the first five scale factors and False for the last two (1e-10, 1e-15). Hence StandardScaler no longer standardizes the data.
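As a quick sanity check of where things go wrong (a minimal sketch; the expectation that scale_ stops matching the true std on the affected commit is my reading of the failure, not verified output), one can compare the fitted scale_ attribute with the per-feature standard deviation:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Same construction as the loop above, at the smallest failing scale
data = (np.random.RandomState(0).rand(1000, 4) - 0.5) * 1e-15
scaler = StandardScaler(with_mean=True, with_std=True).fit(data)
print(scaler.scale_)         # per-feature scaling actually applied
print(np.std(data, axis=0))  # what we would expect it to match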

cc @ogrisel since this came from your PR, and @maikia @rth @agramfort since you approved it

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 14 (14 by maintainers)

Top GitHub Comments

1 reaction
jeremiedbb commented, Mar 25, 2021

The computed variance is only accurate up to some precision. When the variance is very small and the computed value lies within the error bounds of the algorithm used to compute it, the returned value can’t be trusted: the relative error on the computed variance exceeds 100%, making it indistinguishable from a variance of 0. I think having a threshold below which the variance is considered 0 is correct, provided that the threshold corresponds to the error bounds.
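As a rough sketch of that rule (illustrative only, not scikit-learn’s actual code; error_bound is a placeholder for whatever bound the variance algorithm guarantees):

import numpy as np

def variance_or_zero(x, error_bound):
    # If the computed variance lies within the numerical error bound of
    # the algorithm, its relative error exceeds 100%, so it cannot be
    # distinguished from a true variance of 0 and is treated as such.
    var = np.var(x)
    return 0.0 if var < error_bound else var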

I think it will only break code that was not correct in the first place, e.g. code relying on a computed variance smaller than the error bounds of the algorithm.

However, the bound set in #19527 is far too large. For the current implementation of the variance, it should be n_samples**3 * mean(x)**2 * eps**2. But I found that we can improve the accuracy of our implementation with a small change, giving an upper bound of n_samples**4 * mean(x)**3 * eps**3. I’m working on this.
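To get a feel for the magnitudes, here is some plain arithmetic evaluating both quoted bounds on the data from the reproduction script (my own illustration; in particular, reading mean(x) as the absolute empirical mean is an assumption):

import numpy as np

eps = np.finfo(np.float64).eps  # ~2.22e-16 for float64
n = 1000
for scale in (1e-10, 1e-15):
    data = (np.random.RandomState(0).rand(n, 4) - 0.5) * scale
    mean = np.abs(data.mean(axis=0))
    var = data.var(axis=0)
    print(scale, var)
    print(n**3 * mean**2 * eps**2)  # bound quoted for the current implementation
    print(n**4 * mean**3 * eps**3)  # bound quoted for the improved implementation

Both bounds come out many orders of magnitude below the actual variances at these scales, consistent with the claim that the threshold in #19527 was far too large.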

0 reactions
ogrisel commented, Mar 30, 2021

Very nice! Thanks so much @jeremiedbb for taking the time to understand the root cause of the problem and design this fix.

Top Results From Across the Web

  • Prevent scalers to scale near-constant features very large ...
    The original problem happens when fitting StandardScaler(with_mean=False) with sample_weight on ... BUG: Regression with StandardScaler due to #19527 #19726.
  • Version 1.0.2 — scikit-learn 1.2.0 documentation
    This fixes a regression introduced in 1.0.0 with respect to 0.24.2. ... This often occurs due to changes in the modelling logic (bug...
  • Software regression - Wikipedia
    A software regression is a type of software bug where a feature that has worked before stops working. This may happen after changes...
