BUG: Regression with StandardScaler due to #19527
Describe the bug
#19527 introduced a regression with StandardScaler when dealing with data with small magnitudes.
Steps/Code to Reproduce
In MNE-Python some of our data channels have magnitudes in the ~1e-13 range. On 638b7689bbbfae4bcc4592c6f8a43ce86b571f0b or before, this code (which uses random data of different scales) returns all True, which seems correct:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

for scale in (1e15, 1e10, 1e5, 1, 1e-5, 1e-10, 1e-15):
    data = np.random.RandomState(0).rand(1000, 4) - 0.5
    data *= scale
    scaler = StandardScaler(with_mean=True, with_std=True)
    X = scaler.fit_transform(data)
    stds = np.std(data, axis=0)
    means = np.mean(data, axis=0)
    print(np.allclose(X, (data - means) / stds, rtol=1e-7, atol=1e-7 * scale))
```
But on c748e465c76c43a173ad5ab2fd82639210f8e895 / after #19527, anything “too small” starts to fail: I get True for the first five scale factors and False for the last two (1e-10, 1e-15). Hence StandardScaler no longer standardizes the data.
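A direct way to see the failure mode is to inspect the fitted `scale_` attribute (a documented attribute of `StandardScaler`, and the quantity the transform divides by) against the true standard deviations. This is a hedged sketch; the clipped-to-one behavior in the trailing comment is what the issue describes for affected versions, not a guaranteed output on every release:

```python
# Sketch: for tiny-magnitude data, compare what StandardScaler will divide by
# (scaler.scale_) with the true per-feature standard deviations.
import numpy as np
from sklearn.preprocessing import StandardScaler

data = (np.random.RandomState(0).rand(1000, 4) - 0.5) * 1e-15
scaler = StandardScaler(with_mean=True, with_std=True).fit(data)

true_stds = np.std(data, axis=0)
print("fitted scale_:", scaler.scale_)
print("true stds:   ", true_stds)
# On affected versions the tiny variances are clipped to zero, so scale_ falls
# back to 1.0 and the transform leaves the small values effectively unscaled.
```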
cc @ogrisel since this came from your PR and @maikia @rth @agramfort since you approved the PR
Issue Analytics
- Created: 3 years ago
- Comments: 14 (14 by maintainers)
Top GitHub Comments
The computed variance is accurate up to some precision. When the variance is very small and the computed value lies within the error bounds of the algorithm used to compute it, the returned value can’t be trusted. In that case, the relative error on the computed variance is > 100% and thus is indistinguishable from a 0 variance. I think having a threshold below which the variance is considered 0 is correct (provided that the threshold corresponds to the error bounds).
I think it will only break code that was not correct in the first place, e.g. code relying on a computed variance smaller than the error bounds of the algorithm.
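The precision argument can be made concrete with a small self-contained illustration (this is not scikit-learn's implementation; it only shows that a computed variance below the algorithm's error bound carries no information):

```python
# Illustration: with a large mean and a tiny spread, the one-pass
# "E[x^2] - E[x]^2" variance formula loses every significant digit to
# cancellation, so the result is rounding noise rather than the true variance.
import numpy as np

rng = np.random.RandomState(0)
x = 1.0 + 1e-12 * rng.rand(1000)           # true variance is ~1e-25

naive_var = np.mean(x**2) - np.mean(x)**2  # catastrophic cancellation
two_pass_var = np.var(x)                   # NumPy subtracts the mean first

print("naive:   ", naive_var)
print("two-pass:", two_pass_var)
# The naive result is eps-sized noise, orders of magnitude off (it can even
# come out negative); its relative error exceeds 100%, which is exactly the
# regime where a computed value is indistinguishable from a zero variance.
```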
However the bound set in #19527 is far too big. For the current implementation of the variance, it should be `n_samples**3 * mean(x)**2 * eps**2`. But I found that we can improve the accuracy of our implementation with a small change to have an upper bound of `n_samples**4 * mean(x)**3 * eps**3`. I’m working on this.

Very nice! Thanks so much @jeremiedbb for taking the time to understand the root cause of the problem and design this fix.
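For a sense of scale, here is a hedged back-of-the-envelope evaluation of the two bounds quoted above on the 1e-15-scale data from the reproduction script. Interpreting `mean(x)` per feature (and taking its absolute value) is an assumption of this sketch, not something the comment specifies:

```python
# Sketch: plug the reproduction's smallest-scale data into the two quoted
# error bounds and compare them with the true per-feature variances.
import numpy as np

eps = np.finfo(np.float64).eps
n = 1000
data = (np.random.RandomState(0).rand(n, 4) - 0.5) * 1e-15

abs_mean = np.abs(np.mean(data, axis=0))
current_bound = n**3 * abs_mean**2 * eps**2   # bound for the current algorithm
improved_bound = n**4 * abs_mean**3 * eps**3  # bound after the proposed change
true_var = np.var(data, axis=0)

print("true variance:  ", true_var)
print("current bound:  ", current_bound)
print("improved bound: ", improved_bound)
# The true variances (~1e-31) sit far above either bound, so a correctly
# sized threshold would not treat these features as constant.
```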