Incoherent results of supposedly reproducible piece of code

I was giving a tutorial recently when a participant stopped me: random_state=0 was passed to train_test_split(), yet their score did not match mine. I asked all the participants for their scores. Some got the same value as me, down to the last decimal, and the others got slightly different values.

Here is the piece of code executed:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, 
    stratify=data.target, random_state=0)
lr = LogisticRegression().fit(X_train, y_train)
score = lr.score(X_test, y_test)

I do not know whether this behavior is expected, since I did not set random_state in LogisticRegression(). The documentation says it is not needed for the ‘lbfgs’ solver, which is the default in all the versions tested here.

Since this kind of reproducibility error cannot easily be tested with CI, I asked the participants for their system information. The reports, with the scores obtained, can be found below.
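The thread does not show how the reports were collected. A minimal sketch that prints the same fields, assuming the values come from the standard `sys` and `platform` modules and the installed package metadata, might look like this (the helper names `package_version` and `system_report` are hypothetical):

```python
import platform
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or a placeholder if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def system_report():
    """Build a text report with the same fields as the reports in this issue."""
    uname = platform.uname()
    lines = [
        f"Python version: {sys.version}",
        f"Numpy version: {package_version('numpy')}",
        f"Scikit-learn version: {package_version('scikit-learn')}",
        f"System: {uname.system}",
        f"Release: {uname.release}",
        f"Version: {uname.version}",
        f"Machine: {uname.machine}",
        f"Processor: {uname.processor}",
    ]
    return "\n".join(lines)


report = system_report()
print(report)
```

scikit-learn also provides `sklearn.show_versions()`, which prints a similar summary of the system and of the installed dependencies.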

Python version: 3.8.5 | packaged by conda-forge | (default, Sep 24 2020, 16:20:24) [MSC v.1916 64 bit (AMD64)]
Numpy version: 1.18.5
Scikit-learn version: 0.23.2
System: Windows
Release: 10
Version: 10.0.17763
Machine: AMD64
Processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel

Score of sklearn code: 0.9370629370629371

Python version: 3.8.5 | packaged by conda-forge | (default, Sep 16 2020, 17:19:16) [MSC v.1916 64 bit (AMD64)]
Numpy version: 1.18.5
Scikit-learn version: 0.23.2
System: Windows
Release: 10
Version: 10.0.17763
Machine: AMD64
Processor: Intel64 Family 6 Model 142 Stepping 12, GenuineIntel

Score of sklearn code: 0.9440559440559441

Score of sklearn code: 0.9300699300699301

Python version: 3.8.5 | packaged by conda-forge | (default, Aug 29 2020, 00:43:28) [MSC v.1916 64 bit (AMD64)]
Numpy version: 1.19.1
Scikit-learn version: 0.23.2
System: Windows
Release: 10
Version: 10.0.17763
Machine: AMD64
Processor: Intel64 Family 6 Model 142 Stepping 12, GenuineIntel

Score of sklearn code: 0.9230769230769231

Python version: 3.8.3 (default, Jul 2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
Numpy version: 1.18.5
Scikit-learn version: 0.23.1
System: Windows
Release: 10
Version: 10.0.17763
Machine: AMD64
Processor: Intel64 Family 6 Model 158 Stepping 13, GenuineIntel

Score of sklearn code: 0.9300699300699301

Python version: 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 18:22:52) [MSC v.1916 64 bit (AMD64)]
Numpy version: 1.19.1
Scikit-learn version: 0.23.2
System: Windows
Release: 10
Version: 10.0.17763
Machine: AMD64
Processor: Intel64 Family 6 Model 158 Stepping 13, GenuineIntel

Score of sklearn code: 0.9300699300699301

Python version: 3.8.5 | packaged by conda-forge | (default, Aug 29 2020, 00:43:28) [MSC v.1916 64 bit (AMD64)]
Numpy version: 1.18.5
Scikit-learn version: 0.23.2
System: Windows
Release: 10
Version: 10.0.17763
Machine: AMD64
Processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel

Score of sklearn code: 0.9370629370629371

Python version: 3.7.4 (default, Sep 18 2019, 19:37:15)
[Clang 10.0.1 (clang-1001.0.46.4)]
Numpy version: 1.18.4
Scikit-learn version: 0.23.2
System: Darwin
Release: 18.7.0
Version: Darwin Kernel Version 18.7.0: Mon Aug 31 20:53:32 PDT 2020; root:xnu-4903.278.44~1/RELEASE_X86_64
Machine: x86_64
Processor: i386

Score of sklearn code: 0.9300699300699301

_Originally posted by @aboucaud in https://github.com/scikit-learn/scikit-learn/issues/7139#issuecomment-705788880_

Top GitHub Comments

aboucaud commented, Nov 17, 2020

The results I got back from 5 of the participants so far tend to confirm @glemaitre's hunch. The 6 scores, including mine, are all the same: 0.958041… whereas they differed in the previous test without scaling.

Thanks @glemaitre and @NicolasHug !
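The exact code used for the scaled test is not shown in this thread. A minimal sketch of such a variant, assuming the features are standardized with `StandardScaler` inside a pipeline (the names `scaled_lr` and `scaled_score` are illustrative), could be:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target,
    stratify=data.target, random_state=0)

# Assumption: StandardScaler stands in for whatever scaling was used in the thread.
# The pipeline fits the scaler on the training split only and reuses those
# statistics when scoring on the test split.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
scaled_lr.fit(X_train, y_train)
scaled_score = scaled_lr.score(X_test, y_test)
print(f"Score with scaling: {scaled_score}")
```

Putting the scaler in a pipeline keeps the split and the estimator settings identical to the original snippet, so the only change between the two tests is the preprocessing step.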

NicolasHug commented, Nov 14, 2020

I get a convergence warning too, and no warning when standardizing the data.

I also get consistent results across executions in both cases. There would be a problem if we didn’t get reproducible results across executions on the same machine, but slight variations across machines are to be expected, especially when the model didn’t converge.

I’ll close the issue. Feel free to re-open it if you observe a strong signal indicating a problem here.
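One way to check for the convergence problem described above is to record warnings during the fit and to compare the number of iterations used with the `max_iter` limit. The following is only a sketch: the helper `fit_and_check` is hypothetical, and scaling the training data with `StandardScaler` is an assumption about what "standardizing" meant here.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def fit_and_check(X, y):
    """Fit a default LogisticRegression and report whether it warned."""
    model = LogisticRegression()
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure the warning is not filtered out
        model.fit(X, y)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return {
        "warned": warned,                    # True if a ConvergenceWarning was raised
        "n_iter": int(model.n_iter_.max()),  # iterations actually run by the solver
        "max_iter": model.max_iter,          # iteration limit of the estimator
    }


data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target,
    stratify=data.target, random_state=0)

raw = fit_and_check(X_train, y_train)
scaled = fit_and_check(StandardScaler().fit_transform(X_train), y_train)
print("raw:", raw)
print("scaled:", scaled)
```

If the unscaled fit reports a warning and an `n_iter` equal to `max_iter`, the solver stopped at the iteration limit rather than at an optimum, which is the situation in which small numerical differences between machines can change the final coefficients and therefore the score.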