predict values sometimes different from top predict_proba entries?
Describe the bug
I’m not actually sure if this is a bug or my own misunderstanding, but I would expect the argmax of each row of predict_proba to be exactly equal to the value in the corresponding row of predict. However, this is not what I’m seeing. Instead, the values are equal up until ~25, at which point they start to diverge by 1, then by 2 at ~50, and by 3 at ~115.
Steps/Code to Reproduce
If this is actually a bug, I can provide the actual data; for now, this is my pipeline:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier

def new_text_clf():
    return Pipeline(
        [
            # StemmedCountVectorizer is a custom CountVectorizer subclass (not part of scikit-learn)
            ("vect", StemmedCountVectorizer(ngram_range=(1, 2), max_df=0.8, min_df=3)),
            ("tfidf", TfidfTransformer(use_idf=True)),
            (
                "clf",
                SGDClassifier(
                    loss="modified_huber", penalty="l2", alpha=0.001, random_state=42
                ),
            ),
        ]
    )
And I’m running:
> from sklearn.datasets import load_files
> train = load_files(train_path, encoding="utf-8", decode_error="replace", shuffle=True, random_state=42)
> test = load_files(test_path, encoding="utf-8", decode_error="replace", shuffle=True, random_state=42)
> train.target_names == test.target_names
True
> text_clf = new_text_clf().fit(train.data, train.target)
> [*test.target]
[61, 13, 11, 89, 11, 71, 118, 33, 52, 57, 16, 57, 100, 24, ...]
> [*text_clf.predict(test.data)]
[61, 16, 11, 16, 11, 89, 26, 33, 16, 57, 16, 118, 11, 26, ...]
> [max(enumerate(prob), key=lambda p: p[1])[0] for prob in text_clf.predict_proba(test.data)]
[60, 16, 11, 16, 11, 87, 25, 32, 16, 56, 16, 115, 11, 25, ...]
Expected Results
I would expect [*text_clf.predict(test.data)] to exactly equal [max(enumerate(prob), key=lambda p: p[1])[0] for prob in text_clf.predict_proba(test.data)].
Actual Results
The two lists diverge, and the gap grows as the category index increases.
Versions
System:
    python: 3.7.7 (default, Mar 10 2020, 15:43:03) [Clang 11.0.0 (clang-1100.0.33.17)]
executable: /usr/local/opt/python/bin/python3.7
   machine: Darwin-18.7.0-x86_64-i386-64bit

Python dependencies:
       pip: 20.0.2
setuptools: 46.0.0
   sklearn: 0.22.2.post1
     numpy: 1.18.3
     scipy: 1.4.1
    Cython: None
    pandas: None
matplotlib: 3.2.1
    joblib: 0.14.1

Built with OpenMP: True
Edit:
I notice len(text_clf.predict_proba(test.data)[0]) is 119, whereas len(test.target_names) is 122. This is likely related: my guess is there’s a mapping between predict_proba column indexes and target_names indexes that isn’t a simple 1:1, and it’s not being applied correctly. This is probably because 3 categories have no training or test data (I’m not sure which). But I do think this is still a bug.
I believe the problem is that this logic, which maps the argmax index back through self.classes_:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_base.py#L312
is missing from the predict_proba implementation: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_stochastic_gradient.py#L1025-L1055
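If that’s right, routing the argmax back through classes_ should reconcile the two; a minimal sketch of that workaround (np.argmax replaces the enumerate idiom above, and Pipeline exposes the final estimator’s classes_):

import numpy as np

proba = text_clf.predict_proba(test.data)
# Each argmax is a column index into classes_, not a class label; map it back.
top = text_clf.classes_[np.argmax(proba, axis=1)]
assert np.array_equal(top, text_clf.predict(test.data))

# The 3 labels absent from classes_, i.e. labels that never appear in the training data:
missing = sorted(set(range(len(train.target_names))) - set(text_clf.classes_))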
Top GitHub Comments
My bad, I probably read too fast. The kind of issue in the title, “predict values sometimes different from top predict_proba”, is one that comes up regularly, and 99% of the time it’s because of ties, and I got influenced by that. I see now that it’s documented that classes are ordered as they are in ‘self.classes_’.

Though this was surprising to me, and seemingly to you both, as from the get-go I said that the problem went away after passing through classes_, and you said that that wasn’t the expected behaviour, but rather that there might be issues due to ties or numerical instability. Have a nice day.
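For context on the ties point: when two classes have exactly equal probabilities, np.argmax silently picks the first, so naive argmax comparisons can disagree even when nothing is wrong. A tiny illustration (the probability row is made up):

import numpy as np

proba = np.array([[0.4, 0.4, 0.2]])  # classes 0 and 1 are tied
np.argmax(proba, axis=1)             # array([0]): ties resolve to the lowest index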