check_is_fitted gives false positive when extracted from ensemble classifier
Describe the bug
I trained several classifiers on my dataset and then created an ensemble (voting) classifier from them. Each of the estimators, stored at .estimators_, has been fit and used independently and within the ensemble. Yet even after extracting them from the ensemble, they fail a check_is_fitted test, so I cannot use them on their own in a context that checks for fit, or in another ensemble classifier.
Steps/Code to Reproduce
# Handle imports
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.validation import check_is_fitted
from copy import deepcopy
import numpy as np

# Generate a dummy dataset
y = np.random.choice([0, 1], size=50)
X = np.zeros((len(y), 100))
for idx, _y in enumerate(y):
    X[idx, :] = 10*(np.random.random((100)) - 0.5) + int(_y)*0.75 + 20*(np.random.random((100)) - 0.2)

yval = np.random.choice([0, 1], size=5)
Xval = np.zeros((len(yval), 100))

# Create and train classifiers across some folds
clf = Pipeline([('pca', PCA()), ('svm', SVC())])
cv = KFold(n_splits=5)
clfs = []
for idx, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    tmpclf = deepcopy(clf)
    tmpclf.fit(X_train, y_train)
    clfs += [('fold{0}'.format(idx), tmpclf)]
    print(tmpclf.score(X_test, y_test))
print(clfs)

# Create and initialize VotingClassifier
vclf = VotingClassifier(clfs)
vclf.estimators_ = [c[1] for c in clfs]  # pass pre-fit estimators
vclf.le_ = LabelEncoder().fit(yval)
vclf.classes_ = vclf.le_.classes_
print(vclf.score(Xval, yval))

# Finally, and this is where the error occurs, extract one of the original classifiers
orig_clf = vclf.estimators_[0]
print(orig_clf.score(Xval, yval))
check_is_fitted(orig_clf)
Expected Results
No error is thrown.
Actual Results
---------------------------------------------------------------------------
NotFittedError Traceback (most recent call last)
<ipython-input-77-d38acec5e641> in <module>
2
3 print(orig_clf.score(Xval, yval))
----> 4 check_is_fitted(orig_clf)
~/code/env/agg/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
~/code/env/agg/lib/python3.7/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
1017
1018 if not attrs:
-> 1019 raise NotFittedError(msg % {'name': type(estimator).__name__})
1020
1021
NotFittedError: This Pipeline instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
Versions
System:
python: 3.7.3 (default, Dec 13 2019, 19:58:14) [Clang 11.0.0 (clang-1100.0.33.17)]
executable: /Users/greg/code/env/agg/bin/python3
machine: Darwin-19.6.0-x86_64-i386-64bit
Python dependencies:
pip: 20.2.3
setuptools: 50.3.0
sklearn: 0.23.2
numpy: 1.19.2
scipy: 1.5.2
Cython: None
pandas: 1.1.3
matplotlib: 3.3.2
joblib: 0.17.0
threadpoolctl: 2.1.0
Built with OpenMP: True
Hey @gkiar,

The issue is unrelated to the ensemble; it comes from the Pipeline (meta-)estimator, which has no fitted attributes (i.e. attributes ending with a trailing underscore), so even a minimal snippet fails the check (see the sketch after this comment).

I have found the following comment in the Pipeline (meta-)estimator code:
https://github.com/scikit-learn/scikit-learn/blob/5d5329c473791c90ebc58b4a18a923d1e6c216b9/sklearn/pipeline.py#L263-L264

I think that we should change self.steps to self.steps_ to address this comment. The Pipeline (meta-)estimator would then provide fitted attributes, which should solve this issue. Before confirming this is a bug, I would like to know the opinion of a core developer (pinging @glemaitre, who previously worked on the Pipeline (meta-)estimator).
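A minimal failing snippet along these lines (a sketch reconstructed for illustration, assuming scikit-learn 0.23.x; not the exact code from the original comment):

# A fitted Pipeline exposes no trailing-underscore attributes on 0.23.x,
# so the generic attribute check in check_is_fitted finds nothing.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.utils.validation import check_is_fitted

X = np.random.random((10, 3))
y = np.array([0] * 5 + [1] * 5)

pipe = Pipeline([('svm', SVC())]).fit(X, y)
pipe.predict(X)        # works: the pipeline is fitted
check_is_fitted(pipe)  # raises NotFittedError on 0.23.x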
With scikit-learn 1.0, we introduced a new __sklearn_is_fitted__ API that is now used by Pipeline to denote whether it is fitted. The advantage of this is that it also allows "stateless" estimators such as FunctionTransformer to report that they are always fitted.

On 1.0, the snippet in https://github.com/scikit-learn/scikit-learn/issues/18648#issuecomment-937317447 and the original issue now work.
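For illustration, a minimal sketch of the __sklearn_is_fitted__ hook (the estimator below is made up for this example; assumes scikit-learn >= 1.0):

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_is_fitted

class AlwaysFittedClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical stateless estimator that reports itself as fitted."""

    def fit(self, X, y=None):
        # Nothing to learn, so no trailing-underscore attributes are set.
        return self

    def predict(self, X):
        return [0] * len(X)

    def __sklearn_is_fitted__(self):
        # check_is_fitted() consults this hook instead of looking for
        # fitted (trailing-underscore) attributes.
        return True

check_is_fitted(AlwaysFittedClassifier())  # passes: no NotFittedError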