check_is_fitted gives false positive when extracted from ensemble classifier

Describe the bug

I trained several classifiers on my dataset and then created an ensemble classifier (a VotingClassifier) from them. Each of the estimators stored in .estimators_ has been fit and used, both independently and within the ensemble. Even so, after I extract them from the ensemble they fail a check_is_fitted test, so I cannot use them on their own in a context that checks for fit, or in another ensemble classifier.

Steps/Code to Reproduce

# Handle imports
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.validation import check_is_fitted
from copy import deepcopy
import numpy as np

# Generate a dummy dataset
y = np.random.choice([0, 1], size=50)
X = np.zeros((len(y), 100))
for idx, _y in enumerate(y):
    X[idx, :] = 10*(np.random.random((100)) - 0.5) + int(_y)*0.75 + 20 * (np.random.random((100)) - 0.2)

yval = np.random.choice([0, 1], size=5)
Xval = np.zeros((len(yval), 100))

# Create and train classifiers across some folds
clf = Pipeline([('pca', PCA()), ('svm', SVC())])
cv = KFold(n_splits=5)

clfs = []
for idx, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    
    tmpclf = deepcopy(clf)
    tmpclf.fit(X_train, y_train)
    clfs += [('fold{0}'.format(idx), tmpclf)]
    
    print(tmpclf.score(X_test, y_test))
print(clfs)

# Create and initialize VotingClassifier
vclf = VotingClassifier(clfs)

vclf.estimators_ = [c[1] for c in clfs]  # pass pre-fit estimators
vclf.le_ = LabelEncoder().fit(yval)
vclf.classes_ = vclf.le_.classes_

print(vclf.score(Xval, yval))

# Finally, and this is where the error occurs, extract original classifiers
orig_clf = vclf.estimators_[0]

print(orig_clf.score(Xval, yval))
check_is_fitted(orig_clf)

Expected Results

No error is thrown.

Actual Results

---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
<ipython-input-77-d38acec5e641> in <module>
      2 
      3 print(orig_clf.score(Xval, yval))
----> 4 check_is_fitted(orig_clf)

~/code/env/agg/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~/code/env/agg/lib/python3.7/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
   1017 
   1018     if not attrs:
-> 1019         raise NotFittedError(msg % {'name': type(estimator).__name__})
   1020 
   1021 

NotFittedError: This Pipeline instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

Versions

System:
    python: 3.7.3 (default, Dec 13 2019, 19:58:14)  [Clang 11.0.0 (clang-1100.0.33.17)]
executable: /Users/greg/code/env/agg/bin/python3
   machine: Darwin-19.6.0-x86_64-i386-64bit

Python dependencies:
          pip: 20.2.3
   setuptools: 50.3.0
      sklearn: 0.23.2
        numpy: 1.19.2
        scipy: 1.5.2
       Cython: None
       pandas: 1.1.3
   matplotlib: 3.3.2
       joblib: 0.17.0
threadpoolctl: 2.1.0

Built with OpenMP: True

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

alfaro96 commented, Oct 20, 2020

Hey @gkiar,

The issue is not related to the ensemble. It comes from the Pipeline (meta-)estimator, which has no fitted attributes (attributes ending with a trailing underscore).

The following code snippet fails:

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import check_is_fitted

X, y = load_iris(return_X_y=True)
model = make_pipeline(DecisionTreeClassifier(random_state=0))
clf = model.fit(X, y)

check_is_fitted(clf)

I have found the following comment in the Pipeline (meta-)estimator code:

https://github.com/scikit-learn/scikit-learn/blob/5d5329c473791c90ebc58b4a18a923d1e6c216b9/sklearn/pipeline.py#L263-L264

I think that we should replace self.steps with self.steps_ to address this comment. The Pipeline (meta-)estimator would then provide fitted attributes, which should solve this issue.

Before confirming this is a bug, I would like to know the opinion of a core developer (pinging @glemaitre, who previously worked on the Pipeline (meta-)estimator).
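
The explanation in the comment above can be illustrated without scikit-learn at all. The sketch below is a simplified stand-in, not the library's code: `ToyModel`, `ToyPipeline`, and `fitted_attribute_names` are hypothetical names invented for illustration. It only mimics the heuristic visible in the traceback (`if not attrs:`), namely looking for instance attributes that end with an underscore, and shows why a wrapper that keeps all fitted state inside its steps appears unfitted.

```python
class ToyModel:
    """Hypothetical leaf estimator: fit() sets a trailing-underscore attribute."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self


class ToyPipeline:
    """Hypothetical meta-estimator: fitted state lives only inside its steps."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None):
        for step in self.steps:
            step.fit(X, y)
        return self


def fitted_attribute_names(estimator):
    # Assumed simplification of the check: collect instance attributes
    # that end with "_" but are not dunder names.
    return [
        name
        for name in vars(estimator)
        if name.endswith("_") and not name.startswith("__")
    ]


model = ToyModel()
pipe = ToyPipeline([model]).fit([1.0, 2.0, 3.0])

model_attrs = fitted_attribute_names(model)
pipe_attrs = fitted_attribute_names(pipe)

print(model_attrs)  # ['mean_']
print(pipe_attrs)   # []
```

The inner model exposes `mean_`, so it passes the heuristic. The wrapper only has `steps`, so the list is empty and a check built on this heuristic reports it as not fitted, even though every step was fit.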

thomasjpfan commented, Oct 8, 2021

With scikit-learn 1.0, we introduced a new __sklearn_is_fitted__ API that Pipeline currently uses to denote whether it is fitted. The advantage is that it allows “stateless” estimators such as FunctionTransformer to say that they are always fitted.

On 1.0, the snippet in https://github.com/scikit-learn/scikit-learn/issues/18648#issuecomment-937317447 and the original issue now work.
