scikit-learn custom transformer is raising NotFittedError
Describe the bug
I was experimenting with scikit-learn after updating from 0.21.1 to 1.0.2 and found that my custom transformer had stopped working. I wonder what changed in version 1.0.2 to cause this. Is there a workaround? The code below reproduces the issue:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
class CustomVectorizer(BaseEstimator, TransformerMixin):
    """This class will perform the Vectorization."""
    def __init__(self, custom_content=None, custom_keyphrases=None):
        self.custom_content = custom_content
        self.custom_keyphrases = custom_keyphrases
    def fit(self, X, y=None, *args, **kwargs):
        """fit"""
        return self
    def transform(self, X, y=None, **transform_params):
        """Transform"""
        tfidf_custom_content = self.custom_content.transform(X.content).todense().astype(np.float32)
        tfidf_custom_keyphrases = self.custom_keyphrases.transform(X.KeyPhrases.values.astype('U')).todense().astype(np.float32)
        tf_idf_X = np.hstack((tfidf_custom_content, tfidf_custom_keyphrases))
        return tf_idf_X
class CustomSVD(BaseEstimator, TransformerMixin):
    """This class will perform the dimensionality reduction"""
    def __init__(self, tsvd=None, reduce_dim=True):
        self.tsvd = tsvd
        self.reduce_dim = reduce_dim
    def fit(self, X, y=None, *args, **kwargs):
        """fit"""
        return self
    def transform(self, X, y=None, **transform_params):
        """transform"""
        if self.reduce_dim:
            self.new_xtrain = self.tsvd.transform(X)
            return self.new_xtrain.astype(np.float32)
        else:
            return X.astype(np.float32)
example_dict = {"content": ["dfg sdfsd f rtygdfgdf sf sdf sdfdg df", "dfgdf sfdsd fs ertger sd g",
                            "dfgfdgdf fdgdf gfhfhrt", "fghgf c xzcvxwerkjwhx"],
                "KeyPhrases": ["sdfsd erfsd fsdf", " dfgdf ewrwe wef h dfh",
                               "fghfd wesdofjhcxlk sdf", "dfg dfg werwe"],
                "output": ["pass", "fail", "pass", "fail"]}
df = pd.DataFrame(example_dict)
print(df)
X = df[["content", "KeyPhrases"]]
y = df[["output"]]
tf_content = TfidfVectorizer(sublinear_tf=True, smooth_idf=True, stop_words='english')
tf_keyphrase = TfidfVectorizer(sublinear_tf=True, smooth_idf=True, stop_words='english')
tfidf_content = tf_content.fit_transform(X.content).toarray()
tfidf_keyphrases = tf_keyphrase.fit_transform(X.KeyPhrases.values.astype('U')).toarray()
X_temp = np.hstack((tfidf_content, tfidf_keyphrases))
# On top of X_temp I applied TruncatedSVD to get n_components for some threshold
# using explained variance; here I'm hardcoding n_components and fitting it.
tsvd = TruncatedSVD(n_components = 20, random_state=42)
X_new = tsvd.fit_transform(X_temp)
# pipeline
# Note: tf_content, tf_keyphrase and tsvd are already fitted and I'm passing them to
# the custom vectorizer and custom SVD, hence I'm not doing anything in the fit method.
# `cls` is a classifier.
rf_pipline = Pipeline([
    ('vectorizer', CustomVectorizer(tf_content, tf_keyphrase)),
    ('reduce_dim', CustomSVD(tsvd=tsvd)),
    ('rf_classifier', RandomForestClassifier())])
rf_search = {
    'vectorizer': [CustomVectorizer(tf_content, tf_keyphrase)],
    'reduce_dim': [CustomSVD(tsvd=tsvd)],
    'rf_classifier': [RandomForestClassifier()],
    'rf_classifier__n_estimators': [10, 20],
}
cls_pipeline = RandomizedSearchCV(rf_pipline, rf_search, n_iter=2, cv=2, verbose=1)
cls_pipeline.fit(X,y)
print(cls_pipeline.score(X,y))
print(cls_pipeline.predict_proba(X))
This was working in scikit-learn 0.21.1 as expected and giving the below output:
------------------------
>>>1.0
[[0.3 0.7]
[0.6 0.4]
[0.2 0.8]
[0.8 0.2]]
But in scikit-learn 1.0.2, I’m getting the below error:
----------------------------------------------------------------------------------------------------------------------
Traceback (most recent call last):
File "sklearn_custom_transformer_issue.py", line 106, in <module>
cls_pipeline.fit(X,y)
File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\model_selection\_search.py", line 926, in fit
self.best_estimator_.fit(X, y, **fit_params)
File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 390, in fit
Xt = self._fit(X, y, **fit_params_steps)
File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 348, in _fit
X, fitted_transformer = fit_transform_one_cached(
File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\joblib\memory.py", line 349, in __call__
return self.func(*args, **kwargs)
File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 893, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\base.py", line 855, in fit_transform
return self.fit(X, y, **fit_params).transform(X)
File "sklearn_custom_transformer_issue.py", line 36, in transform
tfidf_custom_content = self.custom_content.transform(X.content).todense().astype(np.float32)
File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\feature_extraction\text.py", line 2099, in transform
check_is_fitted(self, msg="The TF-IDF vectorizer is not fitted")
File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\utils\validation.py", line 1222, in check_is_fitted
raise NotFittedError(msg % {"name": type(estimator).__name__})
sklearn.exceptions.NotFittedError: The TF-IDF vectorizer is not fitted
Also, when I define a second classifier and combine both in a voting classifier as shown below, I get a NotFittedError in scikit-learn 1.0.2, although the same code worked with scikit-learn 0.21.1:
# another pipeline
dc_pipline1 = Pipeline([
    ('vectorizer', CustomVectorizer(tf_content, tf_keyphrase)),
    ('reduce_dim', CustomSVD(tsvd=tsvd)),
    ('dc_classifier', DecisionTreeClassifier())])
dc_search = {
    'vectorizer': [CustomVectorizer(tf_content, tf_keyphrase)],
    'reduce_dim': [CustomSVD(tsvd=tsvd)],
    'dc_classifier': [DecisionTreeClassifier()],
    'dc_classifier__max_depth': [4, 10],
}
cls_pipeline1 = RandomizedSearchCV(dc_pipline1, dc_search, n_iter=2, cv=2, verbose=1)
Vot_cls = VotingClassifier(estimators=[('rf', cls_pipeline),
                                       ('dt', cls_pipeline1)],
                           voting='soft')
Vot_cls.fit(X, y)
----------------------------------------------------------------------------------------------------------------------------
Traceback (most recent call last):
File "temp.py", line 88, in <module>
Vot_cls.fit(X, y)
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\ensemble\_voting.py", line 292, in fit
return super().fit(X, transformed_y, sample_weight)
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\ensemble\_voting.py", line 74, in fit
self.estimators_ = Parallel(n_jobs=self.n_jobs)(
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 1041, in __call__
if self.dispatch_one_batch(iterator):
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 859, in dispatch_one_batch
self._dispatch(tasks)
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 777, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\_parallel_backends.py", line 572, in __init__
self.results = batch()
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 262, in __call__
return [func(*args, **kwargs)
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
return [func(*args, **kwargs)
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\utils\fixes.py", line 222, in __call__
return self.function(*args, **kwargs)
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\ensemble\_base.py", line 39, in _fit_single_estimator
estimator.fit(X, y)
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 341, in fit
Xt = self._fit(X, y, **fit_params_steps)
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 303, in _fit
X, fitted_transformer = fit_transform_one_cached(
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\memory.py", line 352, in __call__
return self.func(*args, **kwargs)
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 754, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\base.py", line 702, in fit_transform
return self.fit(X, y, **fit_params).transform(X)
File "temp.py", line 22, in transform
tfidf_custom_content = self.custom_content.transform(X.content).todense().astype(np.float32)
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\feature_extraction\text.py", line 1872, in transform
check_is_fitted(self, msg='The TF-IDF vectorizer is not fitted')
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\utils\validation.py", line 1041, in check_is_fitted
raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.exceptions.NotFittedError: The TF-IDF vectorizer is not fitted
Just wondering what changed when the pipelines are called through the voting classifier. Does it clone the estimator, so that the fitted TF-IDF instance is no longer passed to the custom vectorizer?
Versions
scikit-learn=0.24.1 numpy=1.19.2 scipy=1.6.0 pandas=1.2.1 platform: Windows_x64 Python=3.6.10
Note: Code was working fine in scikit-learn version: 0.21.1 , numpy: 1.18.1, scipy: 1.3.1 , pandas:0.25.1.
Reproducible code: code
Issue Analytics
- Created 2 years ago
- Comments:14 (7 by maintainers)
Top GitHub Comments
Our Pipeline does not follow our own convention: https://github.com/scikit-learn/scikit-learn/issues/8157. We usually always clone the parameters in the constructor. The only estimator that does not do that is Pipeline. However, the VotingClassifier will clone each pipeline first and the inner estimator. Therefore the TfidfVectorizer will get cloned and will be equivalent to an unfitted estimator.
I assume that we should fix our Pipeline, but this is not straightforward since users are relying on these features. If we have a Pipeline that clones its steps, then we should probably look at how to freeze estimators (https://github.com/scikit-learn/scikit-learn/issues/8370) such that they don’t get unfitted during cloning.
We still don’t have a minimal reproducible example here.
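The cloning behaviour described in the first comment can be checked in isolation. The sketch below is not taken from the issue: the toy corpus and the helper can_transform are assumptions made for illustration. It wraps an already fitted TfidfVectorizer in a Pipeline, clones the pipeline with sklearn.base.clone, and checks whether the cloned vectorizer can still transform.

```python
from sklearn.base import clone
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy corpus, made up for this sketch (not the data from the issue).
docs = ["red apple", "green pear", "red pear"]

# Fit the vectorizer up front, as the issue does, then put it in a pipeline.
fitted_vec = TfidfVectorizer().fit(docs)
pipe = Pipeline([("vec", fitted_vec)])

# clone() rebuilds each estimator from its constructor parameters only,
# so learned attributes such as vocabulary_ are not carried over.
cloned_pipe = clone(pipe)
cloned_vec = cloned_pipe.named_steps["vec"]


def can_transform(vectorizer):
    """Hypothetical helper: True if transform() works, False on NotFittedError."""
    try:
        vectorizer.transform(docs)
    except NotFittedError:
        return False
    return True


original_ok = can_transform(fitted_vec)
clone_ok = can_transform(cloned_vec)
print(original_ok, clone_ok)
```

Any meta-estimator that clones what it is given before fitting would hit the same thing, which is consistent with the NotFittedError in both tracebacks.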
You can use __sklearn_is_fitted__ to check if the sub-estimator is fitted and return true. But your code above loads stuff from a file, which it shouldn’t; that should be done in fit. I’m closing this, will re-open once we have a minimal reproducible example.
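One way to read the two suggestions above is sketched here, as an illustration rather than the maintainers' own fix. The class name TwoColumnTfidf, its parameter names, and the toy DataFrame are hypothetical. The sub-vectorizers are fitted inside fit on clones of unfitted templates, and __sklearn_is_fitted__ reports whether that has happened, so cloning by a search or ensemble no longer loses state that was expected to exist before fit.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils.validation import check_is_fitted


# Mixin listed first, following the scikit-learn developer guide.
class TwoColumnTfidf(TransformerMixin, BaseEstimator):
    """Hypothetical variant of CustomVectorizer that learns inside fit()."""

    def __init__(self, content_vectorizer=None, keyphrase_vectorizer=None):
        # Unfitted templates, stored unchanged so get_params()/clone() work.
        self.content_vectorizer = content_vectorizer
        self.keyphrase_vectorizer = keyphrase_vectorizer

    def fit(self, X, y=None):
        content = self.content_vectorizer
        if content is None:
            content = TfidfVectorizer()
        keyphrase = self.keyphrase_vectorizer
        if keyphrase is None:
            keyphrase = TfidfVectorizer()
        # Fit fresh copies; learned state lives in trailing-underscore attributes.
        self.content_vectorizer_ = clone(content).fit(X["content"])
        self.keyphrase_vectorizer_ = clone(keyphrase).fit(X["KeyPhrases"].astype(str))
        return self

    def __sklearn_is_fitted__(self):
        # Consulted by check_is_fitted() in scikit-learn versions that support it.
        return hasattr(self, "content_vectorizer_") and hasattr(
            self, "keyphrase_vectorizer_"
        )

    def transform(self, X):
        check_is_fitted(self)
        a = self.content_vectorizer_.transform(X["content"]).toarray()
        b = self.keyphrase_vectorizer_.transform(X["KeyPhrases"].astype(str)).toarray()
        return np.hstack((a, b)).astype(np.float32)


# Toy data, made up for this sketch.
toy = pd.DataFrame({
    "content": ["red apple", "green pear", "red pear"],
    "KeyPhrases": ["fruit sweet", "fruit sour", "fruit"],
})

template = TwoColumnTfidf()
fitted = clone(template).fit(toy)   # cloning before fit is harmless here
features = fitted.transform(toy)
refit_needed = not clone(fitted).__sklearn_is_fitted__()
```

Because nothing is expected to be fitted before fit is called, such a transformer can be used as a Pipeline step inside RandomizedSearchCV or VotingClassifier; the TruncatedSVD step could be handled the same way.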