scikit-learn custom transformer is raising NotFitted Error

Describe the bug

After upgrading scikit-learn from 0.21.1 to 1.0.2, my custom transformer stopped working. What changed between these versions that could cause this, and is there a workaround? The code below reproduces the issue:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.tree import DecisionTreeClassifier

class CustomVectorizer(BaseEstimator, TransformerMixin):
    """This class will perform the Vectorization."""
    def __init__(self, custom_content=None, custom_keyphrases=None):
        self.custom_content = custom_content
        self.custom_keyphrases = custom_keyphrases
    
    def fit(self, X, y=None, *args, **kwargs):
        """fit"""
        return self

    def transform(self, X, y=None, **transform_params):
        """Transform"""
        tfidf_custom_content = self.custom_content.transform(X.content).todense().astype(np.float32)
        tfidf_custom_keyphrases = self.custom_keyphrases.transform(X.KeyPhrases.values.astype('U')).todense().astype(np.float32)

        tf_idf_X  = np.hstack((tfidf_custom_content, tfidf_custom_keyphrases))
        return tf_idf_X


class CustomSVD(BaseEstimator, TransformerMixin):
    """This class will perform the dimensionality reduction."""
    def __init__(self, tsvd=None, reduce_dim=True):
        self.tsvd = tsvd
        self.reduce_dim = reduce_dim
    
    def fit(self, X, y=None, *args, **kwargs):
        """fit"""
        return self

    def transform(self, X, y=None, **transform_params):
        """transform"""
        if self.reduce_dim:
            self.new_xtrain = self.tsvd.transform(X)
            return self.new_xtrain.astype(np.float32)
        else:
            return X.astype(np.float32)


example_dict = {"content": ["dfg sdfsd f rtygdfgdf sf sdf sdfdg df", "dfgdf sfdsd fs ertger sd g",
                            "dfgfdgdf fdgdf gfhfhrt", "fghgf c xzcvxwerkjwhx"],
                "KeyPhrases": ["sdfsd erfsd fsdf", " dfgdf ewrwe wef h dfh",
                                "fghfd wesdofjhcxlk sdf", "dfg dfg werwe"],
                "output":["pass", "fail", "pass", "fail"]}
df = pd.DataFrame(example_dict)
print(df)

X = df[["content", "KeyPhrases"]]
y = df[["output"]]

tf_content = TfidfVectorizer(sublinear_tf=True, smooth_idf=True, stop_words='english')
tf_keyphrase = TfidfVectorizer(sublinear_tf=True, smooth_idf=True, stop_words='english')

tfidf_content = tf_content.fit_transform(X.content).toarray()
tfidf_keyphrases = tf_keyphrase.fit_transform(X.KeyPhrases.values.astype('U')).toarray()

X_temp = np.hstack((tfidf_content, tfidf_keyphrases))

# On top of X_temp, apply TruncatedSVD. n_components would normally be chosen from an
# explained-variance threshold; here it is hardcoded and fitted directly.
tsvd = TruncatedSVD(n_components = 20, random_state=42)
X_new = tsvd.fit_transform(X_temp)

# pipeline
# Note: tf_content, tf_keyphrase and tsvd are already fitted and are passed to the
# custom vectorizer and custom SVD, hence the fit methods do nothing.
# `cls` is a classifier.
rf_pipline = Pipeline([
    ('vectorizer', CustomVectorizer(tf_content, tf_keyphrase)),
    ('reduce_dim', CustomSVD(tsvd=tsvd)),
    ('rf_classifier', RandomForestClassifier())])

rf_search = {
    'vectorizer': [CustomVectorizer(tf_content, tf_keyphrase)],
    'reduce_dim': [CustomSVD(tsvd=tsvd)],
    'rf_classifier': [RandomForestClassifier()],
    'rf_classifier__n_estimators': [10,20],
}

cls_pipeline = RandomizedSearchCV(rf_pipline, rf_search, n_iter=2, cv=2, verbose=1)

cls_pipeline.fit(X,y)
print(cls_pipeline.score(X,y))
print(cls_pipeline.predict_proba(X))

This worked as expected in scikit-learn 0.21.1 and produced the following output:

------------------------
>>>1.0
[[0.3 0.7]
 [0.6 0.4]
 [0.2 0.8]
 [0.8 0.2]]

In scikit-learn 1.0.2, however, I get the following error:

----------------------------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "sklearn_custom_transformer_issue.py", line 106, in <module>
    cls_pipeline.fit(X,y)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\model_selection\_search.py", line 926, in fit
    self.best_estimator_.fit(X, y, **fit_params)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 348, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\joblib\memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 893, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\base.py", line 855, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "sklearn_custom_transformer_issue.py", line 36, in transform
    tfidf_custom_content = self.custom_content.transform(X.content).todense().astype(np.float32)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\feature_extraction\text.py", line 2099, in transform
    check_is_fitted(self, msg="The TF-IDF vectorizer is not fitted")
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\utils\validation.py", line 1222, in check_is_fitted
    raise NotFittedError(msg % {"name": type(estimator).__name__})
sklearn.exceptions.NotFittedError: The TF-IDF vectorizer is not fitted

Also, when I define a second pipeline and combine both in a voting classifier as shown below, I get a NotFittedError in scikit-learn 1.0.2, although the same code worked with scikit-learn 0.21.1:


# another pipeline
dc_pipline1 = Pipeline([
    ('vectorizer', CustomVectorizer(tf_content, tf_keyphrase)),
    ('reduce_dim', CustomSVD(tsvd=tsvd)),
    ('dc_classifier', DecisionTreeClassifier())])

dc_search = {
    'vectorizer': [CustomVectorizer(tf_content, tf_keyphrase)],
    'reduce_dim': [CustomSVD(tsvd=tsvd)],
    'dc_classifier': [DecisionTreeClassifier()],
    'dc_classifier__max_depth': [4, 10],
}

cls_pipeline1 = RandomizedSearchCV(dc_pipline1, dc_search, n_iter=2, cv=2, verbose=1)

# VotingClassifier clones each estimator before fitting it.
Vot_cls = VotingClassifier(estimators=[('rf', rf_pipline),
                                       ('dt', dc_pipline1)],
                           voting='soft')

Vot_cls.fit(X, y)

----------------------------------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "temp.py", line 88, in <module>
    Vot_cls.fit(X, y)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\ensemble\_voting.py", line 292, in fit
    return super().fit(X, transformed_y, sample_weight)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\ensemble\_voting.py", line 74, in fit
    self.estimators_ = Parallel(n_jobs=self.n_jobs)(
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 1041, in __call__
    if self.dispatch_one_batch(iterator):
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\utils\fixes.py", line 222, in __call__
    return self.function(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\ensemble\_base.py", line 39, in _fit_single_estimator     
    estimator.fit(X, y)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 303, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\base.py", line 702, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "temp.py", line 22, in transform
    tfidf_custom_content = self.custom_content.transform(X.content).todense().astype(np.float32)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\feature_extraction\text.py", line 1872, in transform      
    check_is_fitted(self, msg='The TF-IDF vectorizer is not fitted')
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\utils\validation.py", line 1041, in check_is_fitted       
    raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.exceptions.NotFittedError: The TF-IDF vectorizer is not fitted

I am wondering what changes when the pipelines are fitted through the voting classifier. Does it clone the estimators, so that the fitted TF-IDF instances are no longer passed to the custom vectorizer?

Versions

scikit-learn=0.24.1 numpy=1.19.2 scipy=1.6.0 pandas=1.2.1 platform: Windows_x64 Python=3.6.10

Note: The code was working fine with scikit-learn 0.21.1, numpy 1.18.1, scipy 1.3.1 and pandas 0.25.1.

Reproducible code: code

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

glemaitre commented, Apr 22, 2021 (1 reaction)

Our Pipeline does not follow our own convention: https://github.com/scikit-learn/scikit-learn/issues/8157. We usually clone the parameters passed to the constructor; the only estimator that does not do that is Pipeline. However, the VotingClassifier will first clone each pipeline, including its inner estimators. Therefore the TfidfVectorizer gets cloned, and the clone is equivalent to an unfitted estimator.
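
The cloning behaviour described in this comment can be reproduced in isolation. The following is a minimal sketch, not the code from the issue: `PrefittedWrapper` and the sample documents are made up for illustration, and the wrapper only mimics how `CustomVectorizer` relies on a pre-fitted constructor argument.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["red apple pie", "green apple tart", "blue berry pie"]  # made-up sample data


class PrefittedWrapper(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for CustomVectorizer: relies on a pre-fitted parameter."""

    def __init__(self, vectorizer=None):
        self.vectorizer = vectorizer

    def fit(self, X, y=None):
        return self  # nothing is fitted here

    def transform(self, X):
        return self.vectorizer.transform(X)


fitted_tfidf = TfidfVectorizer().fit(docs)
wrapper = PrefittedWrapper(fitted_tfidf)
wrapper.transform(docs)  # works: the wrapped vectorizer is fitted

# clone() rebuilds the wrapper from get_params() and clones nested estimators too,
# so the copy holds a fresh, unfitted TfidfVectorizer.
cloned = clone(wrapper)
try:
    cloned.fit(docs).transform(docs)
    clone_works = True
except NotFittedError:
    clone_works = False

print(hasattr(wrapper.vectorizer, "vocabulary_"))
print(hasattr(cloned.vectorizer, "vocabulary_"))
print(clone_works)
```

Because the wrapper's `fit` does nothing, any meta-estimator that clones before fitting ends up calling `transform` on an unfitted vectorizer, which matches the tracebacks above.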

I assume that we should fix our Pipeline, but this is not straightforward since users are relying on this behaviour. If we have a Pipeline that clones its steps, then we should probably look at how to freeze estimators (https://github.com/scikit-learn/scikit-learn/issues/8370) so that they don’t get unfitted during cloning.
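
Until a freezing mechanism exists, one possible stopgap is to hide the fitted object from `clone`. The sketch below is an assumption-laden workaround, not an official scikit-learn feature: it relies on `clone` deep-copying constructor parameters that do not expose `get_params`. `FittedHolder` and `HeldVectorizer` are hypothetical names introduced only for this example.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["red apple pie", "green apple tart", "blue berry pie"]  # made-up sample data


class FittedHolder:
    """Hypothetical plain container. It has no get_params(), so clone()
    deep-copies it instead of rebuilding the estimator inside it."""

    def __init__(self, estimator):
        self.estimator = estimator


class HeldVectorizer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that reads a pre-fitted vectorizer from a holder."""

    def __init__(self, holder=None):
        self.holder = holder

    def fit(self, X, y=None):
        return self  # the held vectorizer is assumed to be fitted already

    def transform(self, X):
        return self.holder.estimator.transform(X)


held = HeldVectorizer(FittedHolder(TfidfVectorizer().fit(docs)))
cloned_held = clone(held)

# The clone carries a deep copy of the fitted vectorizer, so transform still works.
matrix = cloned_held.fit(docs).transform(docs)
print(matrix.shape[0])
```

The trade-off is that the held vectorizer never sees the training folds, so cross-validation scores computed this way can leak information from the data it was originally fitted on.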

adrinjalali commented, Aug 26, 2022 (0 reactions)

We still don’t have a minimal reproducible example here.

You can use __sklearn_is_fitted__ to check whether the sub-estimator is fitted and return true. But your code above loads stuff from a file, which it shouldn’t; that should be done in fit. I’m closing this and will re-open once we have a minimal reproducible example.
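
One way to follow this advice is to fit the sub-estimators inside `fit` and report the fitted state through `__sklearn_is_fitted__`. This is a minimal sketch under assumptions, not the maintainers' code: `FittingVectorizer` is a hypothetical rewrite of `CustomVectorizer`, and the small DataFrame is made-up data that reuses the column names from the issue.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils.validation import check_is_fitted


class FittingVectorizer(TransformerMixin, BaseEstimator):
    """Hypothetical variant of CustomVectorizer that fits its sub-estimators in fit()."""

    def __init__(self, custom_content=None, custom_keyphrases=None):
        self.custom_content = custom_content
        self.custom_keyphrases = custom_keyphrases

    def fit(self, X, y=None):
        # Constructor params stay untouched; fitted copies get a trailing underscore.
        self.custom_content_ = clone(self.custom_content).fit(X["content"])
        self.custom_keyphrases_ = clone(self.custom_keyphrases).fit(
            X["KeyPhrases"].values.astype("U"))
        return self

    def __sklearn_is_fitted__(self):
        # Consulted by check_is_fitted in scikit-learn versions that support this hook.
        return hasattr(self, "custom_content_") and hasattr(self, "custom_keyphrases_")

    def transform(self, X):
        check_is_fitted(self)
        content = self.custom_content_.transform(X["content"]).toarray()
        keyphrases = self.custom_keyphrases_.transform(
            X["KeyPhrases"].values.astype("U")).toarray()
        return np.hstack((content, keyphrases)).astype(np.float32)


df = pd.DataFrame({
    "content": ["red apple pie", "green apple tart", "blue berry pie", "sour cherry tart"],
    "KeyPhrases": ["apple pie", "apple tart", "berry pie", "cherry tart"],
})

vec = FittingVectorizer(TfidfVectorizer(), TfidfVectorizer())
# Cloning is harmless here because fit() recreates all fitted state.
features = clone(vec).fit(df).transform(df)
print(features.shape[0], features.dtype)
```

With this layout, Pipeline, RandomizedSearchCV and VotingClassifier can clone the transformer freely, since nothing depends on state that was fitted outside `fit`.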
