scikit-learn custom transformer is raising NotFittedError

Describe the bug

After updating scikit-learn from 0.21.1 to 1.0.2, I found that my custom transformer had stopped working. I wonder what changed in version 1.0.2 to cause this. Is there a workaround? The code below reproduces the issue:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.tree import DecisionTreeClassifier

class CustomVectorizer(BaseEstimator, TransformerMixin):
    """This class will perform the Vectorization."""
    def __init__(self, custom_content=None, custom_keyphrases=None):
        self.custom_content = custom_content
        self.custom_keyphrases = custom_keyphrases
    
    def fit(self, X, y=None, *args, **kwargs):
        """fit"""
        return self

    def transform(self, X, y=None, **transform_params):
        """Transform"""
        tfidf_custom_content = self.custom_content.transform(X.content).todense().astype(np.float32)
        tfidf_custom_keyphrases = self.custom_keyphrases.transform(X.KeyPhrases.values.astype('U')).todense().astype(np.float32)

        tf_idf_X  = np.hstack((tfidf_custom_content, tfidf_custom_keyphrases))
        return tf_idf_X

class CustomSVD(BaseEstimator, TransformerMixin):
    """This class will perform the dimensionality reduction."""
    def __init__(self, tsvd=None, reduce_dim=True):
        self.tsvd = tsvd
        self.reduce_dim = reduce_dim
    
    def fit(self, X, y=None, *args, **kwargs):
        """fit"""
        return self

    def transform(self, X, y=None, **transform_params):
        """transform"""
        if self.reduce_dim:
            self.new_xtrain = self.tsvd.transform(X)
            return self.new_xtrain.astype(np.float32)
        else:
            return X.astype(np.float32)

example_dict = {"content": ["dfg sdfsd f rtygdfgdf sf sdf sdfdg df", "dfgdf sfdsd fs ertger sd g",
                            "dfgfdgdf fdgdf gfhfhrt", "fghgf c xzcvxwerkjwhx"],
                "KeyPhrases": ["sdfsd erfsd fsdf", " dfgdf ewrwe wef h dfh",
                                "fghfd wesdofjhcxlk sdf", "dfg dfg werwe"],
                "output":["pass", "fail", "pass", "fail"]}
df = pd.DataFrame(example_dict)
print(df)

X = df[["content", "KeyPhrases"]]
y = df[["output"]]

tf_content = TfidfVectorizer(sublinear_tf=True, smooth_idf=True, stop_words='english')
tf_keyphrase = TfidfVectorizer(sublinear_tf=True, smooth_idf=True, stop_words='english')

tfidf_content = tf_content.fit_transform(X.content).toarray()
tfidf_keyphrases = tf_keyphrase.fit_transform(X.KeyPhrases.values.astype('U')).toarray()

X_temp = np.hstack((tfidf_content, tfidf_keyphrases))

# On top of X_temp I apply TruncatedSVD. Normally n_components is chosen from an
# explained-variance threshold; here I'm hardcoding n_components and fitting it.
tsvd = TruncatedSVD(n_components = 20, random_state=42)
X_new = tsvd.fit_transform(X_temp)

# pipeline
# Note: tf_content, tf_keyphrase and tsvd are already fitted and I'm passing them to
# the custom vectorizer and custom SVD, hence I'm not doing anything in the fit method.
# `cls` is a classifier.
rf_pipline = Pipeline([
    ('vectorizer', CustomVectorizer(tf_content, tf_keyphrase)),
    ('reduce_dim', CustomSVD(tsvd=tsvd)),
    ('rf_classifier', RandomForestClassifier())])

rf_search = {
    'vectorizer': [CustomVectorizer(tf_content, tf_keyphrase)],
    'reduce_dim': [CustomSVD(tsvd=tsvd)],
    'rf_classifier': [RandomForestClassifier()],
    'rf_classifier__n_estimators': [10,20],
}

cls_pipeline = RandomizedSearchCV(rf_pipline, rf_search, n_iter=2, cv=2, verbose=1)

cls_pipeline.fit(X,y)
print(cls_pipeline.score(X,y))
print(cls_pipeline.predict_proba(X))

This worked as expected in scikit-learn 0.21.1 and gave the output below:

------------------------
>>>1.0
[[0.3 0.7]
 [0.6 0.4]
 [0.2 0.8]
 [0.8 0.2]]

But in scikit-learn 1.0.2, I get the error below:

----------------------------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "sklearn_custom_transformer_issue.py", line 106, in <module>
    cls_pipeline.fit(X,y)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\model_selection\_search.py", line 926, in fit
    self.best_estimator_.fit(X, y, **fit_params)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 348, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\joblib\memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 893, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\base.py", line 855, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "sklearn_custom_transformer_issue.py", line 36, in transform
    tfidf_custom_content = self.custom_content.transform(X.content).todense().astype(np.float32)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\feature_extraction\text.py", line 2099, in transform
    check_is_fitted(self, msg="The TF-IDF vectorizer is not fitted")
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\utils\validation.py", line 1222, in check_is_fitted
    raise NotFittedError(msg % {"name": type(estimator).__name__})
sklearn.exceptions.NotFittedError: The TF-IDF vectorizer is not fitted

Also, when I define a new classifier and use the voting classifier as shown below, I get a NotFittedError in scikit-learn 1.0.2, although the same code worked with scikit-learn 0.21.1:


# another pipeline
dc_pipline1 = Pipeline([
    ('vectorizer', CustomVectorizer(tf_content, tf_keyphrase)),
    ('reduce_dim', CustomSVD(tsvd=tsvd)),
    ('dc_classifier', DecisionTreeClassifier())])

dc_search = {
    'vectorizer': [CustomVectorizer(tf_content, tf_keyphrase)],
    'reduce_dim': [CustomSVD(tsvd=tsvd)],
    'dc_classifier': [DecisionTreeClassifier()],
    'dc_classifier__max_depth': [4, 10],
}

cls_pipeline1 = RandomizedSearchCV(dc_pipline1, dc_search, n_iter=2, cv=2, verbose=1)

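# The original snippet referenced undefined names here (cls_pipline, cls_pipline1).
# Assumption: the two Pipeline objects defined above were meant, since the traceback
# below shows the voting classifier fitting a Pipeline directly.
Vot_cls = VotingClassifier(estimators=[('rf', rf_pipline),
                                       ('dt', dc_pipline1)],
                           voting='soft')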

Vot_cls.fit(X, y)

----------------------------------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "temp.py", line 88, in <module>
    Vot_cls.fit(X, y)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\ensemble\_voting.py", line 292, in fit
    return super().fit(X, transformed_y, sample_weight)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\ensemble\_voting.py", line 74, in fit
    self.estimators_ = Parallel(n_jobs=self.n_jobs)(
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 1041, in __call__
    if self.dispatch_one_batch(iterator):
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\utils\fixes.py", line 222, in __call__
    return self.function(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\ensemble\_base.py", line 39, in _fit_single_estimator     
    estimator.fit(X, y)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 303, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\base.py", line 702, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "temp.py", line 22, in transform
    tfidf_custom_content = self.custom_content.transform(X.content).todense().astype(np.float32)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\feature_extraction\text.py", line 1872, in transform      
    check_is_fitted(self, msg='The TF-IDF vectorizer is not fitted')
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\utils\validation.py", line 1041, in check_is_fitted       
    raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.exceptions.NotFittedError: The TF-IDF vectorizer is not fitted

I am wondering what changed when the pipeline is called through the voting classifier. Does it clone the estimator, so that the fitted TF-IDF instance is no longer passed to the custom vectorizer?

Versions

scikit-learn=0.24.1 numpy=1.19.2 scipy=1.6.0 pandas=1.2.1 platform: Windows_x64 Python=3.6.10

Note: The code was working fine with scikit-learn 0.21.1, numpy 1.18.1, scipy 1.3.1, pandas 0.25.1.

Top GitHub Comments

glemaitre commented, Apr 22, 2021

Our Pipeline does not follow our own convention: https://github.com/scikit-learn/scikit-learn/issues/8157 We usually clone the parameters given in the constructor. The only estimator that does not do that is Pipeline. However, the VotingClassifier will first clone each pipeline and its inner estimators. Therefore the TfidfVectorizer gets cloned, and the clone is equivalent to an unfitted estimator.

I assume that we should fix our Pipeline, but this is not straightforward since users are relying on this behavior. If we have a Pipeline that clones its steps, then we should probably look at how to freeze estimators (https://github.com/scikit-learn/scikit-learn/issues/8370) so that they don't get unfitted during cloning.
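
The cloning behavior described in this comment can be checked in isolation. The snippet below is a minimal sketch, not code from the issue: the `Wrapper` class and the `docs` list are hypothetical stand-ins for the reporter's `CustomVectorizer` and data. It uses `sklearn.base.clone`, which copies constructor parameters only, so a fitted `TfidfVectorizer` passed as a parameter comes back without its learned vocabulary.

```python
from sklearn.base import BaseEstimator, clone
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only used to fit the vectorizer.
docs = ["red apple pie", "green apple tart", "blue cheese pie"]


class Wrapper(BaseEstimator):
    """Hypothetical stand-in for a transformer that holds a pre-fitted estimator."""

    def __init__(self, inner=None):
        self.inner = inner


fitted = TfidfVectorizer().fit(docs)

# clone() rebuilds the estimator from get_params(); learned attributes are dropped.
cloned = clone(fitted)

# Estimator-valued parameters are cloned recursively, so the nested
# vectorizer inside the wrapper loses its fitted state as well.
wrapped_clone = clone(Wrapper(inner=fitted))

fitted_has_vocab = hasattr(fitted, "vocabulary_")
cloned_has_vocab = hasattr(cloned, "vocabulary_")
nested_has_vocab = hasattr(wrapped_clone.inner, "vocabulary_")

try:
    cloned.transform(docs)
    clone_raised = False
except NotFittedError:
    clone_raised = True

print(fitted_has_vocab, cloned_has_vocab, nested_has_vocab, clone_raised)
```

This mirrors what happens when `VotingClassifier` (or a search that refits the best estimator) clones a pipeline whose steps were constructed around already-fitted vectorizers.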

adrinjalali commented, Aug 26, 2022

We still don't have a minimal reproducible example here.

You can use __sklearn_is_fitted__ to check whether the sub-estimator is fitted and return true. But your code above loads stuff from a file, which it shouldn't; that should be done in fit. I'm closing this, and will re-open once we have a minimal reproducible example.
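
One way to follow the advice that the work "should be done in fit" is sketched below. This is an assumption about what a clone-safe version could look like, not the maintainers' or the reporter's code: `FittingVectorizer` and `make_pipeline_with_depth` are hypothetical names, the SVD step is left out for brevity, and the data is the toy frame from the issue. The unfitted vectorizers are passed as constructor parameters and fitted inside `fit`, with the learned state stored in trailing-underscore attributes, so cloning by `VotingClassifier` no longer loses anything.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


class FittingVectorizer(BaseEstimator, TransformerMixin):
    """Hypothetical variant of CustomVectorizer that fits its vectorizers in fit()."""

    def __init__(self, content_vectorizer=None, keyphrase_vectorizer=None):
        self.content_vectorizer = content_vectorizer
        self.keyphrase_vectorizer = keyphrase_vectorizer

    def fit(self, X, y=None):
        # Fit fresh copies so the constructor parameters stay untouched and
        # the learned state lives in trailing-underscore attributes.
        self.content_vectorizer_ = clone(self.content_vectorizer).fit(X["content"])
        self.keyphrase_vectorizer_ = clone(self.keyphrase_vectorizer).fit(
            X["KeyPhrases"].astype(str)
        )
        return self

    def transform(self, X):
        content = self.content_vectorizer_.transform(X["content"]).toarray()
        keyphrases = self.keyphrase_vectorizer_.transform(
            X["KeyPhrases"].astype(str)
        ).toarray()
        return np.hstack((content, keyphrases)).astype(np.float32)


def make_pipeline_with_depth(max_depth):
    """Hypothetical helper: build an unfitted pipeline around a decision tree."""
    return Pipeline([
        ("vectorizer", FittingVectorizer(TfidfVectorizer(), TfidfVectorizer())),
        ("classifier", DecisionTreeClassifier(max_depth=max_depth, random_state=0)),
    ])


df = pd.DataFrame({
    "content": ["dfg sdfsd f rtygdfgdf sf sdf sdfdg df", "dfgdf sfdsd fs ertger sd g",
                "dfgfdgdf fdgdf gfhfhrt", "fghgf c xzcvxwerkjwhx"],
    "KeyPhrases": ["sdfsd erfsd fsdf", " dfgdf ewrwe wef h dfh",
                   "fghfd wesdofjhcxlk sdf", "dfg dfg werwe"],
    "output": ["pass", "fail", "pass", "fail"],
})
X = df[["content", "KeyPhrases"]]
y = df["output"]

shallow = make_pipeline_with_depth(4)
deep = make_pipeline_with_depth(10)

# VotingClassifier clones both pipelines before fitting them; because nothing
# is pre-fitted, the clones are fitted from scratch and transform() works.
voter = VotingClassifier(estimators=[("dt4", shallow), ("dt10", deep)], voting="soft")
voter.fit(X, y)
predictions = voter.predict(X)
print(predictions)
```

The trade-off is that the vectorizers are refitted on every call to `fit`, including each cross-validation split, instead of being fitted once on the full data up front.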