Inconsistent numbers of samples issue with fit_params in CalibratedClassifierCV

Describe the bug

I am trying to use fit_params with CalibratedClassifierCV in scikit-learn v1.1, but fitting fails when the fit parameters are passed to the classifier.

I have 1000 rows.
I split them into train and validation sets of 800 and 200 rows respectively.
The validation part is passed to the eval_set parameter in fit_params, and I fit on the train part, which has 800 rows.
The train part is used for learning, and the hyperparameter optimization uses cross-validation with n_splits=5, so each fold has 160 rows (800/5=160). In the end I get ValueError: Found input variables with inconsistent numbers of samples: [640, 1]. The 640 appears to be 4/5 of the training data, i.e. the training subset of the inner CV, with the remaining 1/5 held out for evaluation since there are 5 folds.
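The fold arithmetic above can be checked with a small plain-Python sketch. The helper name `kfold_sizes` is hypothetical and only mirrors how an unshuffled K-fold split divides the rows; it is not a scikit-learn function.

```python
def kfold_sizes(n_samples, n_splits):
    """Return (train_size, test_size) for each fold of a K-fold split.

    The first n_samples % n_splits folds get one extra test row,
    which is how an even K-fold split distributes the remainder.
    """
    base, extra = divmod(n_samples, n_splits)
    sizes = []
    for fold in range(n_splits):
        test_size = base + (1 if fold < extra else 0)
        sizes.append((n_samples - test_size, test_size))
    return sizes


# 800 training rows split into 5 folds, as in the description above
sizes = kfold_sizes(800, 5)
print(sizes)
```

Every fold trains on 640 rows and evaluates on 160, which matches the 640 reported in the error message.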

What am I missing here? Where does it go wrong?

See details below.

Steps/Code to Reproduce

# Description
# This code generates pseudo-data for this test. PyTorch is needed.
# If you install your environment from a requirements file, the command below also pulls in PyTorch:
# pip install -r requirements.txt -f https://download.pytorch.org/whl/cu111/torch_stable.html

import random
import numpy as np
import pandas as pd
from datetime import datetime
from typing import List, Dict, Any

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, LabelBinarizer, LabelEncoder, OrdinalEncoder
from sklearn.model_selection import KFold, GroupKFold, GridSearchCV, train_test_split
from sklearn.calibration import CalibratedClassifierCV 

from pytorch_tabnet.tab_model import TabNetClassifier

import gc
import torch
torch.cuda.empty_cache()

# Generate random data: 20 features, id, label
df = pd.DataFrame()
size = 1000
df[f'id'] = [k for k in range(size)]
for c in range(1,11):
    df[f'feature{c}_float'] = [random.uniform(-100,100) for k in range(size)]
df[f'feature{c}_int'] = [random.randrange(0, 1000, 10) for k in range(size)]; c+=1
df[f'feature{c}_int'] = [random.randrange(-100, 100, 10) for k in range(size)]; c+=1
df[f'feature{c}_int'] = [random.randrange(2015, 2020, 1) for k in range(size)]; c+=1
df[f'feature{c}_int'] = [random.choice([-1, 1, np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_int'] = [random.choice([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_cat'] = [random.choice(["a", "b", "c", "d", "e", np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_cat'] = [random.choice(["red", "blue", "green", np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_cat'] = [random.choice(["yes", "no", "neutral", np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_cat'] = [random.choice(["animal", "human", np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_cat'] = [random.choice(["male", "female", "N/A", np.nan]) for k in range(size)]; c+=1
for col in range(15,20):
    df[f'feature{col}_cat'] = df[f'feature{col}_cat'].astype('category') # set type for categorical features
df['label'] = [random.choice([-1, 0, 1]) for k in range(size)]
model_features = set(df.drop(columns=['id','label']).columns)
#model_features = set(df.drop(columns=['label']).columns)
    
def make_model_pipeline(model_class, categoricals: List[str], numericals: List[str],
                        drops: List[str], model_parameters: Dict[str, Any]) -> Pipeline:
    
    model_preprocessing = ("preprocessing", ColumnTransformer([
        ('cat', Pipeline([('imputer', SimpleImputer(strategy='constant')), 
                          ('oenc', OrdinalEncoder(handle_unknown ='use_encoded_value',unknown_value = -1))
                          ]), categoricals),
        ('num', Pipeline([("scaler", RobustScaler()), 
                          ("imputer", SimpleImputer(strategy="median"))
                          ]), numericals),
        ('drop', 'drop', drops),
    ], remainder='drop'))
    
    calibrated_classifier = ("calibrated_classifier", CalibratedClassifierCV(
        base_estimator=model_class(**model_parameters), method='isotonic', cv=5))
    
    pipeline = Pipeline([model_preprocessing, calibrated_classifier])
    
    return pipeline

x_train, y_train = df.drop(columns='label'), df['label']

# Features
drop = sorted(set(x_train.columns) - set(x_train[model_features].columns))
cat = sorted(x_train[model_features].select_dtypes(include=['category']).columns)
num = sorted(set(x_train[model_features].columns) - set(cat))
use_features = sorted(set(cat).union(set(num)) - set(drop))

# Folds yearly
year = x_train["feature12_int"] # year
year_cv = GroupKFold(n_splits=year.nunique())

# Make pipeline
model_class = TabNetClassifier
model_parameters = {
    'n_d': 16, 'n_a': 16,
    'n_steps': 5, 
    'n_independent': 2, 
    'n_shared': 2, 
    'clip_value': 2.0, 
    'gamma': 1.5, 
    'lambda_sparse': 0.01
    }
param_grid = {
    'n_steps': [3,5], 
    'momentum': [0.3, 0.5]
    }  

opt_pipeline = make_model_pipeline(model_class, cat, num, drop,
                                   {k: v for k, v in model_parameters.items() if k not in param_grid})
opt_pipeline[1].base_estimator.set_params(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=0.02, weight_decay = 1e-5),
    scheduler_params = {"gamma": 0.95, "step_size": 20},
    scheduler_fn=torch.optim.lr_scheduler.StepLR, 
    epsilon=1e-15 
)

param_grid = {'calibrated_classifier__base_estimator__n_steps': [3, 5]}
param_grid = {'calibrated_classifier__base_estimator__momentum': [0.3, 0.5]}

# Data split
print(f"\nAll data: {x_train.shape} {y_train.shape}")
x_train_prep, x_valid_prep, y_train_prep, y_valid_prep = train_test_split(x_train, y_train, 
                                                                          test_size=0.20, 
                                                                          random_state=123)

# Preprocessing for eval_set
le = LabelEncoder() 
le.fit(y_train_prep)
y_train_prep, y_valid_prep = le.transform(y_train_prep), le.transform(y_valid_prep)

scc = opt_pipeline.get_params()['preprocessing'].transformers[0][1].named_steps['imputer']
scc.fit(x_train_prep[cat])
oenc = opt_pipeline.get_params()['preprocessing'].transformers[0][1].named_steps['oenc']

sc = opt_pipeline.get_params()['preprocessing'].transformers[1][1].named_steps['scaler']
sc.fit(x_train_prep[num])

imp = opt_pipeline.get_params()['preprocessing'].transformers[1][1].named_steps['imputer']
imp.fit(x_train_prep[num])

def preprocessing(x_t, x_v, cat, num, sc, scc, oenc, imp):
    # Preprocessing manually to have train/valid split for ANN
    def prep(prep, data, variables):
        df = pd.DataFrame(prep.transform(data[variables]),
                          columns=data[variables].columns,
                          index=data[variables].index).values
        return df
    # For train and validation
    x_t[cat] = prep(scc, x_t, cat)
    oenc.fit(x_t[cat])
    x_t[cat] = prep(oenc, x_t, cat)
    x_t[num] = prep(sc, x_t, num)
    x_t[num] = prep(imp, x_t, num)
    x_v[cat] = prep(scc, x_v, cat)
    x_v[cat] = prep(oenc, x_v, cat)
    x_v[num] = prep(sc, x_v, num)
    x_v[num] = prep(imp, x_v, num)
    return x_t, x_v

x_train_prep, x_valid_prep = preprocessing(
    x_train_prep,
    x_valid_prep,
    cat, num,
    sc, scc, oenc, imp
    )

# Find best params on whole dataset
model = GridSearchCV(estimator=opt_pipeline,
                        param_grid=param_grid,
                        cv=KFold(n_splits=5),  # 5 inner folds, as stated in the description
                        scoring='balanced_accuracy', 
                        refit=False, 
                        verbose=2)
fit_params = {}
fit_params['calibrated_classifier__eval_set']=[(x_valid_prep[use_features].values,y_valid_prep)]
fit_params['calibrated_classifier__eval_name']=['valid']
fit_params['calibrated_classifier__max_epochs']=100
fit_params['calibrated_classifier__patience']=10
fit_params['calibrated_classifier__batch_size']=32 
fit_params['calibrated_classifier__virtual_batch_size']=16 
fit_params['calibrated_classifier__drop_last']=False
#fit_params['calibrated_classifier__weights']=np.ones([y_train_prep.size]) / y_train_prep.size
model.fit(x_train_prep, y_train_prep, **fit_params) # ----> errors here.
# Fit model with best params
best_model_parameters = {k.split("__")[-1]: v for k, v in model.best_params_.items()}
pipeline = make_model_pipeline(model_class, sorted(set(cat) - set(drop)), 
                                sorted(set(num) - set(drop)), [], best_model_parameters)
pipeline.fit(x_train_prep, y_train_prep, **fit_params)
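Every key in fit_params above carries the calibrated_classifier__ prefix, so the pipeline hands all of these values to its last step. A minimal sketch of that routing follows, assuming the usual step__parameter naming convention. The helper `route_fit_params` is hypothetical and is not the Pipeline implementation.

```python
def route_fit_params(step_names, fit_params):
    """Group 'step__parameter' keys by the pipeline step they target."""
    routed = {name: {} for name in step_names}
    for key, value in fit_params.items():
        if "__" not in key:
            raise ValueError(f"Expected a 'step__parameter' key, got {key!r}")
        # Split only on the first '__' so nested parameter names stay intact
        step, param = key.split("__", 1)
        routed[step][param] = value
    return routed


fit_params = {
    "calibrated_classifier__eval_name": ["valid"],
    "calibrated_classifier__max_epochs": 100,
    "calibrated_classifier__patience": 10,
}
routed = route_fit_params(["preprocessing", "calibrated_classifier"], fit_params)
print(routed["calibrated_classifier"])
```

The preprocessing step receives nothing, and CalibratedClassifierCV.fit receives eval_name, max_epochs and patience as its own fit parameters.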

Expected Results

No error is expected, smooth learning process.

Actual Results

Traceback (most recent call last):
  File "/home/kabartay/sklearn_v1.1_test.py", line 203, in <module>
    model.fit(x_train_prep, y_train_prep, **fit_params)
  File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 875, in fit
    self._run_search(evaluate_candidates)
  File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 1375, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 852, in evaluate_candidates
    _warn_or_raise_about_fit_failures(out, self.error_score)
  File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 367, in _warn_or_raise_about_fit_failures
    raise ValueError(all_fits_failed_message)
ValueError: 
All the 10 fits failed.
It is is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/pipeline.py", line 382, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/calibration.py", line 283, in fit
    check_consistent_length(y, sample_aligned_params)
  File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/utils/validation.py", line 383, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [640, 1]

The error is raised by check_consistent_length in https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/validation.py#L365, which is called from CalibratedClassifierCV.fit at https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/calibration.py#L283
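Judging from the traceback, the 1 in [640, 1] is most likely the length of the eval_set list, which holds a single (X, y) tuple and is compared against the 640 labels of the inner training fold. The sketch below reproduces that comparison with a simplified stand-in. The function `check_same_length` and the placeholder data are assumptions for illustration, not scikit-learn's code.

```python
def check_same_length(*arrays):
    """Simplified stand-in: compare the outer lengths of the given objects."""
    lengths = [len(a) for a in arrays if a is not None]
    if len(set(lengths)) > 1:
        raise ValueError(
            "Found input variables with inconsistent numbers of samples: %r" % lengths
        )


y_fold = [0] * 640                          # labels of one inner training fold
x_valid = [[0.0] * 3 for _ in range(200)]   # placeholder validation features
y_valid = [0] * 200                         # placeholder validation labels
eval_set = [(x_valid, y_valid)]             # a list holding one (X, y) tuple

message = None
try:
    check_same_length(y_fold, eval_set)
except ValueError as exc:
    message = str(exc)
    print(message)
```

The validation set itself has 200 rows, but the check only sees the outer list of length 1, so any fit parameter that is not aligned with the training samples trips it.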

Versions

System:
    python: 3.9.12 (main, Apr  5 2022, 06:56:58)  [GCC 7.5.0]
executable: /home/utilisateur/anaconda3/envs/sklearn11/bin/python3
   machine: Linux-5.13.0-41-generic-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.1.0
          pip: 21.2.4
   setuptools: 49.2.0
        numpy: 1.21.0
        scipy: 1.8.0
       Cython: None
       pandas: 1.1.5
   matplotlib: 3.3.4
       joblib: 1.0.1
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/utilisateur/anaconda3/envs/sklearn11/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 12

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/utilisateur/anaconda3/envs/sklearn11/lib/python3.9/site-packages/numpy.libs/libopenblasp-r0-5bebc122.3.13.dev.so
        version: 0.3.13.dev
threading_layer: pthreads
   architecture: Haswell
    num_threads: 12

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/utilisateur/anaconda3/envs/sklearn11/lib/python3.9/site-packages/scipy.libs/libopenblasp-r0-8b9e111f.3.17.so
        version: 0.3.17
threading_layer: pthreads
   architecture: Haswell
    num_threads: 12

Top GitHub Comments

ogrisel commented, Jun 2, 2022

For the sake of consistency I think we should be lenient in this case and behave similarly in CalibratedClassifierCV as we do in GridSearchCV.

ogrisel commented, Jun 2, 2022

We already call _check_fit_params in _fit_classifier_calibrator_pair, so we implicitly assume that not all the fit params are sample-aligned. We should therefore probably remove the call to check_consistent_length on fit_params.values() in CalibratedClassifierCV.fit.
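The lenient handling discussed in these comments can be sketched in plain Python: slice only the values whose length matches the number of samples, and pass everything else through untouched. This is an illustration under stated assumptions, not scikit-learn's _check_fit_params. The helper name `lenient_fit_params` and the sample values are hypothetical.

```python
def lenient_fit_params(n_samples, fit_params, indices):
    """Slice sample-aligned values by indices; pass all other values through."""
    selected = {}
    for name, value in fit_params.items():
        is_sized = hasattr(value, "__len__") and not isinstance(value, (str, bytes, dict))
        if is_sized and len(value) == n_samples:
            # Looks sample-aligned: keep only the rows of the current fold
            selected[name] = [value[i] for i in indices]
        else:
            # Scalars, eval sets, names and similar values are forwarded as-is
            selected[name] = value
    return selected


fit_params = {
    "weights": [1.0 / 800] * 800,         # one value per training row
    "eval_set": [([[0.0]] * 200, [0] * 200)],  # placeholder validation data
    "max_epochs": 100,
}
train_indices = range(640)                # hypothetical inner training fold
out = lenient_fit_params(800, fit_params, train_indices)
print(len(out["weights"]), len(out["eval_set"]), out["max_epochs"])
```

With this approach only the per-row weights are cut down to the fold, while eval_set and max_epochs reach the underlying classifier unchanged.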