Inconsistent numbers of samples issue with fit_params in CalibratedClassifierCV
Describe the bug
Trying to use fit_params with CalibratedClassifierCV in v1.1, but the fit parameters fail validation when passed through to the classifier.
- I have 1000 rows.
- I split them into train and validation parts, 800 and 200 rows respectively.
- The validation part is passed to the eval_set parameter in fit_params, and I fit on the train part of 800 rows.
- The train part is used for learning, with cross-validation in the optimization using n_splits=5 splits, i.e., each fold holds 160 rows (800/5 = 160).
- Finally, I receive ValueError: Found input variables with inconsistent numbers of samples: [640, 1], where 640 seems to be 4/5 of the data, i.e., the sub-train part of the inner CV (the remaining 1/5 is held out for evaluation, since there are 5 folds).
What am I missing here? Where does it fail? See details below; a minimal sketch of the failing check follows this list.
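To see the shape mismatch in isolation, here is a minimal sketch of the check that trips (written against scikit-learn 1.1; the array sizes mirror the fold sizes above and are illustrative):

import numpy as np
from sklearn.utils.validation import check_consistent_length

y_inner_train = np.zeros(640)  # 4/5 of the 800 training rows in one inner fold
# eval_set is a list holding a single (X, y) tuple, so its top-level length is 1
eval_set = [(np.zeros((200, 20)), np.zeros(200))]

# CalibratedClassifierCV.fit (v1.1) runs this check on every fit_params value:
check_consistent_length(y_inner_train, eval_set)
# ValueError: Found input variables with inconsistent numbers of samples: [640, 1]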
Steps/Code to Reproduce
# Description
# This code generates pseudo-data for this test. PyTorch is needed.
# If you install dependencies from a requirements file, the command below also installs PyTorch:
# pip install -r requirements.txt -f https://download.pytorch.org/whl/cu111/torch_stable.html
import random
import numpy as np
import pandas as pd
from datetime import datetime
from typing import List, Dict, Any
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, LabelBinarizer, LabelEncoder, OrdinalEncoder
from sklearn.model_selection import KFold, GroupKFold, GridSearchCV, train_test_split
from sklearn.calibration import CalibratedClassifierCV
from pytorch_tabnet.tab_model import TabNetClassifier
import gc
import torch
torch.cuda.empty_cache()
# Generate random data: 20 features, id, label
df = pd.DataFrame()
size = 1000
df[f'id'] = [k for k in range(size)]
for c in range(1,11):
    df[f'feature{c}_float'] = [random.uniform(-100,100) for k in range(size)]
df[f'feature{c}_int'] = [random.randrange(0, 1000, 10) for k in range(size)]; c+=1
df[f'feature{c}_int'] = [random.randrange(-100, 100, 10) for k in range(size)]; c+=1
df[f'feature{c}_int'] = [random.randrange(2015, 2020, 1) for k in range(size)]; c+=1
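# feature12_int (years 2015-2019) is the grouping column used to build the GroupKFold splitter below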
df[f'feature{c}_int'] = [random.choice([-1, 1, np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_int'] = [random.choice([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_cat'] = [random.choice(["a", "b", "c", "d", "e", np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_cat'] = [random.choice(["red", "blue", "green", np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_cat'] = [random.choice(["yes", "no", "neutral", np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_cat'] = [random.choice(["animal", "human", np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_cat'] = [random.choice(["male", "female", "N/A", np.nan]) for k in range(size)]; c+=1
for col in range(15,20):
    df[f'feature{col}_cat'] = df[f'feature{col}_cat'].astype('category')  # set type for categorical features
df['label'] = [random.choice([-1, 0, 1]) for k in range(size)]
model_features = set(df.drop(columns=['id','label']).columns)
#model_features = set(df.drop(columns=['label']).columns)
def make_model_pipeline(model_class, categoricals: List[str], numericals: List[str],
                        drops: List[str], model_parameters: Dict[str, Any]) -> Pipeline:
    model_preprocessing = ("preprocessing", ColumnTransformer([
        ('cat', Pipeline([('imputer', SimpleImputer(strategy='constant')),
                          ('oenc', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
                          ]), categoricals),
        ('num', Pipeline([("scaler", RobustScaler()),
                          ("imputer", SimpleImputer(strategy="median"))
                          ]), numericals),
        ('drop', 'drop', drops),
    ], remainder='drop'))
    calibrated_classifier = ("calibrated_classifier", CalibratedClassifierCV(
        base_estimator=model_class(**model_parameters), method='isotonic', cv=5))
    pipeline = Pipeline([model_preprocessing, calibrated_classifier])
    return pipeline
x_train, y_train = df.drop(columns='label'), df['label']
# Features
drop = sorted(set(x_train.columns) - set(x_train[model_features].columns))
cat = sorted(x_train[model_features].select_dtypes(include=['category']).columns)
num = sorted(set(x_train[model_features].columns) - set(cat))
use_features = sorted(set(cat).union(set(num)) - set(drop))
# Folds yearly
year = x_train["feature12_int"] # year
year_cv = GroupKFold(n_splits=year.nunique())
# Make pipeline
model_class = TabNetClassifier
model_parameters = {
    'n_d': 16, 'n_a': 16,
    'n_steps': 5,
    'n_independent': 2,
    'n_shared': 2,
    'clip_value': 2.0,
    'gamma': 1.5,
    'lambda_sparse': 0.01
}
param_grid = {
    'n_steps': [3,5],
    'momentum': [0.3, 0.5]
}
opt_pipeline = make_model_pipeline(model_class, cat, num, drop,
                                   {k: v for k, v in model_parameters.items() if k not in param_grid})
opt_pipeline[1].base_estimator.set_params(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=0.02, weight_decay=1e-5),
    scheduler_params={"gamma": 0.95, "step_size": 20},
    scheduler_fn=torch.optim.lr_scheduler.StepLR,
    epsilon=1e-15
)
param_grid = {'calibrated_classifier__base_estimator__n_steps': [3, 5]}
param_grid = {'calibrated_classifier__base_estimator__momentum': [0.3, 0.5]}
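# NOTE: this second assignment overwrites the first, so only momentum is tuned here
# (2 candidates x 5 folds = the 10 fits reported in the traceback below)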
# Data split
print(f"\nAll data: {x_train.shape} {y_train.shape}")
x_train_prep, x_valid_prep, y_train_prep, y_valid_prep = train_test_split(x_train, y_train,
                                                                          test_size=0.20,
                                                                          random_state=123)
# Preprocessing for eval_set
le = LabelEncoder()
le.fit(y_train_prep)
y_train_prep, y_valid_prep = le.transform(y_train_prep), le.transform(y_valid_prep)
scc = opt_pipeline.get_params()['preprocessing'].transformers[0][1].named_steps['imputer']
scc.fit(x_train_prep[cat])
oenc = opt_pipeline.get_params()['preprocessing'].transformers[0][1].named_steps['oenc']
sc = opt_pipeline.get_params()['preprocessing'].transformers[1][1].named_steps['scaler']
sc.fit(x_train_prep[num])
imp = opt_pipeline.get_params()['preprocessing'].transformers[1][1].named_steps['imputer']
imp.fit(x_train_prep[num])
def preprocessing(x_t, x_v, cat, num, sc, scc, oenc, imp):
    # Preprocessing manually to have train/valid split for ANN
    def prep(prep, data, variables):
        df = pd.DataFrame(prep.transform(data[variables]),
                          columns=data[variables].columns,
                          index=data[variables].index).values
        return df
    # For train and validation
    x_t[cat] = prep(scc, x_t, cat)
    oenc.fit(x_t[cat])
    x_t[cat] = prep(oenc, x_t, cat)
    x_t[num] = prep(sc, x_t, num)
    x_t[num] = prep(imp, x_t, num)
    x_v[cat] = prep(scc, x_v, cat)
    x_v[cat] = prep(oenc, x_v, cat)
    x_v[num] = prep(sc, x_v, num)
    x_v[num] = prep(imp, x_v, num)
    return x_t, x_v
x_train_prep, x_valid_prep = preprocessing(
    x_train_prep,
    x_valid_prep,
    cat, num,
    sc, scc, oenc, imp
)
# Find best params on whole dataset
inner_cv_params = dict(n_splits=5)  # inner CV uses 5 splits, as described above
model = GridSearchCV(estimator=opt_pipeline,
                     param_grid=param_grid,
                     cv=KFold(**inner_cv_params),
                     scoring='balanced_accuracy',
                     refit=False,
                     verbose=2)
fit_params = {}
fit_params['calibrated_classifier__eval_set']=[(x_valid_prep[use_features].values,y_valid_prep)]
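# eval_set holds a single (X, y) tuple, so its top-level length is 1 -- the "1" in the error below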
fit_params['calibrated_classifier__eval_name']=['valid']
fit_params['calibrated_classifier__max_epochs']=100
fit_params['calibrated_classifier__patience']=10
fit_params['calibrated_classifier__batch_size']=32
fit_params['calibrated_classifier__virtual_batch_size']=16
fit_params['calibrated_classifier__drop_last']=False
#fit_params['calibrated_classifier__weights']=np.ones([y_train_prep.size]) / y_train_prep.size
model.fit(x_train_prep, y_train_prep, **fit_params) # ----> errors here.
# Fit model with best params
best_model_parameters = {k.split("__")[-1]: v for k, v in model.best_params_.items()}
pipeline = make_model_pipeline(model_class, sorted(set(cat) - set(drop)),
                               sorted(set(num) - set(drop)), [], best_model_parameters)
pipeline.fit(x_train_prep, y_train_prep, **fit_params)
Expected Results
No error is expected; the learning process should run smoothly.
Actual Results
Traceback (most recent call last):
File "/home/kabartay/sklearn_v1.1_test.py", line 203, in <module>
model.fit(x_train_prep, y_train_prep, **fit_params)
File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 875, in fit
self._run_search(evaluate_candidates)
File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 1375, in _run_search
evaluate_candidates(ParameterGrid(self.param_grid))
File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 852, in evaluate_candidates
_warn_or_raise_about_fit_failures(out, self.error_score)
File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 367, in _warn_or_raise_about_fit_failures
raise ValueError(all_fits_failed_message)
ValueError:
All the 10 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/pipeline.py", line 382, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/calibration.py", line 283, in fit
check_consistent_length(y, sample_aligned_params)
File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/utils/validation.py", line 383, in check_consistent_length
raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [640, 1]
The failure comes from the check at https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/validation.py#L365, which is invoked from https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/calibration.py#L283.
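For contrast, fit-param routing elsewhere in scikit-learn is more lenient. Below is a hedged sketch using the private helper _check_fit_params (an internal name in v1.1, so it may change between versions): values that are not sample-aligned with X are passed through untouched rather than length-checked:

import numpy as np
from sklearn.utils.validation import _check_fit_params  # private helper in v1.1

X = np.zeros((800, 20))
fit_params = {
    "max_epochs": 100,                                   # scalar: passed through
    "eval_set": [(np.zeros((200, 20)), np.zeros(200))],  # len 1 != 800: passed through
}
# Only values whose length matches X would be sliced by `indices`:
print(_check_fit_params(X, fit_params, indices=np.arange(640)))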
Versions
System:
python: 3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0]
executable: /home/utilisateur/anaconda3/envs/sklearn11/bin/python3
machine: Linux-5.13.0-41-generic-x86_64-with-glibc2.31
Python dependencies:
sklearn: 1.1.0
pip: 21.2.4
setuptools: 49.2.0
numpy: 1.21.0
scipy: 1.8.0
Cython: None
pandas: 1.1.5
matplotlib: 3.3.4
joblib: 1.0.1
threadpoolctl: 3.1.0
Built with OpenMP: True
threadpoolctl info:
user_api: openmp
internal_api: openmp
prefix: libgomp
filepath: /home/utilisateur/anaconda3/envs/sklearn11/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
version: None
num_threads: 12
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /home/utilisateur/anaconda3/envs/sklearn11/lib/python3.9/site-packages/numpy.libs/libopenblasp-r0-5bebc122.3.13.dev.so
version: 0.3.13.dev
threading_layer: pthreads
architecture: Haswell
num_threads: 12
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /home/utilisateur/anaconda3/envs/sklearn11/lib/python3.9/site-packages/scipy.libs/libopenblasp-r0-8b9e111f.3.17.so
version: 0.3.17
threading_layer: pthreads
architecture: Haswell
num_threads: 12
Top GitHub Comments
For the sake of consistency I think we should be lenient in this case and behave similarly in CalibratedClassifierCV as we do in GridSearchCV. We already call _check_fit_params in _fit_classifier_calibrator_pair, so we implicitly assume that not all the fit params are sample-aligned; we should probably remove the call to check_consistent_length on fit_params.values() in CalibratedClassifierCV.fit.
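For reference, a sketch of the two code paths the comment contrasts (paraphrased from the v1.1 sources, not the exact lines):

# CalibratedClassifierCV.fit (sklearn/calibration.py, v1.1) -- the strict check:
for sample_aligned_params in fit_params.values():
    check_consistent_length(y, sample_aligned_params)

# _fit_classifier_calibrator_pair (same module) -- the lenient path that already
# exists: _check_fit_params keeps non-sample-aligned values untouched, which is
# the behavior the comment proposes to rely on exclusively.
fit_params_train = _check_fit_params(X, fit_params, train)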