Predict fails with category error
Describe the bug
Autosklearn fails with a category data error on a particular dataset.
To Reproduce
Run:
import pandas as pd
from autosklearn.classification import AutoSklearnClassifier
from autosklearn.metrics import roc_auc

PATH_TRAIN = 'train.csv'  # download from https://www.kaggle.com/c/tabular-playground-series-mar-2021/data?select=train.csv
PATH_TEST = 'test.csv'  # download from https://www.kaggle.com/c/tabular-playground-series-mar-2021/data?select=test.csv

train = pd.read_csv(PATH_TRAIN)
test = pd.read_csv(PATH_TEST)

train_test = pd.concat([train, test])
for categorical_column in [f'cat{i}' for i in range(19)]:
    train_test[categorical_column] = train_test[categorical_column].astype('category')

train = train_test[:train.shape[0]]
test = train_test[train.shape[0]:]
target = train.target.values
train.drop(['id', 'target'], axis=1, inplace=True)

autosklearnml = AutoSklearnClassifier(
    time_left_for_this_task=600,
    metric=roc_auc,
    scoring_functions=[roc_auc]
)
autosklearnml.fit(X=train, y=target, dataset_name='tps_mar_2021')
preds_autosklearnml = autosklearnml.predict_proba(test[train.columns])
Logfile
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-b88885a0c2eb> in <module>
1 ## generate predictions
----> 2 preds_autosklearnml = autosklearnml.predict_proba(test[train.columns])
/opt/conda/lib/python3.7/site-packages/autosklearn/estimators.py in predict_proba(self, X, batch_size, n_jobs)
718 """
719 pred_proba = super().predict_proba(
--> 720 X, batch_size=batch_size, n_jobs=n_jobs)
721
722 # Check if all probabilities sum up to 1.
/opt/conda/lib/python3.7/site-packages/autosklearn/estimators.py in predict_proba(self, X, batch_size, n_jobs)
501 def predict_proba(self, X, batch_size=None, n_jobs=1):
502 return self.automl_.predict_proba(
--> 503 X, batch_size=batch_size, n_jobs=n_jobs)
504
505 def score(self, X, y):
/opt/conda/lib/python3.7/site-packages/autosklearn/automl.py in predict_proba(self, X, batch_size, n_jobs)
1655
1656 def predict_proba(self, X, batch_size=None, n_jobs=1):
-> 1657 return super().predict(X, batch_size=batch_size, n_jobs=n_jobs)
1658
1659
/opt/conda/lib/python3.7/site-packages/autosklearn/automl.py in predict(self, X, batch_size, n_jobs)
1169 models[identifier], X, batch_size, self._logger, self._task
1170 )
-> 1171 for identifier in self.ensemble_.get_selected_model_identifiers()
1172 )
1173
/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
1039 # remaining jobs.
1040 self._iterating = False
-> 1041 if self.dispatch_one_batch(iterator):
1042 self._iterating = self._original_iterator is not None
1043
/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
857 return False
858 else:
--> 859 self._dispatch(tasks)
860 return True
861
/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
775 with self._lock:
776 job_idx = len(self._jobs)
--> 777 job = self._backend.apply_async(batch, callback=cb)
778 # A job can complete so quickly than its callback is
779 # called before we get here, causing self._jobs to
/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
206 def apply_async(self, func, callback=None):
207 """Schedule a func to be run"""
--> 208 result = ImmediateResult(func)
209 if callback:
210 callback(result)
/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
570 # Don't delay the application, to avoid keeping the input
571 # arguments in memory
--> 572 self.results = batch()
573
574 def get(self):
/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
262 return [func(*args, **kwargs)
--> 263 for func, args, kwargs in self.items]
264
265 def __reduce__(self):
/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
262 return [func(*args, **kwargs)
--> 263 for func, args, kwargs in self.items]
264
265 def __reduce__(self):
/opt/conda/lib/python3.7/site-packages/autosklearn/automl.py in _model_predict(model, X, batch_size, logger, task)
94 prediction = model.predict_proba(X_, batch_size=batch_size)
95 else:
---> 96 prediction = model.predict_proba(X_)
97
98 # Check that all probability values lie between 0 and 1.
/opt/conda/lib/python3.7/site-packages/autosklearn/pipeline/classification.py in predict_proba(self, X, batch_size)
117 """
118 if batch_size is None:
--> 119 return super().predict_proba(X)
120
121 else:
/opt/conda/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
118
119 # lambda, but not partial, allows help() to work with update_wrapper
--> 120 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
121 # update the docstring of the returned function
122 update_wrapper(out, self.fn)
/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in predict_proba(self, X)
472 Xt = X
473 for _, name, transform in self._iter(with_final=False):
--> 474 Xt = transform.transform(Xt)
475 return self.steps[-1][-1].predict_proba(Xt)
476
/opt/conda/lib/python3.7/site-packages/autosklearn/pipeline/components/data_preprocessing/data_preprocessing.py in transform(self, X)
112 "while trying to fit the model."
113 )
--> 114 return self.column_transformer.transform(X)
115
116 def fit_transform(self, X, y=None):
/opt/conda/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in transform(self, X)
563 "data given during fit."
564 )
--> 565 Xs = self._fit_transform(X, None, _transform_one, fitted=True)
566 self._validate_output(Xs)
567
/opt/conda/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
442 message=self._log_message(name, idx, len(transformers)))
443 for idx, (name, trans, column, weight) in enumerate(
--> 444 self._iter(fitted=fitted, replace_strings=True), 1))
445 except ValueError as e:
446 if "Expected 2D array, got 1D array instead" in str(e):
/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
1039 # remaining jobs.
1040 self._iterating = False
-> 1041 if self.dispatch_one_batch(iterator):
1042 self._iterating = self._original_iterator is not None
1043
/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
857 return False
858 else:
--> 859 self._dispatch(tasks)
860 return True
861
/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
775 with self._lock:
776 job_idx = len(self._jobs)
--> 777 job = self._backend.apply_async(batch, callback=cb)
778 # A job can complete so quickly than its callback is
779 # called before we get here, causing self._jobs to
/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
206 def apply_async(self, func, callback=None):
207 """Schedule a func to be run"""
--> 208 result = ImmediateResult(func)
209 if callback:
210 callback(result)
/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
570 # Don't delay the application, to avoid keeping the input
571 # arguments in memory
--> 572 self.results = batch()
573
574 def get(self):
/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
262 return [func(*args, **kwargs)
--> 263 for func, args, kwargs in self.items]
264
265 def __reduce__(self):
/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
262 return [func(*args, **kwargs)
--> 263 for func, args, kwargs in self.items]
264
265 def __reduce__(self):
/opt/conda/lib/python3.7/site-packages/sklearn/utils/fixes.py in __call__(self, *args, **kwargs)
220 def __call__(self, *args, **kwargs):
221 with config_context(**self.config):
--> 222 return self.function(*args, **kwargs)
/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in _transform_one(transformer, X, y, weight, **fit_params)
731
732 def _transform_one(transformer, X, y, weight, **fit_params):
--> 733 res = transformer.transform(X)
734 # if we have a weight for this transformer, multiply output
735 if weight is None:
/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in _transform(self, X)
558 Xt = X
559 for _, _, transform in self._iter():
--> 560 Xt = transform.transform(Xt)
561 return Xt
562
/opt/conda/lib/python3.7/site-packages/autosklearn/pipeline/components/data_preprocessing/category_shift/category_shift.py in transform(self, X)
25 if self.preprocessor is None:
26 raise NotImplementedError()
---> 27 return self.preprocessor.transform(X)
28
29 def fit_transform(self, X, y=None):
/opt/conda/lib/python3.7/site-packages/autosklearn/pipeline/implementations/CategoryShift.py in transform(self, X)
31
32 def transform(self, X):
---> 33 X = self._convert_and_check_X(X)
34 # Increment everything by three to account for the fact that
35 # np.NaN will get an index of two, and coalesced values will get index of
/opt/conda/lib/python3.7/site-packages/autosklearn/pipeline/implementations/CategoryShift.py in _convert_and_check_X(self, X)
17 # Check if data is numeric and positive
18 if X_data.dtype.kind not in set('buif') or np.nanmin(X_data) < 0:
---> 19 raise ValueError('Categories should be non-negative numbers. '
20 'NOTE: floats will be casted to integers.')
21
ValueError: Categories should be non-negative numbers. NOTE: floats will be casted to integers.
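The condition that raises here can be mirrored outside auto-sklearn to see which inputs it rejects. The sketch below is based only on the two source lines visible in the traceback (CategoryShift.py, lines 18 and 19). `passes_category_check` is a hypothetical helper name and is not part of auto-sklearn.

```python
import numpy as np


def passes_category_check(values):
    """Mirror the condition shown in the traceback (hypothetical helper).

    The traceback shows a ValueError being raised when the dtype kind is not
    one of b, u, i, f, or when the NaN-ignoring minimum is negative.
    """
    arr = np.asarray(values)
    # Non-numeric data (for example strings) fails the dtype-kind test.
    if arr.dtype.kind not in set("buif"):
        return False
    # Any negative entry fails the minimum test; NaN values are ignored.
    return not (np.nanmin(arr) < 0)
```

Under this reading, a column that reaches the transformer with a negative code or with a non-numeric dtype would trigger the error, while NaN on its own would not.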
Environment and installation:
Environment is as per this docker on Kaggle
Auto-sklearn version: 0.12.6
Comments
- This same code works on other datasets, even those that have categorical data.
- The first three rows of the test data that fail are the 55036th, 87124th and 89318th (indices 55035, 87123, 89317 of the test dataframe)
- Seems to be same as #970
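One way to inspect the failing rows is to check whether they hold category values that never occur in the training rows. This is only a guess at the cause and is not confirmed anywhere in this issue. `rows_with_unseen_categories` is a hypothetical helper written for illustration.

```python
import pandas as pd


def rows_with_unseen_categories(train, test, columns):
    """Flag test rows holding a non-missing value that is absent from train.

    Hypothetical diagnostic helper; it does not use auto-sklearn at all.
    """
    mask = pd.Series(False, index=test.index)
    for col in columns:
        # Values actually observed in the training rows for this column.
        seen = set(train[col].dropna().unique())
        # A test value counts as unseen if it is present but not in `seen`.
        mask |= ~test[col].isin(seen) & test[col].notna()
    return mask
```

With the dataframes from the script above, `test[rows_with_unseen_categories(train, test, [f'cat{i}' for i in range(19)])]` would list the candidate rows, which can then be compared against the indices reported in the comments.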
Issue Analytics
- State:
- Created 2 years ago
- Comments: 6 (2 by maintainers)
Thanks!
Thanks a lot for reporting this issue. We have currently started revamping a lot of internals (see #1135) and will continue removing a lot of custom code such as the failing code mentioned here. We’ll check back on this issue once this refactor is done.