
Predict fails with category error

See original GitHub issue

Describe the bug

auto-sklearn's predict_proba fails with a categorical data ValueError on a particular dataset.

To Reproduce

Run:

import pandas as pd

from autosklearn.classification import AutoSklearnClassifier
from autosklearn.metrics import roc_auc

# Data from the Kaggle Tabular Playground Series - Mar 2021 competition
PATH_TRAIN = 'train.csv'  # download from https://www.kaggle.com/c/tabular-playground-series-mar-2021/data?select=train.csv
PATH_TEST = 'test.csv'  # download from https://www.kaggle.com/c/tabular-playground-series-mar-2021/data?select=test.csv

train = pd.read_csv(PATH_TRAIN)
test = pd.read_csv(PATH_TEST)

# Concatenate train and test so both share the same category levels
train_test = pd.concat([train, test])

# Cast the 19 categorical columns (cat0 ... cat18) to the pandas 'category' dtype
for categorical_column in [f'cat{i}' for i in range(19)]:
    train_test[categorical_column] = train_test[categorical_column].astype('category')

# Split back into the original train and test parts
train = train_test[:train.shape[0]]
test = train_test[train.shape[0]:]

target = train.target.values
train.drop(['id', 'target'], axis=1, inplace=True)

autosklearnml = AutoSklearnClassifier(
    time_left_for_this_task=600,
    metric=roc_auc,
    scoring_functions=[roc_auc]
)

autosklearnml.fit(X=train, y=target, dataset_name='tps_mar_2021')

# Fails here with "ValueError: Categories should be non-negative numbers."
preds_autosklearnml = autosklearnml.predict_proba(test[train.columns])
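
The failing check (visible at the bottom of the traceback below) rejects categorical data that is non-numeric or contains negative values. As a hedged diagnostic sketch, assuming the train/test frames from the script above, one way to look for a likely trigger is to list category levels that occur only in the test rows, since levels unseen during fit are a common cause of this kind of encoding error (cf. #970). This is speculation about the cause, not a confirmed diagnosis:

# Hedged diagnostic (not part of the original report): list category levels
# that appear only in the test rows. Unseen levels at predict time are a
# plausible trigger for the error below, but this is an assumption.
cat_columns = [f'cat{i}' for i in range(19)]

for col in cat_columns:
    only_in_test = set(test[col].unique()) - set(train[col].unique())
    if only_in_test:
        print(f'{col}: levels only present in test: {sorted(only_in_test)}')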

Logfile

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-b88885a0c2eb> in <module>
      1 ## generate predictions
----> 2 preds_autosklearnml = autosklearnml.predict_proba(test[train.columns])

/opt/conda/lib/python3.7/site-packages/autosklearn/estimators.py in predict_proba(self, X, batch_size, n_jobs)
    718         """
    719         pred_proba = super().predict_proba(
--> 720             X, batch_size=batch_size, n_jobs=n_jobs)
    721 
    722         # Check if all probabilities sum up to 1.

/opt/conda/lib/python3.7/site-packages/autosklearn/estimators.py in predict_proba(self, X, batch_size, n_jobs)
    501     def predict_proba(self, X, batch_size=None, n_jobs=1):
    502         return self.automl_.predict_proba(
--> 503              X, batch_size=batch_size, n_jobs=n_jobs)
    504 
    505     def score(self, X, y):

/opt/conda/lib/python3.7/site-packages/autosklearn/automl.py in predict_proba(self, X, batch_size, n_jobs)
   1655 
   1656     def predict_proba(self, X, batch_size=None, n_jobs=1):
-> 1657         return super().predict(X, batch_size=batch_size, n_jobs=n_jobs)
   1658 
   1659 

/opt/conda/lib/python3.7/site-packages/autosklearn/automl.py in predict(self, X, batch_size, n_jobs)
   1169                 models[identifier], X, batch_size, self._logger, self._task
   1170             )
-> 1171             for identifier in self.ensemble_.get_selected_model_identifiers()
   1172         )
   1173 

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
   1039             # remaining jobs.
   1040             self._iterating = False
-> 1041             if self.dispatch_one_batch(iterator):
   1042                 self._iterating = self._original_iterator is not None
   1043 

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    857                 return False
    858             else:
--> 859                 self._dispatch(tasks)
    860                 return True
    861 

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
    775         with self._lock:
    776             job_idx = len(self._jobs)
--> 777             job = self._backend.apply_async(batch, callback=cb)
    778             # A job can complete so quickly than its callback is
    779             # called before we get here, causing self._jobs to

/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    206     def apply_async(self, func, callback=None):
    207         """Schedule a func to be run"""
--> 208         result = ImmediateResult(func)
    209         if callback:
    210             callback(result)

/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    570         # Don't delay the application, to avoid keeping the input
    571         # arguments in memory
--> 572         self.results = batch()
    573 
    574     def get(self):

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    262             return [func(*args, **kwargs)
--> 263                     for func, args, kwargs in self.items]
    264 
    265     def __reduce__(self):

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    262             return [func(*args, **kwargs)
--> 263                     for func, args, kwargs in self.items]
    264 
    265     def __reduce__(self):

/opt/conda/lib/python3.7/site-packages/autosklearn/automl.py in _model_predict(model, X, batch_size, logger, task)
     94                 prediction = model.predict_proba(X_, batch_size=batch_size)
     95             else:
---> 96                 prediction = model.predict_proba(X_)
     97 
     98             # Check that all probability values lie between 0 and 1.

/opt/conda/lib/python3.7/site-packages/autosklearn/pipeline/classification.py in predict_proba(self, X, batch_size)
    117         """
    118         if batch_size is None:
--> 119             return super().predict_proba(X)
    120 
    121         else:

/opt/conda/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    118 
    119         # lambda, but not partial, allows help() to work with update_wrapper
--> 120         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    121         # update the docstring of the returned function
    122         update_wrapper(out, self.fn)

/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in predict_proba(self, X)
    472         Xt = X
    473         for _, name, transform in self._iter(with_final=False):
--> 474             Xt = transform.transform(Xt)
    475         return self.steps[-1][-1].predict_proba(Xt)
    476 

/opt/conda/lib/python3.7/site-packages/autosklearn/pipeline/components/data_preprocessing/data_preprocessing.py in transform(self, X)
    112                              "while trying to fit the model."
    113                              )
--> 114         return self.column_transformer.transform(X)
    115 
    116     def fit_transform(self, X, y=None):

/opt/conda/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in transform(self, X)
    563                 "data given during fit."
    564             )
--> 565         Xs = self._fit_transform(X, None, _transform_one, fitted=True)
    566         self._validate_output(Xs)
    567 

/opt/conda/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
    442                     message=self._log_message(name, idx, len(transformers)))
    443                 for idx, (name, trans, column, weight) in enumerate(
--> 444                         self._iter(fitted=fitted, replace_strings=True), 1))
    445         except ValueError as e:
    446             if "Expected 2D array, got 1D array instead" in str(e):

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
   1039             # remaining jobs.
   1040             self._iterating = False
-> 1041             if self.dispatch_one_batch(iterator):
   1042                 self._iterating = self._original_iterator is not None
   1043 

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    857                 return False
    858             else:
--> 859                 self._dispatch(tasks)
    860                 return True
    861 

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
    775         with self._lock:
    776             job_idx = len(self._jobs)
--> 777             job = self._backend.apply_async(batch, callback=cb)
    778             # A job can complete so quickly than its callback is
    779             # called before we get here, causing self._jobs to

/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    206     def apply_async(self, func, callback=None):
    207         """Schedule a func to be run"""
--> 208         result = ImmediateResult(func)
    209         if callback:
    210             callback(result)

/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    570         # Don't delay the application, to avoid keeping the input
    571         # arguments in memory
--> 572         self.results = batch()
    573 
    574     def get(self):

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    262             return [func(*args, **kwargs)
--> 263                     for func, args, kwargs in self.items]
    264 
    265     def __reduce__(self):

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    262             return [func(*args, **kwargs)
--> 263                     for func, args, kwargs in self.items]
    264 
    265     def __reduce__(self):

/opt/conda/lib/python3.7/site-packages/sklearn/utils/fixes.py in __call__(self, *args, **kwargs)
    220     def __call__(self, *args, **kwargs):
    221         with config_context(**self.config):
--> 222             return self.function(*args, **kwargs)

/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in _transform_one(transformer, X, y, weight, **fit_params)
    731 
    732 def _transform_one(transformer, X, y, weight, **fit_params):
--> 733     res = transformer.transform(X)
    734     # if we have a weight for this transformer, multiply output
    735     if weight is None:

/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in _transform(self, X)
    558         Xt = X
    559         for _, _, transform in self._iter():
--> 560             Xt = transform.transform(Xt)
    561         return Xt
    562 

/opt/conda/lib/python3.7/site-packages/autosklearn/pipeline/components/data_preprocessing/category_shift/category_shift.py in transform(self, X)
     25         if self.preprocessor is None:
     26             raise NotImplementedError()
---> 27         return self.preprocessor.transform(X)
     28 
     29     def fit_transform(self, X, y=None):

/opt/conda/lib/python3.7/site-packages/autosklearn/pipeline/implementations/CategoryShift.py in transform(self, X)
     31 
     32     def transform(self, X):
---> 33         X = self._convert_and_check_X(X)
     34         # Increment everything by three to account for the fact that
     35         # np.NaN will get an index of two, and coalesced values will get index of

/opt/conda/lib/python3.7/site-packages/autosklearn/pipeline/implementations/CategoryShift.py in _convert_and_check_X(self, X)
     17         # Check if data is numeric and positive
     18         if X_data.dtype.kind not in set('buif') or np.nanmin(X_data) < 0:
---> 19             raise ValueError('Categories should be non-negative numbers. '
     20                              'NOTE: floats will be casted to integers.')
     21 

ValueError: Categories should be non-negative numbers. NOTE: floats will be casted to integers.
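
For reference, the guard that raises this error is visible in the last frame above: CategoryShift._convert_and_check_X rejects any input whose dtype kind is not boolean/unsigned/signed-integer/float ('buif') or that contains a negative value. A minimal sketch of that condition, paraphrased from the traceback (not a public auto-sklearn API):

import numpy as np

# Paraphrase of the guard in CategoryShift._convert_and_check_X
# (see the last traceback frame above); not a public auto-sklearn API.
def would_fail(X_data):
    return X_data.dtype.kind not in set('buif') or np.nanmin(X_data) < 0

print(would_fail(np.array([[0., 1.], [2., np.nan]])))  # False: non-negative floats pass
print(would_fail(np.array([[0., -1.]])))               # True: any negative value trips the check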

Environment and installation:

The environment is as per this Docker image on Kaggle.

Auto-sklearn version: 0.12.6

Comments

  • This same code works on other datasets, even those that have categorical data.
  • The first three rows of test data that fail are the 55036th, 87124th, and 89318th (indices 55035, 87123, 89317 of the test dataframe); see the inspection sketch after this list.
  • Seems to be the same as #970.
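
As a small, hedged sketch for the second bullet above (assuming the test DataFrame from the reproduction script), the reported rows can be inspected directly to see which categorical values they carry:

# Positional indices of the first failing test rows, taken from the comment above.
failing_indices = [55035, 87123, 89317]
cat_columns = [f'cat{i}' for i in range(19)]

# .iloc is positional, so this works even though `test` kept its original index
print(test.iloc[failing_indices][cat_columns])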

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments:6 (2 by maintainers)

Top GitHub Comments

1 reaction
TheGBG commented, Jun 25, 2021

Thanks!

1 reaction
mfeurer commented, May 4, 2021

Thanks a lot for reporting this issue. We have recently started revamping a lot of the internals (see #1135) and will continue removing a lot of custom code, such as the failing code mentioned here. We’ll check back on this issue once this refactor is done.

Read more comments on GitHub >
