Woodwork type inference throws a "different data types" error when trying to predict labels on test data

Versions: evalml = "0.43.0", woodwork = "0.11.1"

I am using EvalML for a benchmark, and the chosen datasets include higgs and kick. When I split a dataset and run the analysis, I get the error "Input X data types are different from the input types the pipeline was fitted on." After looking into it, I realised that the schema Woodwork infers gives some test-set variables a different type from the one assigned to them during the AutoML search and fitting. My guess is that the crash happens because extra variables in the test set are classified as Unknown type.

Code Sample, a copy-pastable example to reproduce your bug.

To replicate this error, I downloaded the `.csv` files from the OpenML repository and performed an 80/20 stratified split on each dataset.

import pandas as pd
from evalml import automl
from sklearn.model_selection import train_test_split
from evalml.utils import infer_feature_types

dataset = pd.read_csv("Binary Classification/phpZLgL9q.csv")
target = dataset[dataset.columns[-1]]  # the target is the last column
data = dataset.drop(columns=dataset.columns[-1])  # all remaining columns are features
train_data, test_data, train_target, test_target = train_test_split(
    data, target, test_size=0.2, random_state=42, stratify=target
)

eval_automl = automl.AutoMLSearch(
    X_train=train_data,
    y_train=train_target,
    problem_type="binary",
    objective="f1",
    max_time=30,
    ensembling=False,
    verbose=True,
)
eval_automl.search()
eval_automl.best_pipeline.predict(test_data)

You can see the EvalML logs below; I omit most of the training logs, which go as expected. For both datasets, I get the same error:

Removing columns ['MMRAcquisitionRetailAveragePrice', 'MMRAcquisitonRetailCleanPrice', 'MMRCurrentRetailAveragePrice', 'MMRCurrentRetailCleanPrice'] because they are of 'Unknown' type
Generating pipelines to search over...
...
Search finished after 00:37            
Best pipeline: Elastic Net Classifier w/ Label Encoder + Drop Columns Transformer + Imputer + One Hot Encoder + Undersampler + Standard Scaler
Best pipeline F1: 0.410122
# in the predict() phase the error occurs
File "ww_tester.py", line 22, in <module>
    eval_automl.best_pipeline.predict(test_data)
  File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/pipeline_meta.py", line 33, in _check_for_fit
    return method(self, *args, **kwargs)
  File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/classification_pipeline.py", line 116, in predict
    _predictions = self._predict(X, objective=objective)
  File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/binary_classification_pipeline.py", line 72, in _predict
    ypred_proba = self.predict_proba(X)
  File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/pipeline_meta.py", line 33, in _check_for_fit
    return method(self, *args, **kwargs)
  File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/binary_classification_pipeline.py", line 87, in predict_proba
    return super().predict_proba(X)
  File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/pipeline_meta.py", line 33, in _check_for_fit
    return method(self, *args, **kwargs)
  File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/classification_pipeline.py", line 141, in predict_proba
    X = self.transform_all_but_final(X, y=None)
  File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/pipeline_base.py", line 271, in transform_all_but_final
    return self.component_graph.transform_all_but_final(X, y=y)
  File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/component_graph.py", line 259, in transform_all_but_final
    features, _ = self._fit_transform_features_helper(False, X, y)
  File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/component_graph.py", line 282, in _fit_transform_features_helper
    evaluate_training_only_components=needs_fitting,
  File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/component_graph.py", line 393, in _transform_features
    "Input X data types are different from the input types the pipeline was fitted on."
ValueError: Input X data types are different from the input types the pipeline was fitted on.

If I run the Woodwork schema inference function outside of EvalML, the train and test sets end up with different sets of variables identified as Unknown type, hence the error.

import pandas as pd
from sklearn.model_selection import train_test_split
from evalml.utils import infer_feature_types

dataset = pd.read_csv("kick.csv")
target = dataset[dataset.columns[-1]]  # the target is the last column
data = dataset.drop(columns=dataset.columns[-1])  # all remaining columns are features
train_data, test_data, train_target, test_target = train_test_split(
    data, target, test_size=0.2, random_state=42, stratify=target
)
ww_train = infer_feature_types(train_data)
ww_test = infer_feature_types(test_data)
unknown_columns = list(ww_train.ww.select("unknown", return_schema=True).columns)
unknown_columns_test = list(ww_test.ww.select("unknown", return_schema=True).columns)
print(unknown_columns)
print(unknown_columns_test)

>>> ['MMRAcquisitionRetailAveragePrice', 'MMRAcquisitonRetailCleanPrice', 'MMRCurrentRetailAveragePrice', 'MMRCurrentRetailCleanPrice']
>>> ['MMRAcquisitionAuctionAveragePrice', 'MMRAcquisitionAuctionCleanPrice', 'MMRAcquisitionRetailAveragePrice', 'MMRAcquisitonRetailCleanPrice', 'MMRCurrentAuctionAveragePrice', 'MMRCurrentAuctionCleanPrice', 'MMRCurrentRetailAveragePrice', 'MMRCurrentRetailCleanPrice']

# same code for the higgs dataset, the prints show:
>>> Columns (19,20,21,22,23,24,25,26,27) have mixed types. Specify dtype option on import or set low_memory=False.
>>> ['m_jj', 'm_jjj', 'm_lv', 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']
>>> ['jet4phi', 'm_jj', 'm_jjj', 'm_lv', 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']
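
The mismatch between the two printed lists for the kick dataset can be made explicit with a small comparison. This is only a sketch: the lists are copied from the output above, and `only_in_second` is a hypothetical helper name, not part of EvalML or Woodwork.

```python
# Lists copied from the printed output for the kick dataset.
unknown_train = [
    "MMRAcquisitionRetailAveragePrice",
    "MMRAcquisitonRetailCleanPrice",
    "MMRCurrentRetailAveragePrice",
    "MMRCurrentRetailCleanPrice",
]
unknown_test = [
    "MMRAcquisitionAuctionAveragePrice",
    "MMRAcquisitionAuctionCleanPrice",
    "MMRAcquisitionRetailAveragePrice",
    "MMRAcquisitonRetailCleanPrice",
    "MMRCurrentAuctionAveragePrice",
    "MMRCurrentAuctionCleanPrice",
    "MMRCurrentRetailAveragePrice",
    "MMRCurrentRetailCleanPrice",
]


def only_in_second(first, second):
    """Return the items of `second` that are missing from `first`, keeping order."""
    seen = set(first)
    return [name for name in second if name not in seen]


# Columns typed as Unknown in the test split but not in the training split.
extra_in_test = only_in_second(unknown_train, unknown_test)
print(extra_in_test)
```

For the kick lists this leaves the four "Auction" price columns; for higgs, the same comparison would leave only 'jet4phi'.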

It is not clear to me why this error occurs. I have looked into the data and the splits and found no major differences in their content that would justify the difference in dtype identification.
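
Since the pandas warning quoted above mentions columns with mixed types, one possible check is to count which Python value types actually occur in each object-dtype column of a split. The following is a minimal sketch under that assumption; `value_type_counts` is a hypothetical helper and the toy frame is made-up data, not taken from kick or higgs.

```python
import pandas as pd


def value_type_counts(df):
    """For each object-dtype column, count the Python types of its non-null values."""
    report = {}
    for col in df.select_dtypes(include="object").columns:
        type_names = df[col].dropna().map(lambda value: type(value).__name__)
        report[col] = type_names.value_counts().to_dict()
    return report


# Made-up toy frame standing in for one split; "price" mixes floats and strings.
toy = pd.DataFrame({"price": [1.5, "2.5", "?", 4.0], "n": [1, 2, 3, 4]})
print(value_type_counts(toy))
```

Running the helper on both `train_data` and `test_data` and comparing the two reports might show whether the splits really hold different kinds of values in the affected columns.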

Any help or insights would be appreciated.

Top GitHub Comments

chukarsten commented, May 17, 2022

Sounds good! Sorry it took so long to look into it. Thanks for using EvalML and Woodwork! Check out our latest versions: we're up to EvalML 0.52.0 and WW 0.16.3 now!

iXanthos commented, May 17, 2022

Hmm, the data could possibly be corrupted by the UTF-8 decoding I do, although no errors show when running it. I need .csv files so that others can store and review the data. The good news is that I saved the train/test partitions as .csv and then loaded them back as dataframes to see if the error reoccurs, and it did not. So, to sum up, let me change some things in my pipeline and see if I can bypass the error myself before closing this issue. I will come back tomorrow with a final comment and close it myself; I just need to make sure I am not missing anything else that could contribute to this result.
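
One possible reason why reloading the saved partitions behaves differently, which the thread does not confirm, is that a CSV round trip re-parses every value, so a column that held a mix of floats and numeric strings in memory comes back with a single dtype. A minimal sketch with made-up data:

```python
import io

import pandas as pd

# Made-up column that mixes floats and numeric strings in one object column.
mixed = pd.DataFrame({"price": [1.5, "2.5", 3.0, "4.0"]})
print(mixed["price"].dtype)

# Write to an in-memory CSV and read it back, as a stand-in for saving the
# partitions to disk and loading them again.
buffer = io.StringIO()
mixed.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded["price"].dtype)
```

Here the in-memory column is object-typed, while the reloaded column is parsed as a uniform float column. Whether this is what happened with the kick and higgs partitions is an assumption.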
