Woodwork type inference throws a "different data types" error when trying to predict labels on test data
Version: evalml = "0.43.0", woodwork = "0.11.1"
I am using evalml for a benchmark, and among the chosen datasets are higgs and kick. The problem is that when I split the datasets and run the analysis, I get an error saying "Input X data types are different from the input types the pipeline was fitted on." After looking into it, I realised that woodwork's inferred schema assigns some of the variables in the test set a different type than the one assigned to them during the automl search and fitting. My guess is that this crashes when it classifies extra variables as the unknown type.
Code Sample, a copy-pastable example to reproduce your bug.
To replicate this error, I downloaded the `.csv` files from the OpenML repo and performed an 80/20 stratified split on the dataset:
import pandas as pd
from evalml import automl
from sklearn.model_selection import train_test_split
from evalml.utils import infer_feature_types
dataset = pd.read_csv("Binary Classification/phpZLgL9q.csv")
target = dataset[dataset.columns[-1]] # the target is the last column
data = dataset.drop(columns=[dataset.columns[-1]], axis=1)
train_data, test_data, train_target, test_target = train_test_split(
    data, target, test_size=0.2, random_state=42, stratify=target
)
eval_automl = automl.AutoMLSearch(
    X_train=train_data,
    y_train=train_target,
    problem_type="binary",
    objective="f1",
    max_time=30,
    ensembling=False,
    verbose=True,
)
eval_automl.search()
eval_automl.best_pipeline.predict(test_data)
Here are the evalml logs; I omit most of the training logs, which go as expected. For both datasets, I get the same error:
Removing columns ['MMRAcquisitionRetailAveragePrice', 'MMRAcquisitonRetailCleanPrice', 'MMRCurrentRetailAveragePrice', 'MMRCurrentRetailCleanPrice'] because they are of 'Unknown' type
Generating pipelines to search over...
...
Search finished after 00:37
Best pipeline: Elastic Net Classifier w/ Label Encoder + Drop Columns Transformer + Imputer + One Hot Encoder + Undersampler + Standard Scaler
Best pipeline F1: 0.410122
# in the predict() phase the error occurs
File "ww_tester.py", line 22, in <module>
eval_automl.best_pipeline.predict(test_data)
File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/pipeline_meta.py", line 33, in _check_for_fit
return method(self, *args, **kwargs)
File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/classification_pipeline.py", line 116, in predict
_predictions = self._predict(X, objective=objective)
File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/binary_classification_pipeline.py", line 72, in _predict
ypred_proba = self.predict_proba(X)
File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/pipeline_meta.py", line 33, in _check_for_fit
return method(self, *args, **kwargs)
File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/binary_classification_pipeline.py", line 87, in predict_proba
return super().predict_proba(X)
File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/pipeline_meta.py", line 33, in _check_for_fit
return method(self, *args, **kwargs)
File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/classification_pipeline.py", line 141, in predict_proba
X = self.transform_all_but_final(X, y=None)
File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/pipeline_base.py", line 271, in transform_all_but_final
return self.component_graph.transform_all_but_final(X, y=y)
File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/component_graph.py", line 259, in transform_all_but_final
features, _ = self._fit_transform_features_helper(False, X, y)
File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/component_graph.py", line 282, in _fit_transform_features_helper
evaluate_training_only_components=needs_fitting,
File "/home/mlearning/.cache/pypoetry/virtualenvs/sphynxml-poetry-XCtkmD2k-py3.7/lib/python3.7/site-packages/evalml/pipelines/component_graph.py", line 393, in _transform_features
"Input X data types are different from the input types the pipeline was fitted on."
ValueError: Input X data types are different from the input types the pipeline was fitted on.
If I run the woodwork infer schema function outside of evalml, we can see that different variables have been identified as the unknown type in each split, hence the error.
import pandas as pd
from sklearn.model_selection import train_test_split
from evalml.utils import infer_feature_types
dataset = pd.read_csv("kick.csv")
target = dataset[dataset.columns[-1]] # the target is the last column
data = dataset.drop(columns=[dataset.columns[-1]], axis=1)
train_data, test_data, train_target, test_target = train_test_split(
    data, target, test_size=0.2, random_state=42, stratify=target
)
ww_train = infer_feature_types(train_data)
ww_test = infer_feature_types(test_data)
unknown_columns = list(ww_train.ww.select("unknown", return_schema=True).columns)
unknown_columns_test = list(ww_test.ww.select("unknown", return_schema=True).columns)
print(unknown_columns)
print(unknown_columns_test)
>>> ['MMRAcquisitionRetailAveragePrice', 'MMRAcquisitonRetailCleanPrice', 'MMRCurrentRetailAveragePrice', 'MMRCurrentRetailCleanPrice']
>>> ['MMRAcquisitionAuctionAveragePrice', 'MMRAcquisitionAuctionCleanPrice', 'MMRAcquisitionRetailAveragePrice', 'MMRAcquisitonRetailCleanPrice', 'MMRCurrentAuctionAveragePrice', 'MMRCurrentAuctionCleanPrice', 'MMRCurrentRetailAveragePrice', 'MMRCurrentRetailCleanPrice']
# same code for the higgs dataset, the prints show:
>>> Columns (19,20,21,22,23,24,25,26,27) have mixed types.Specify dtype option on import or set low_memory=False.
>>> ['m_jj', 'm_jjj', 'm_lv', 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']
>>> ['jet4phi', 'm_jj', 'm_jjj', 'm_lv', 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']
It is not clear to me why this error occurs. I have looked into the data and the splits and found no major differences in their content that would justify the difference in dtype identification.
Any help or insights would be appreciated.
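To see exactly which columns were typed differently in each split, rather than comparing the two printed lists by eye, a small helper along these lines might be useful. This is only a sketch: `diff_column_types` is a hypothetical name, it works on plain `{column: type name}` dictionaries, and the type names in the toy data are made up for illustration, not taken from the real run.

```python
from typing import Dict, List, Tuple


def diff_column_types(
    train_types: Dict[str, str], test_types: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (column, train_type, test_type) for each column whose type differs."""
    mismatches = []
    for column, train_type in train_types.items():
        # A column absent from the test mapping is reported as "<missing>".
        test_type = test_types.get(column, "<missing>")
        if train_type != test_type:
            mismatches.append((column, train_type, test_type))
    return mismatches


# Hypothetical type names, loosely modeled on the kick output above.
train_types = {
    "MMRAcquisitionAuctionAveragePrice": "Double",
    "MMRAcquisitionRetailAveragePrice": "Unknown",
}
test_types = {
    "MMRAcquisitionAuctionAveragePrice": "Unknown",
    "MMRAcquisitionRetailAveragePrice": "Unknown",
}

for column, train_type, test_type in diff_column_types(train_types, test_types):
    print(f"{column}: train={train_type}, test={test_type}")
```

With Woodwork, the two dictionaries could presumably be built from each dataframe's `ww.logical_types` mapping by converting the values to strings, but that step is an assumption and is not shown here.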
Top GitHub Comments
Sounds good! Sorry it took so long to look into it. Thanks for using EvalML and Woodwork! Check out our latest versions - we’re up to EvalML 0.52.0 and WW 0.16.3 now!
Hmm, it could possibly be corrupted due to the utf-8 decoding I do, though still no errors show when running it. I need .csv files so that others can store and review the data. The good news is that I saved the train/test partitions as .csv and then loaded them back as dataframes to see if the error reoccurs, and it did not. So, to sum up, let me change some things in my pipeline and see if I can bypass the error myself before closing this issue. I will come back tomorrow with a final comment and close it myself; I just need to make sure I am not missing anything else that could contribute to this result.
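For readers who hit the same "mixed types" warning, one possible way to normalize such columns in memory, without a round trip through .csv, is to coerce them to numbers before splitting. This is a sketch under assumptions: it assumes the affected columns are meant to be numeric, `coerce_numeric_columns` is a hypothetical helper, and the `"?"` placeholder in the toy data is invented for illustration. It is not confirmed to be the cause or the fix in this thread.

```python
import pandas as pd


def coerce_numeric_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of df with the given columns parsed as numbers.

    Values that cannot be parsed (for example a "?" placeholder) become NaN.
    """
    result = df.copy()
    for column in columns:
        result[column] = pd.to_numeric(result[column], errors="coerce")
    return result


# Toy frame: one column mixes strings, floats, and an unparseable placeholder.
mixed = pd.DataFrame({"m_jj": ["1.5", 2.0, "?"], "label": [0, 1, 0]})
clean = coerce_numeric_columns(mixed, ["m_jj"])
print(clean.dtypes)
```

Coercing before the train/test split means both partitions start from the same pandas dtypes, which should give type inference less room to disagree between them.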