AutoMLSearch fails with Ordinal logical type input from Featuretools

AutoMLSearch fails if the input contains Ordinal data from Featuretools, such as the columns generated by the Year, Month, and similar primitives.

Code Sample, a copy-pastable example to reproduce your bug.

import featuretools as ft
from evalml import AutoMLSearch
import pandas as pd

df = pd.read_csv("delhi_200.csv")

es = ft.EntitySet()
es.add_dataframe(dataframe_name="df", dataframe=df, index="id", make_index=True, time_index="date")
es["df"].ww

trans_primitives = ["day"]
features = ft.dfs(entityset=es,
                  target_dataframe_name="df",
                  max_depth=1,
                  features_only=True,
                  trans_primitives=trans_primitives)
features.append(ft.Feature(es["df"].ww["date"]))
fm = ft.calculate_feature_matrix(entityset=es, features=features)
y = fm.ww.pop("meantemp")
X = fm

problem_configuration={"gap": 0, "max_delay": 7, "forecast_horizon": 7, "time_index": "date"}
automl = AutoMLSearch(
    X,
    y,
    problem_type="time series regression",
    problem_configuration=problem_configuration,
)

automl.search()

Random Forest Regressor w/ Replace Nullable Types Transformer + Imputer + Time Series Featurizer + DateTime Featurizer + One Hot Encoder + Drop NaN Rows Transformer fold 0: Encountered an error.
Random Forest Regressor w/ Replace Nullable Types Transformer + Imputer + Time Series Featurizer + DateTime Featurizer + One Hot Encoder + Drop NaN Rows Transformer fold 0: All scores will be replaced with nan.
Fold 0: Exception during automl search: Input contains NaN

...

AutoMLSearchException: All pipelines in the current AutoML batch produced a score of np.nan on the primary objective <evalml.objectives.standard_metrics.MedianAE object at 0x2898447c0>.

delhi_200.csv

Issue Analytics

Created a year ago
Comments: 6 (6 by maintainers)

Top GitHub Comments

tamargrey commented, Nov 2, 2022

@tamargrey Still trying to digest this information just a bit, but how can we do a categorical-to-double conversion reliably, since categories don’t have to be numeric?

@thehomebrewnerd I believe it happens via _encode_X_while_preserving_index, which will turn all the categories into numbers (here).

However, _get_categorical_columns ignores Ordinal and any other logical type that carries the category standard tag, so those columns are never ordinally encoded. If their data is not already numeric, we will have problems with any non-numeric ordinal or category feature. It should be a quick fix, but I would want to talk to other folks on the modeling team before making this change.
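
To make the gap described in this comment concrete, here is a minimal, library-free sketch. It is not EvalML's implementation: the COLUMNS dictionary, the two selector functions, and ordinal_encode are hypothetical names invented for illustration, and the column metadata is made up. It only shows how selecting by exact logical type skips an Ordinal column while selecting by the category tag picks it up, and how non-numeric categories can still be mapped to doubles by position.

```python
# Hypothetical column metadata: name -> (logical type, semantic tags, ordered categories).
# These structures are illustrative only and are not EvalML or Woodwork objects.
COLUMNS = {
    "city": ("Categorical", {"category"}, None),
    "size": ("Ordinal", {"category"}, ["small", "medium", "large"]),
    "meantemp": ("Double", {"numeric"}, None),
}


def columns_by_logical_type(columns):
    """Select only columns whose logical type is exactly Categorical."""
    return [name for name, (ltype, _, _) in columns.items() if ltype == "Categorical"]


def columns_by_category_tag(columns):
    """Select every column carrying the 'category' tag, Ordinal included."""
    return [name for name, (_, tags, _) in columns.items() if "category" in tags]


def ordinal_encode(values, order=None):
    """Map each category to a float code.

    With an explicit order, the code is the position in that order; otherwise
    categories are numbered in sorted order, which is arbitrary but stable.
    Values not found in the order (including None) become NaN.
    """
    if order is None:
        order = sorted({v for v in values if v is not None})
    codes = {category: float(i) for i, category in enumerate(order)}
    return [codes.get(v, float("nan")) for v in values]


type_selected = columns_by_logical_type(COLUMNS)   # misses the Ordinal column
tag_selected = columns_by_category_tag(COLUMNS)    # includes the Ordinal column
encoded_sizes = ordinal_encode(["small", "large", "medium", "small"], COLUMNS["size"][2])
encoded_cities = ordinal_encode(["Delhi", "Agra", "Delhi"])
```

Under these assumptions, an Ordinal column keeps its declared order when encoded, while an unordered categorical column gets arbitrary but consistent codes. Whether the real fix should select columns by tag is the open question raised in the comment above.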

thehomebrewnerd commented, Nov 2, 2022

@tamargrey Still trying to digest this information just a bit, but how can we do a categorical-to-double conversion reliably, since categories don’t have to be numeric?