Categorizer should sort categories

Hello everyone,

We are facing a problem when calling dd.get_dummies (or DummyEncoder) after using Categorizer to infer the categories.

The problem seems to arise when two columns have the same categorical values but those values appear in a different order. In that case, we get "ValueError: The columns in the computed data do not match the columns in the provided metadata", followed by "Order of columns does not match".

The example below shows that get_dummies works fine when we explicitly define the categories in the same order. But when Categorizer infers the categories (in a different order), we get the ValueError.

We would expect get_dummies to work in both cases.

Thanks for the great work.

Milton

import dask.dataframe as dd
import pandas as pd
from dask_ml.preprocessing import Categorizer
from pandas.api.types import CategoricalDtype

pdf = pd.DataFrame(
    {
        "c1": ["a", "c"],
        "c2": ["c", "a"],
        "c3": ["d", "d"],
    },
)


# setting categories explicitly in the same order works
ddf = dd.from_pandas(pdf, npartitions=1)
cat = Categorizer(
    categories={
        "c1": CategoricalDtype(categories=["a", "c"], ordered=False),
        "c2": CategoricalDtype(categories=["a", "c"], ordered=False),
        "c3": CategoricalDtype(categories=["d"], ordered=False),
    }
)
ddf = cat.fit_transform(ddf)
print(dd.get_dummies(ddf).compute())


# if categorizer infers the categories in a different
# order we'll get an exception on get_dummies
ddf = dd.from_pandas(pdf, npartitions=1)

cat = Categorizer()
ddf = cat.fit_transform(ddf)

print(ddf.compute())
# this will show that categories are inferred as 
# ['a', 'c'] and ['c', 'a'] for c1 and c2 respectively
# I believe this is causing the problem
print(cat.categories_)
print(dd.get_dummies(ddf).compute())
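
A pandas-only sketch of why category order matters for the dummy columns, and of one possible user-side workaround: sorting each column's categories before encoding. It does not reproduce the dask error itself. The helper name sort_categories is hypothetical, and whether sorting avoids the ValueError in dask is an assumption that is not confirmed in this thread.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

pdf = pd.DataFrame({"c1": ["a", "c"], "c2": ["c", "a"]})

# Mimic the inferred dtypes described above: same values, different order.
inferred = pdf.astype(
    {
        "c1": CategoricalDtype(categories=["a", "c"], ordered=False),
        "c2": CategoricalDtype(categories=["c", "a"], ordered=False),
    }
)

# The dummy columns follow the declared category order of each column.
inferred_columns = list(pd.get_dummies(inferred).columns)


def sort_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy whose categorical columns
    list their categories in sorted order (values are unchanged)."""
    out = df.copy()
    for name in out.select_dtypes(include="category").columns:
        sorted_cats = sorted(out[name].cat.categories)
        out[name] = out[name].cat.set_categories(sorted_cats)
    return out


sorted_columns = list(pd.get_dummies(sort_categories(inferred)).columns)

print(inferred_columns)
print(sorted_columns)
```

With sorted categories, both c1 and c2 declare the same category order, so the dummy columns come out in the same order for each source column.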

Environment:

Dask version: 2022.4.0
Python version: 3.9
Operating System: Ubuntu 20.04
Install method (conda, pip, source): conda

Issue Analytics

Created a year ago
Reactions: 2
Comments: 10 (5 by maintainers)

Top GitHub Comments

jsignell commented, Apr 7, 2022

Well, I had trouble reproducing without dask-ml, but I did reproduce it with dask-ml, and I think you are right in pointing to the difference in category order as the source of the issue. In particular, it looks like the columns are in a different order in the _meta (the tiny version of the df that we use to know what the dataframe looks like) than they are in the computed dataframe. Normally you can get around an issue like that by including enforce_metadata=False, but that’s not quite the case for get_dummies since it has a special way of calculating meta. I am opening a PR that will make it less special. After that PR gets in, you’ll be able to do dd.get_dummies(ddf, enforce_metadata=False).compute(). Do you think this is good enough? I think the other option would be to try to sort the output columns, which would make the resulting column order not necessarily match pandas.
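
To picture the mismatch described in this comment, here is a minimal pandas-only sketch. It is not dask's internal check. The helper align_to_meta and the two small frames are hypothetical stand-ins for the _meta and the computed result.

```python
import pandas as pd


def align_to_meta(computed: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: reorder the columns of `computed` to match `meta`.

    Raises if the two frames do not share the same set of column names.
    """
    if set(computed.columns) != set(meta.columns):
        raise ValueError("column names differ, not just their order")
    return computed[list(meta.columns)]


# Stand-in for the metadata: an empty frame that only records column order.
meta = pd.DataFrame(columns=["c1_a", "c1_c", "c2_a", "c2_c"])

# Stand-in for the computed result: same columns, different order.
computed = pd.DataFrame(
    [[1, 0, 0, 1], [0, 1, 1, 0]],
    columns=["c1_a", "c1_c", "c2_c", "c2_a"],
)

order_matches = list(computed.columns) == list(meta.columns)
aligned = align_to_meta(computed, meta)

print(order_matches)
print(list(aligned.columns))
```

Reordering like this keeps the data intact and only changes the column layout, which is also why sorting the output columns could make the result differ from what pandas returns.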

jsignell commented, Apr 7, 2022

Thanks for writing this up @miltava! I am going to see if I can reproduce without dask-ml.