Categorizer should sort categories

Hello everyone,

We are facing a problem when calling dd.get_dummies (or DummyEncoder) after using Categorizer to infer the categories.

The problem seems to arise when two columns have the same categorical values, but those values appear in a different order. In that case, we get "ValueError: The columns in the computed data do not match the columns in the provided metadata", followed by "Order of columns does not match".

The example below shows that get_dummies works fine when we explicitly define the categories in the same order. But when Categorizer infers the categories (in a different order), we get the ValueError.

We would expect get_dummies to work in both cases.

Thanks for the great work.

Milton

import dask.dataframe as dd
import pandas as pd
from dask_ml.preprocessing import Categorizer
from pandas.api.types import CategoricalDtype

pdf = pd.DataFrame(
    {
        "c1": ["a", "c"],
        "c2": ["c", "a"],
        "c3": ["d", "d"],
    },
)


# setting categories explicitly in the same order works
ddf = dd.from_pandas(pdf, npartitions=1)
cat = Categorizer(
    categories={
        "c1": CategoricalDtype(categories=["a", "c"], ordered=False),
        "c2": CategoricalDtype(categories=["a", "c"], ordered=False),
        "c3": CategoricalDtype(categories=["d"], ordered=False),
    }
)
ddf = cat.fit_transform(ddf)
print(dd.get_dummies(ddf).compute())


# if categorizer infers the categories in a different
# order we'll get an exception on get_dummies
ddf = dd.from_pandas(pdf, npartitions=1)

cat = Categorizer()
ddf = cat.fit_transform(ddf)

print(ddf.compute())
# this will show that categories are inferred as 
# ['a', 'c'] and ['c', 'a'] for c1 and c2 respectively
# I believe this is causing the problem
print(cat.categories_)
print(dd.get_dummies(ddf).compute())

Environment:

  • Dask version: 2022.4.0
  • Python version: 3.9
  • Operating System: Ubuntu 20.04
  • Install method (conda, pip, source): conda

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 2
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

jsignell commented, Apr 7, 2022 (1 reaction)

Well, I had trouble reproducing without dask-ml, but I did reproduce it with dask-ml, and I think you are right to point to the difference in category order as the source of the issue. In particular, it looks like the columns are in a different order in the _meta (the tiny version of the dataframe that we use to know what the dataframe looks like) than they are in the computed dataframe. Normally you can get around an issue like that by passing enforce_metadata=False, but that doesn't quite work for get_dummies, since it has a special way of calculating meta. I am opening a PR that will make it less special. After that PR gets in, you'll be able to do dd.get_dummies(ddf, enforce_metadata=False).compute(). Do you think this is good enough? I think the other option would be to try to sort the output columns, which would make the resulting column order not necessarily match pandas.
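
As a rough illustration of the trade-off in that last option, here is a small pandas-only sketch (not dask's implementation and not the proposed PR). It sorts the columns produced by get_dummies and shows that the result no longer follows the order pandas itself emits. The variable names are hypothetical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# A single categorical column whose categories are not in sorted order.
frame = pd.DataFrame(
    {"c2": pd.Series(["c", "a"], dtype=CategoricalDtype(categories=["c", "a"]))}
)

# Column order as pandas produces it: one column per category, in category order.
pandas_order = pd.get_dummies(frame)

# Same data with the output columns sorted by name.
sorted_order = pandas_order.reindex(columns=sorted(pandas_order.columns))

print(list(pandas_order.columns))
print(list(sorted_order.columns))
```

Sorting gives a deterministic layout regardless of category order, at the cost of diverging from what pd.get_dummies returns for the same input.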

jsignell commented, Apr 7, 2022 (1 reaction)

Thanks for writing this up @miltava! I am going to see if I can reproduce without dask-ml.
