Categorizer should sort categories
Hello everyone,
We are facing a problem when calling dd.get_dummies (or DummyEncoder) on a dataframe whose categories were inferred by Categorizer.
The problem seems to arise when two columns have the same categorical values appearing in a different order. In that case, we get a ValueError: "The columns in the computed data do not match the columns in the provided metadata. Order of columns does not match."
The example below shows that get_dummies works fine when we explicitly define the categories in the same order. But when Categorizer infers the categories (in a different order), we get the ValueError.
We would expect get_dummies to work in both cases.
Thanks for the great work.
Milton
import dask.dataframe as dd
import pandas as pd
from dask_ml.preprocessing import Categorizer
from pandas.api.types import CategoricalDtype

pdf = pd.DataFrame(
    {
        "c1": ["a", "c"],
        "c2": ["c", "a"],
        "c3": ["d", "d"],
    },
)

# setting categories explicitly in the same order works
ddf = dd.from_pandas(pdf, npartitions=1)
cat = Categorizer(
    categories={
        "c1": CategoricalDtype(categories=["a", "c"], ordered=False),
        "c2": CategoricalDtype(categories=["a", "c"], ordered=False),
        "c3": CategoricalDtype(categories=["d"], ordered=False),
    }
)
ddf = cat.fit_transform(ddf)
print(dd.get_dummies(ddf).compute())

# if categorizer infers the categories in a different
# order we'll get an exception on get_dummies
ddf = dd.from_pandas(pdf, npartitions=1)
cat = Categorizer()
ddf = cat.fit_transform(ddf)
print(ddf.compute())

# this will show that categories are inferred as
# ['a', 'c'] and ['c', 'a'] for c1 and c2 respectively
# I believe this is causing the problem
print(cat.categories_)
print(dd.get_dummies(ddf).compute())
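One possible workaround, until inference order is addressed, is to build the sorted dtypes yourself and hand them to Categorizer, which is the explicit form that works in the example above. The sketch below uses plain pandas only; the helper name sorted_category_dtypes is hypothetical and not part of dask-ml, and it assumes the data is small enough to collect the unique values of each column.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def sorted_category_dtypes(frame, columns=None):
    """Hypothetical helper (not a dask-ml API): build one unordered
    CategoricalDtype per column, with the observed values sorted."""
    if columns is None:
        columns = list(frame.columns)
    return {
        col: CategoricalDtype(
            categories=sorted(frame[col].dropna().unique()),
            ordered=False,
        )
        for col in columns
    }


pdf = pd.DataFrame({"c1": ["a", "c"], "c2": ["c", "a"], "c3": ["d", "d"]})
dtypes = sorted_category_dtypes(pdf)

# c1 and c2 now share the same category order, regardless of the
# order in which the values appear in the data.
print(dtypes)
```

The resulting dict can be passed as Categorizer(categories=dtypes), which mirrors the first, working half of the reproduction.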
Environment:
- Dask version: 2022.4.0
- Python version: 3.9
- Operating System: Ubuntu 20.04
- Install method (conda, pip, source): conda
Top GitHub Comments
Well, I had trouble reproducing without dask-ml, but I did reproduce it with, and I think you are right in pointing to the difference in categories order as the source of the issue. In particular, it looks like the columns are just in a different order in the _meta (the tiny version of the df that we use to know what the dataframe looks like) than they are in the computed dataframe. Normally you can get around an issue like that by including enforce_metadata=False, but that's not quite the case for get_dummies, since it has a special way of calculating meta. I am opening a PR that will make it less special. After that PR gets in, you'll be able to do dd.get_dummies(ddf, enforce_metadata=False).compute(). Do you think this is good enough? I think the other option would be to try to sort the output columns, which would make the resulting column order not necessarily match pandas.

Thanks for writing this up @miltava! I am going to see if I can reproduce without dask-ml.
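The trade-off raised in the comment above (sorting the output columns versus matching pandas) can be illustrated with plain pandas. This is only a sketch of the column-ordering behaviour, not of the dask internals: it assumes the categories were inferred as ['a', 'c'] and ['c', 'a'], as reported in the issue, and sets them by hand.

```python
import pandas as pd

pdf = pd.DataFrame({"c1": ["a", "c"], "c2": ["c", "a"]})

# Mimic the inferred dtypes reported in the issue: same values,
# different category order for c1 and c2.
inferred = pdf.astype(
    {
        "c1": pd.CategoricalDtype(["a", "c"]),
        "c2": pd.CategoricalDtype(["c", "a"]),
    }
)

dummies = pd.get_dummies(inferred)

# pandas emits dummy columns in category order, per input column.
pandas_order = list(dummies.columns)

# Sorting the output columns gives a stable order, but one that
# no longer matches what pandas itself returns.
sorted_order = sorted(dummies.columns)

print(pandas_order)
print(sorted_order)
```

Here the two orders differ only in the c2 columns, which is the kind of mismatch the sorting option would introduce relative to pandas.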