Writing and reading known categorical to parquet results in forgetting categories

To reproduce:

1. Create a dask dataframe with a categorical column whose categories are known.
2. Save it to parquet.
3. Read it back from parquet.

Result: the categories are now unknown.

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame(data=list('abcaabbcc'), columns=['col'])
df.col = df.col.astype('category')
ddf = dd.from_pandas(df, npartitions=1)

ddf.to_parquet('tmp')
ddf2 = dd.read_parquet('tmp')

>>> ddf.col
Dask Series Structure:
npartitions=1
0    category[known]
8                ...
Name: col, dtype: category
Dask Name: getitem, 2 tasks

>>> ddf2.col
Dask Series Structure:
npartitions=1
0    category[unknown]
8                  ...
Name: col, dtype: category
Dask Name: getitem, 2 tasks

Versions: dask 0.16.0, fastparquet 0.1.3, pandas 0.21.0

Issue Analytics

Created 6 years ago
Comments: 10 (7 by maintainers)

Top GitHub Comments

McToel commented, Jan 12, 2021

Shouldn’t this be mentioned in documentation about categoricals? I had this issue come up while working with categoricals, and it took me some time to figure out how to fix it.

TomAugspurger commented, Dec 1, 2017

See https://github.com/dask/dask/issues/2947 for that last issue with astype.

I think something like

In [13]: ddf.col.cat.set_categories(dtype.categories)
Out[13]:
Dask Series Structure:
npartitions=1
0    category[known]
8                ...
Name: col, dtype: category
Dask Name: cat, 3 tasks

will work in the meantime.