Writing and reading known categorical to parquet results in forgetting categories

To reproduce:

  • Create a dask dataframe with a categorical column that has known categories.
  • Save it to parquet.
  • Read it back from parquet.

After the round trip, the categories are unknown:

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame(data=list('abcaabbcc'), columns=['col'])
df.col = df.col.astype('category')
ddf = dd.from_pandas(df, npartitions=1)

ddf.to_parquet('tmp')
ddf2 = dd.read_parquet('tmp')

>>> ddf.col
Dask Series Structure:
npartitions=1
0    category[known]
8                ...
Name: col, dtype: category
Dask Name: getitem, 2 tasks

>>> ddf2.col
Dask Series Structure:
npartitions=1
0    category[unknown]
8                  ...
Name: col, dtype: category
Dask Name: getitem, 2 tasks

Versions: dask 0.16.0, fastparquet 0.1.3, pandas 0.21.0

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
McToel commented, Jan 12, 2021

Shouldn’t this be mentioned in documentation about categoricals? I had this issue come up while working with categoricals, and it took me some time to figure out how to fix it.

1 reaction
TomAugspurger commented, Dec 1, 2017

See https://github.com/dask/dask/issues/2947 for that last issue with astype.

I think something like

In [13]: ddf.col.cat.set_categories(dtype.categories)
Out[13]:
Dask Series Structure:
npartitions=1
0    category[known]
8                ...
Name: col, dtype: category
Dask Name: cat, 3 tasks

will work in the meantime.
