Fastparquet to support column MultiIndex


Hi,

I have a DataFrame with a column MultiIndex, and fastparquet refuses to write it.

import os
import pandas as pd
import fastparquet as fp

file = os.path.expanduser('~/Documents/code/data/fp_test')

# Dataset
ts = [pd.Timestamp('2021/01/01 08:00:00'), pd.Timestamp('2021/01/05 10:00:00')]
val = [10, 34]
df = pd.DataFrame({'val': val, 'ts': ts})
tuples = [(col, point) for col, point in zip(df.columns, [0, ''])]
midx = pd.MultiIndex.from_tuples(tuples, names=('component', 'point'))
df.columns = midx

# Write
fp.write(file, df, file_scheme='hive')
In [20]: df
Out[20]: 
component val                  ts
point       0                    
0          10 2021-01-01 08:00:00
1          34 2021-01-05 10:00:00

Result

fp.write(file, df, file_scheme='hive')
Traceback (most recent call last):

  File "<ipython-input-19-d8cbdb47bc3b>", line 1, in <module>
    fp.write(file, df, file_scheme='hive')

  File "/home/pierre/anaconda3/lib/python3.8/site-packages/fastparquet-0.6.3-py3.8-linux-x86_64.egg/fastparquet/writer.py", line 859, in write
    fmd = make_metadata(data, has_nulls=has_nulls, ignore_columns=ignore,

  File "/home/pierre/anaconda3/lib/python3.8/site-packages/fastparquet-0.6.3-py3.8-linux-x86_64.egg/fastparquet/writer.py", line 667, in make_metadata
    get_column_metadata(data[column], column))

  File "/home/pierre/anaconda3/lib/python3.8/site-packages/fastparquet-0.6.3-py3.8-linux-x86_64.egg/fastparquet/util.py", line 263, in get_column_metadata
    raise TypeError(

TypeError: Column name must be a string. Got column ('val', 0) of type tuple

Additionally, to apply the columns and filters parameters of to_pandas to a DataFrame recorded this way, these parameters would have to accept tuples for column names instead of strings.

I have found at least ticket #409, related to column MultiIndex, which seems to have been closed with a PR. Was this supported in the past?

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, May 24, 2021

Parquet does not allow non-string column names of any sort. The layout you have could perhaps be regarded as a nested structure - but fastparquet does not currently support writing this. The simple workaround would be to encode the column names in a flat scheme:

df.columns = ['.'.join([str(c) for c in col]).strip() for col in df.columns.values]

and write that instead. It is conceivable that fastparquet could do that, and save the hierarchical column layout in the parquet pandas metadata, to reconstruct on load. You may wish to check how arrow handles this case. I think actually writing a structured schema is off the table for fastparquet.
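As a rough illustration of the flat-scheme workaround above, here is a minimal sketch that joins each column tuple into a string before writing and rebuilds the MultiIndex after loading. The helper names (flat_name, flatten_columns, restore_columns) and the SEP constant are hypothetical and not part of fastparquet or pandas. The sketch assumes the separator never appears in a level value, and note that every level value comes back as a string (the integer 0 becomes '0').

```python
import pandas as pd

SEP = "."  # assumed separator; must not occur inside any level value


def flat_name(key):
    """Join one column tuple into a flat string (hypothetical helper)."""
    return SEP.join(str(level) for level in key)


def flatten_columns(df):
    """Return a copy whose MultiIndex columns are replaced by flat strings."""
    out = df.copy()
    out.columns = [flat_name(col) for col in df.columns]
    return out


def restore_columns(df, names):
    """Rebuild a column MultiIndex from flat names; level values stay strings."""
    out = df.copy()
    tuples = [tuple(name.split(SEP)) for name in df.columns]
    out.columns = pd.MultiIndex.from_tuples(tuples, names=list(names))
    return out


# Same dataset as in the question
ts = [pd.Timestamp("2021-01-01 08:00:00"), pd.Timestamp("2021-01-05 10:00:00")]
df = pd.DataFrame({"val": [10, 34], "ts": ts})
df.columns = pd.MultiIndex.from_tuples(
    [("val", 0), ("ts", "")], names=("component", "point")
)

flat = flatten_columns(df)  # this frame has string column names only
restored = restore_columns(flat, names=df.columns.names)

print(list(flat.columns))      # ['val.0', 'ts.']
print(list(restored.columns))  # [('val', '0'), ('ts', '')]
```

With this scheme, flat would be the frame passed to fp.write, and a tuple such as ('val', 0) could be translated with flat_name before being given to the columns parameter of to_pandas, since the stored names are plain strings.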

0 reactions
bluss commented, Nov 2, 2021

thanks for the invitation. I’m prone to being nerdsniped by problems but I’ll try my best, for my sake, to avoid getting drawn into this. (Don’t) wish me luck… 🙂


