Fastparquet to support column MultiIndex

Hi,

I have a DataFrame with a column MultiIndex, and fastparquet refuses to write it.

import os
import pandas as pd
import fastparquet as fp

file = os.path.expanduser('~/Documents/code/data/fp_test')

# Dataset
ts = [pd.Timestamp('2021/01/01 08:00:00'), pd.Timestamp('2021/01/05 10:00:00')]
val = [10, 34]
df = pd.DataFrame({'val': val, 'ts': ts})
tuples = [(col, point) for col, point in zip(df.columns, [0, ''])]
midx = pd.MultiIndex.from_tuples(tuples, names=('component', 'point'))
df.columns = midx

# Write
fp.write(file, df, file_scheme='hive')

In [20]: df
Out[20]: 
component val                  ts
point       0                    
0          10 2021-01-01 08:00:00
1          34 2021-01-05 10:00:00

Result

fp.write(file, df, file_scheme='hive')
Traceback (most recent call last):

  File "<ipython-input-19-d8cbdb47bc3b>", line 1, in <module>
    fp.write(file, df, file_scheme='hive')

  File "/home/pierre/anaconda3/lib/python3.8/site-packages/fastparquet-0.6.3-py3.8-linux-x86_64.egg/fastparquet/writer.py", line 859, in write
    fmd = make_metadata(data, has_nulls=has_nulls, ignore_columns=ignore,

  File "/home/pierre/anaconda3/lib/python3.8/site-packages/fastparquet-0.6.3-py3.8-linux-x86_64.egg/fastparquet/writer.py", line 667, in make_metadata
    get_column_metadata(data[column], column))

  File "/home/pierre/anaconda3/lib/python3.8/site-packages/fastparquet-0.6.3-py3.8-linux-x86_64.egg/fastparquet/util.py", line 263, in get_column_metadata
    raise TypeError(

TypeError: Column name must be a string. Got column ('val', 0) of type tuple

Additionally, to apply the columns and filters parameters of to_pandas to a DataFrame recorded this way, these parameters would have to accept tuples for column names instead of strings.
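Until tuples are accepted, one possible stopgap is to translate tuple-based column names into flat strings before handing them to to_pandas. The sketch below assumes the DataFrame was written with dot-joined column names; the separator and the helper names (flat_name, flat_columns, flat_filters) are hypothetical and not part of fastparquet. It only handles a flat list of (column, op, value) filter triples.

```python
SEP = "."  # assumed separator used when the column names were flattened


def flat_name(col):
    """Return the flat string name for a column given as a tuple or a string."""
    if isinstance(col, tuple):
        return SEP.join(str(level) for level in col)
    return col


def flat_columns(columns):
    """Translate a list of tuple or string column names to flat names."""
    return [flat_name(col) for col in columns]


def flat_filters(filters):
    """Rewrite (column, op, value) filter triples to use flat column names."""
    return [(flat_name(col), op, value) for col, op, value in filters]


columns = flat_columns([("val", 0), ("ts", "")])
filters = flat_filters([(("val", 0), ">", 20)])
print(columns)
print(filters)
```

The translated lists could then be passed on, for example as fp.ParquetFile(file).to_pandas(columns=columns, filters=filters).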

I have found at least ticket #409 related to column multi-index that seems to have been closed with a PR. Has this been supported in the past?

Issue Analytics

Created 2 years ago
Comments: 7 (3 by maintainers)

Top GitHub Comments

martindurant commented, May 24, 2021

Parquet does not allow non-string column names of any sort. The layout you have could perhaps be regarded as a nested structure - but fastparquet does not currently support writing this. The simple workaround would be to encode the column names in a flat scheme:

df.columns = ['.'.join([str(c) for c in col]).strip() for col in df.columns.values]

and write that instead. It is conceivable that fastparquet could do that, and save the hierarchical column layout in the parquet pandas metadata, to reconstruct on load. You may wish to check how arrow handles this case. I think actually writing a structured schema is off the table for fastparquet.
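A minimal sketch of that flat scheme with a way back to tuples on load, assuming "." never appears inside a level value. The helper names (flatten_columns, unflatten_columns) are illustrative and not part of fastparquet or pandas. Unlike the one-liner above, it keeps the trailing separator so that an empty level survives the round trip; non-string levels such as the integer 0 come back as strings.

```python
SEP = "."  # assumed separator; must not occur inside any level value


def flatten_columns(columns):
    """Join each tuple of column levels into one string, e.g. ('val', 0) -> 'val.0'."""
    flat = []
    for col in columns:
        parts = [str(level) for level in col]
        if any(SEP in part for part in parts):
            raise ValueError(f"separator {SEP!r} found in column {col!r}")
        flat.append(SEP.join(parts))
    return flat


def unflatten_columns(names, nlevels):
    """Split flat names back into tuples of strings with nlevels entries each."""
    tuples = []
    for name in names:
        parts = name.split(SEP)
        if len(parts) != nlevels:
            raise ValueError(f"expected {nlevels} levels in {name!r}")
        tuples.append(tuple(parts))
    return tuples


original = [("val", 0), ("ts", "")]
flat = flatten_columns(original)
restored = unflatten_columns(flat, nlevels=2)
print(flat)
print(restored)
```

With pandas, this would be used as df.columns = flatten_columns(df.columns) before fp.write, and df.columns = pd.MultiIndex.from_tuples(unflatten_columns(df.columns, 2), names=('component', 'point')) after reading the file back.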

bluss commented, Nov 2, 2021

thanks for the invitation. I’m prone to being nerdsniped by problems but I’ll try my best, for my sake, to avoid getting drawn into this. (Don’t) wish me luck… 🙂
