Fastparquet to support column MultiIndex
Hi,
I have a DataFrame with a column MultiIndex, and fastparquet will not accept it.
import os
import pandas as pd
import fastparquet as fp
file = os.path.expanduser('~/Documents/code/data/fp_test')
# Dataset
ts = [pd.Timestamp('2021/01/01 08:00:00'), pd.Timestamp('2021/01/05 10:00:00')]
val = [10, 34]
df = pd.DataFrame({'val': val, 'ts': ts})
tuples = [(col, point) for col, point in zip(df.columns, [0, ''])]
midx = pd.MultiIndex.from_tuples(tuples, names=('component', 'point'))
df.columns = midx
# Write
fp.write(file, df, file_scheme='hive')
In [20]: df
Out[20]:
component val                  ts
point       0
0          10 2021-01-01 08:00:00
1          34 2021-01-05 10:00:00
Result
fp.write(file, df, file_scheme='hive')
Traceback (most recent call last):
File "<ipython-input-19-d8cbdb47bc3b>", line 1, in <module>
fp.write(file, df, file_scheme='hive')
File "/home/pierre/anaconda3/lib/python3.8/site-packages/fastparquet-0.6.3-py3.8-linux-x86_64.egg/fastparquet/writer.py", line 859, in write
fmd = make_metadata(data, has_nulls=has_nulls, ignore_columns=ignore,
File "/home/pierre/anaconda3/lib/python3.8/site-packages/fastparquet-0.6.3-py3.8-linux-x86_64.egg/fastparquet/writer.py", line 667, in make_metadata
get_column_metadata(data[column], column))
File "/home/pierre/anaconda3/lib/python3.8/site-packages/fastparquet-0.6.3-py3.8-linux-x86_64.egg/fastparquet/util.py", line 263, in get_column_metadata
raise TypeError(
TypeError: Column name must be a string. Got column ('val', 0) of type tuple
Additionally, to apply the columns and filters parameters of to_pandas to a DataFrame written this way, these parameters would have to accept tuples for column names instead of strings.
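A purely hypothetical call illustrating the request (tuple-valued columns and filters are not accepted by the current fastparquet API; this reuses the repro above and shows the desired behavior, not existing functionality):

pf = fp.ParquetFile(file)
# Hypothetical: address columns by their full MultiIndex tuple.
sub = pf.to_pandas(
    columns=[('val', 0)],
    filters=[(('ts', ''), '>=', pd.Timestamp('2021/01/02'))],
)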
I have found at least ticket #409, related to column MultiIndex, which seems to have been closed with a PR. Has this been supported in the past?
Issue Analytics
- State:
- Created 2 years ago
- Comments: 7 (3 by maintainers)
Top Results From Across the Web

ENH: support for reading MultiIndex · Issue #262 - GitHub
fastparquet supports writing multiple index levels and the pandas ... DataFrame(np.random.randn(10, 3), columns=list('abc'), index=pd.

Release Notes — fastparquet 0.7.1 documentation
Fastparquet used to cast such columns to float, so that we could represent NULLs as NaN; now we use the new(er) masked types...

How do I save multi-indexed pandas dataframes to parquet?
pyarrow can write pandas multi-index to parquet files. ... DataFrame(np.random.rand(6,4)) df_test.columns = pd.

fastparquet Documentation - Read the Docs
The properties columns, count, dtypes and statistics are available to assist with this, and a summary in info. In addition, if the data...

What's new in 1.5.0 (September 19, 2022) - Pandas
Currently timezones in datetime columns are not preserved when a dataframe ... MultiIndex.to_frame() now supports the argument allow_duplicates and raises ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Parquet does not allow non-string column names of any sort. The layout you have could perhaps be regarded as a nested structure, but fastparquet does not currently support writing this. The simple workaround would be to encode the column names in a flat scheme, for instance (one possible flattening; the '.' separator is an arbitrary choice, not anything fastparquet prescribes):
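# Flatten MultiIndex column names to plain strings,
# e.g. ('val', 0) -> 'val.0' and ('ts', '') -> 'ts'.
# Illustrative only: any separator that cannot occur in a name works.
df.columns = ['.'.join(str(level) for level in col if level != '')
              for col in df.columns]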
and write that instead. It is conceivable that fastparquet could do that, and save the hierarchical column layout in the parquet pandas metadata, to reconstruct on load. You may wish to check how arrow handles this case. I think actually writing a structured schema is off the table for fastparquet.
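For comparison, a sketch of the arrow route (assuming pyarrow is installed; pyarrow flattens the column MultiIndex for the parquet schema and records the original levels in its pandas footer metadata, so the columns should round-trip):

import pyarrow as pa
import pyarrow.parquet as pq

# Reuses df from the repro above (before any flattening of the columns).
table = pa.Table.from_pandas(df)
pq.write_table(table, 'fp_test_arrow.parquet')

# to_pandas() reconstructs the column MultiIndex from the footer metadata.
df_back = pq.read_table('fp_test_arrow.parquet').to_pandas()
assert isinstance(df_back.columns, pd.MultiIndex)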
thanks for the invitation. I’m prone to being nerdsniped by problems but I’ll try my best, for my sake, to avoid getting drawn into this. (Don’t) wish me luck… 🙂