Pyarrow metadata `RuntimeError` in `to_parquet`
Offline, a user reported getting `RuntimeError: file metadata is only available after writer close` when writing a Dask DataFrame to Parquet with our pyarrow engine. The traceback they were presented with was:
```
Traceback (most recent call last):
  File "example.py", line 349, in <module>
    main(date_dict, example_conf)
  File "example.py", line 338, in main
    make_example_datasets(
  File "example.py", line 311, in make_example_datasets
    default_to_parquet(sub_ddf, v["path"], engine="pyarrow", overwrite=True)
  File "example.py", line 232, in default_to_parquet
    ddf.to_parquet(path=path, engine=engine, overwrite=overwrite, write_metadata_file=False)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/core.py", line 4453, in to_parquet
    return to_parquet(self, path, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 721, in to_parquet
    out = out.compute(**compute_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/base.py", line 286, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/base.py", line 568, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 2743, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 2020, in gather
    return self.sync(
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 861, in sync
    return sync(
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 326, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 309, in f
    result[0] = yield future
  File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 1885, in _gather
    raise exception.with_traceback(traceback)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py", line 947, in write_partition
    pq.write_table(
  File "/opt/conda/lib/python3.8/site-packages/pyarrow/parquet.py", line 1817, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/opt/conda/lib/python3.8/site-packages/pyarrow/parquet.py", line 662, in __exit__
    self.close()
  File "/opt/conda/lib/python3.8/site-packages/pyarrow/parquet.py", line 684, in close
    self._metadata_collector.append(self.writer.metadata)
  File "pyarrow/_parquet.pyx", line 1434, in pyarrow._parquet.ParquetWriter.metadata.__get__
RuntimeError: file metadata is only available after writer close
```
cc @rjzamora in case you’ve seen this before or have an idea of what might be causing this
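For context, here is a minimal sketch of the failing call pattern as reconstructed from the traceback; the data, output path, and cluster setup are all assumptions, since the original reproducer is not shown here:

```python
# Hedged sketch of the call pattern from the traceback; the DataFrame
# contents, output path, and cluster setup are assumptions.
import pandas as pd
import dask.dataframe as dd
from distributed import Client

client = Client()  # the traceback shows the distributed scheduler in use

df = pd.DataFrame({"ts": pd.to_datetime(["2021-01-01", "2021-01-02"])})
ddf = dd.from_pandas(df, npartitions=2)

# Mirrors the call inside the user's default_to_parquet helper:
ddf.to_parquet(
    "example_output/",
    engine="pyarrow",
    overwrite=True,
    write_metadata_file=False,
)
```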

Thanks for the reproducer! I can reproduce it with the above dask example, but if I try to extract the relevant pyarrow example, I don't see the failure (I get the correct error about "Casting from timestamp[ns] to timestamp[us] would lose data", and the `metadata_collector` actually gets filled with a `FileMetaData` object).

Would the fact that it is executed in threads when using dask influence it somehow?
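The standalone snippet from that comment did not survive extraction; the following is a hedged sketch of what such a pyarrow-only call could look like, with the table contents, output path, and `coerce_timestamps` choice assumed from the traceback and the casting error quoted above:

```python
# Hedged sketch of a standalone pyarrow write; the table contents, output
# path, and coerce_timestamps option are assumptions for illustration.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# With sub-microsecond values (e.g. "2021-01-01 00:00:00.000000001"),
# coerce_timestamps="us" raises the "Casting from timestamp[ns] to
# timestamp[us] would lose data" error quoted above; with round values
# the write succeeds and the collector is filled.
df = pd.DataFrame({"ts": pd.to_datetime(["2021-01-01 00:00:00"])})
table = pa.Table.from_pandas(df)

metadata_collector = []
pq.write_table(
    table,
    "example.parquet",
    coerce_timestamps="us",
    metadata_collector=metadata_collector,  # forwarded to ParquetWriter
)
print(metadata_collector)  # e.g. [<pyarrow._parquet.FileMetaData ...>]
```

Run outside of dask, a call of this shape fills `metadata_collector` with a `FileMetaData` object instead of raising the `RuntimeError` from the traceback, which matches the observation above.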
So if it fixes the error for you, we can certainly apply the patch. But it would be nice to have a reproducer for our own test suite as well that doesn’t rely on dask.
Unfortunately not; I thought I had it, and it went away again…