
Pyarrow metadata `RuntimeError` in `to_parquet`


Offline, a user reported getting `RuntimeError: file metadata is only available after writer close` when writing a Dask DataFrame to Parquet with our pyarrow engine. The traceback they were presented with was:

Traceback (most recent call last):
  File "example.py", line 349, in <module>
    main(date_dict, example_conf)
  File "example.py", line 338, in main
    make_example_datasets(
  File "example.py", line 311, in make_example_datasets
    default_to_parquet(sub_ddf, v["path"], engine="pyarrow", overwrite=True)
  File "example.py", line 232, in default_to_parquet
    ddf.to_parquet(path=path, engine=engine, overwrite=overwrite, write_metadata_file=False)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/core.py", line 4453, in to_parquet
    return to_parquet(self, path, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 721, in to_parquet
    out = out.compute(**compute_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/base.py", line 286, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/base.py", line 568, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 2743, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 2020, in gather
    return self.sync(
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 861, in sync
    return sync(
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 326, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 309, in f
    result[0] = yield future
  File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 1885, in _gather
    raise exception.with_traceback(traceback)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py", line 947, in write_partition
    pq.write_table(
  File "/opt/conda/lib/python3.8/site-packages/pyarrow/parquet.py", line 1817, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/opt/conda/lib/python3.8/site-packages/pyarrow/parquet.py", line 662, in __exit__
    self.close()
  File "/opt/conda/lib/python3.8/site-packages/pyarrow/parquet.py", line 684, in close
    self._metadata_collector.append(self.writer.metadata)
  File "pyarrow/_parquet.pyx", line 1434, in pyarrow._parquet.ParquetWriter.metadata.__get__
RuntimeError: file metadata is only available after writer close
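
For reference, here is a minimal sketch of the call pattern visible in the traceback. The user's data, paths, and DataFrame contents are not in the report, so everything below is an assumed reconstruction:

import dask.dataframe as dd
import pandas as pd

# Hypothetical stand-in data; the user's actual DataFrame is unknown.
df = pd.DataFrame({"time": pd.date_range("2022-01-01", "2022-01-02", periods=500)})
ddf = dd.from_pandas(df, npartitions=4)

# The call shape from the traceback: pyarrow engine, overwrite the target
# directory, and skip writing the global _metadata file.
ddf.to_parquet(
    "output/",
    engine="pyarrow",
    overwrite=True,
    write_metadata_file=False,
)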

cc @rjzamora in case you’ve seen this before or have an idea of what might be causing this

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 25 (13 by maintainers)

Top GitHub Comments

1 reaction
jorisvandenbossche commented, May 24, 2022

Thanks for the reproducer! I can reproduce it with the above dask example, but if I try to extract the relevant pyarrow example, I don’t see the failure:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"time": pd.date_range("2022-01-01", "2022-01-02", periods=500)})
table = pa.table(df)

metadata_collector = []

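# Writing ns-precision timestamps with coerce_timestamps="us" and
# allow_truncated_timestamps=False is expected to raise a casting error
# when the writer closes, yet metadata_collector still gets populated.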
with open("test_invalid.parquet", "wb") as fil:
    pq.write_table(table, fil, coerce_timestamps="us", allow_truncated_timestamps=False, metadata_collector=metadata_collector)

(I get the expected error about "Casting from timestamp[ns] to timestamp[us] would lose data", and `metadata_collector` actually does get filled with a `FileMetaData` object.)

Would the fact that it is executed in threads when using dask influence it somehow?
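
One hedged way to probe that hypothesis is to run the same failing write concurrently from a thread pool and check whether the collector behaves differently. This is an assumed experiment, not a confirmed reproducer:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from concurrent.futures import ThreadPoolExecutor

df = pd.DataFrame({"time": pd.date_range("2022-01-01", "2022-01-02", periods=500)})
table = pa.table(df)

def write(i):
    # Each thread writes its own file; the ns -> us coercion is expected
    # to raise, and we record whether metadata was still collected.
    collector = []
    try:
        pq.write_table(
            table,
            f"test_invalid_{i}.parquet",
            coerce_timestamps="us",
            allow_truncated_timestamps=False,
            metadata_collector=collector,
        )
    except Exception as exc:
        return type(exc).__name__, len(collector)

with ThreadPoolExecutor(max_workers=8) as pool:
    print(list(pool.map(write, range(8))))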

So if it fixes the error for you, we can certainly apply the patch. But it would be nice to have a reproducer for our own test suite as well that doesn’t rely on dask.

1 reaction
ian-r-rose commented, Apr 4, 2022

Unfortunately not. I thought I had it, and it went away again…

