
Deprecate "pyarrow-legacy" engine in dask.dataframe.read_parquet

See original GitHub issue

Given the pyarrow version requirements in Dask, we can now assume the dataset API will be supported if pyarrow is installed. We can also assume that deprecation warnings will be raised by the pyarrow backend if dd.read_parquet(..., engine="pyarrow-legacy") is used. Therefore, I propose that we add an explicit deprecation warning for the “pyarrow-legacy” engine itself, and establish a timeline for its removal.

Deprecating and removing “pyarrow-legacy” should simplify read_parquet maintenance, and should have few (if any) downsides. However, I welcome pushback if others expect this to cause pain or problems.
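The proposed change could look roughly like the sketch below. The `get_engine` helper, the alias names, and the warning text are all hypothetical illustrations of the pattern, not Dask's actual implementation:

```python
import warnings


def get_engine(engine):
    """Hypothetical engine resolver illustrating the proposed deprecation.

    "pyarrow" maps to the dataset-API backend, while "pyarrow-legacy"
    still works for now but emits a FutureWarning.
    """
    if engine == "pyarrow-legacy":
        warnings.warn(
            "The 'pyarrow-legacy' engine is deprecated and will be removed "
            "in a future release. Use engine='pyarrow' (the pyarrow dataset "
            "API) instead.",
            FutureWarning,
        )
        return "pyarrow-legacy"
    if engine == "pyarrow":
        return "pyarrow-dataset"
    return engine


# Demonstrate that selecting the legacy engine raises the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    backend = get_engine("pyarrow-legacy")

print(backend)                      # pyarrow-legacy
print(caught[0].category.__name__)  # FutureWarning
```

A FutureWarning (rather than DeprecationWarning) is the conventional choice here because it is shown to end users by default, which matters when establishing a removal timeline.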

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (10 by maintainers)

Top GitHub Comments

1 reaction
jorisvandenbossche commented, Apr 6, 2022

Also the PartitionObj could be cleaned up (or at least its docstring; it might be that it’s still a useful abstraction for only the new code as well).

1 reaction
martindurant commented, Oct 9, 2021

I would be surprised if there were still options/features available with the legacy backend not covered by dataset. I am not the expert on that, though.


Top Results From Across the Web

dask.dataframe.read_parquet - Dask documentation
To express OR in predicates, one must use the (preferred for “pyarrow”) List[List[Tuple]] notation. Note that the “fastparquet” engine does not currently ...
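The List[List[Tuple]] notation mentioned in that result is disjunctive normal form: the outer list ORs together inner lists, and each inner list ANDs its (column, op, value) tuples. A small pure-Python sketch of those semantics (the evaluator itself is illustrative, not Dask or pyarrow code):

```python
import operator

# Comparison operators allowed in the (column, op, value) tuples.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches(row, filters):
    """Evaluate List[List[Tuple]] filters as an OR of ANDs."""
    return any(
        all(OPS[op](row[col], val) for col, op, val in conjunction)
        for conjunction in filters
    )


# (x > 5 AND y == "a") OR (x <= 1)
filters = [[("x", ">", 5), ("y", "==", "a")], [("x", "<=", 1)]]

rows = [
    {"x": 10, "y": "a"},  # matches the first conjunction
    {"x": 0, "y": "b"},   # matches the second conjunction
    {"x": 3, "y": "a"},   # matches neither
]
print([matches(r, filters) for r in rows])  # [True, True, False]
```

The same structure passed as `filters=` to read_parquet lets the backend prune row groups before reading them.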
Dask df.to_parquet can't find pyarrow. RuntimeError
It does seem that dask and pure python are using different environments. In the first example the path is:.
Dask Read Parquet Files into DataFrames with read_parquet
Dask read_parquet : pyarrow vs fastparquet engines. You can read and write Parquet files to Dask DataFrames with the fastparquet and pyarrow ...
`read_parquet` of dask is really slow compared to spark
This reads a directory of Parquet data into a Dask.dataframe, one file per partition. It selects the index among the sorted columns if...
Scaling to large datasets — pandas 1.1.5 documentation
Pandas provides data structures for in-memory analytics, ... The snippet shows dd.read_parquet(..., engine="pyarrow") returning a Dask DataFrame Structure with columns id, name, x ...
