Deprecate "pyarrow-legacy" engine in dask.dataframe.read_parquet
Given the pyarrow version requirements in Dask, we can now assume that the dataset API will be supported whenever pyarrow is installed. We can also assume that the pyarrow backend will raise its own deprecation warnings when dd.read_parquet(..., engine="pyarrow-legacy") is used. Therefore, I propose that we add an explicit deprecation warning for the “pyarrow-legacy” engine itself, and establish a timeline for its removal.
Deprecating and removing “pyarrow-legacy” should simplify read_parquet maintenance, and should have few (if any) downsides. However, I welcome pushback if others expect this to cause pain or problems.
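For concreteness, a minimal sketch of what the explicit warning might look like at the point where read_parquet resolves its engine argument. The get_engine helper and its dispatch logic here are hypothetical; dask's actual engine-resolution code differs:

```python
import warnings

def get_engine(engine):
    """Resolve a user-supplied parquet engine name (hypothetical helper)."""
    if engine == "pyarrow-legacy":
        # Explicit deprecation warning for the legacy engine, in addition
        # to whatever warnings pyarrow itself raises internally.
        warnings.warn(
            "The 'pyarrow-legacy' engine is deprecated and will be removed "
            "in a future release. Use engine='pyarrow' instead.",
            FutureWarning,
        )
    ...
```

From the user's side, migrating away from the deprecated engine should then amount to passing engine="pyarrow" (the dataset-based backend) instead of engine="pyarrow-legacy".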
Also, the PartitionObj abstraction could be cleaned up (or at least its docstring); it may be that it remains a useful abstraction for the new code alone. I would be surprised if there were still options/features available in the legacy backend that are not covered by the dataset API. I am not the expert on that, though.
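For reference, a rough sketch of the two pyarrow code paths the engines wrap, simplified well beyond what dask actually does (the path and column names are made up for illustration):

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

path = "data/"  # hypothetical partitioned parquet directory

# Legacy-style read: roughly what "pyarrow-legacy" wraps.
legacy_table = pq.ParquetDataset(path).read(columns=["x", "y"])

# Dataset-API read: roughly what the "pyarrow" engine wraps. Partitioning,
# column projection, and predicate filtering are handled natively here.
dataset = ds.dataset(path, format="parquet", partitioning="hive")
table = dataset.to_table(columns=["x", "y"], filter=ds.field("x") > 0)
```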