Deprecate "pyarrow-legacy" engine in dask.dataframe.read_parquet
Given the pyarrow version requirements in Dask, we can now assume that the dataset API will be supported whenever pyarrow is installed. We can also assume that the pyarrow backend will raise its own deprecation warnings when dd.read_parquet(..., engine="pyarrow-legacy") is used. Therefore, I propose that we add an explicit deprecation warning for the “pyarrow-legacy” engine itself, and establish a timeline for its removal.
Deprecating and removing “pyarrow-legacy” should simplify read_parquet maintenance, and should have few (if any) downsides. However, I welcome pushback if others expect this to cause pain or problems.
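For concreteness, a minimal sketch of what the explicit warning might look like at the point where read_parquet resolves its engine argument. The get_engine helper and its dispatch logic here are hypothetical; dask's actual engine-resolution code differs:

```python
import warnings

def get_engine(engine):
    """Resolve a user-supplied parquet engine name (hypothetical helper)."""
    if engine == "pyarrow-legacy":
        # Explicit deprecation warning for the legacy engine, in addition
        # to whatever warnings pyarrow itself raises internally.
        warnings.warn(
            "The 'pyarrow-legacy' engine is deprecated and will be removed "
            "in a future release. Use engine='pyarrow' instead.",
            FutureWarning,
        )
    ...
```

From the user's side, migrating away from the deprecated engine should then amount to passing engine="pyarrow" (the dataset-based backend) instead of engine="pyarrow-legacy".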
Also, the PartitionObj abstraction could be cleaned up (or at least its docstring); it may be that it remains a useful abstraction for the new code alone. I would be surprised if there were still options/features available in the legacy backend that are not covered by the dataset API. I am not the expert on that, though.
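For reference, a rough sketch of the two pyarrow code paths the engines wrap, simplified well beyond what dask actually does (the path and column names are made up for illustration):

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

path = "data/"  # hypothetical partitioned parquet directory

# Legacy-style read: roughly what "pyarrow-legacy" wraps.
legacy_table = pq.ParquetDataset(path).read(columns=["x", "y"])

# Dataset-API read: roughly what the "pyarrow" engine wraps. Partitioning,
# column projection, and predicate filtering are handled natively here.
dataset = ds.dataset(path, format="parquet", partitioning="hive")
table = dataset.to_table(columns=["x", "y"], filter=ds.field("x") > 0)
```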