
Make use of the new pyarrow.dataset functionality instead of ParquetDataset


In the Apache Arrow project, we have been working over the last year on a new Dataset API (original design document: https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit; current Python docs: https://arrow.apache.org/docs/python/dataset.html).

The goal of the dataset API is to handle (scan/materialize) data sources larger than memory, and specifically to provide:

  • a unified interface for different sources: different file systems (local, cloud), different file formats (Parquet, CSV, JSON, Feather, …), and potentially also other sources such as an ODBC connection, though this is not yet implemented
  • discovery of sources (crawling directories, handling directory-based partitioned datasets, basic schema normalization, …)
  • optimized reading with predicate pushdown (filtering rows), projection (selecting columns), and parallel reading, or the ability to manage scan tasks yourself (e.g. with dask); a minimal sketch follows this list
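
As a minimal sketch of that scanning side (the path, column name, and partition key below are placeholders, not part of the original example), the same dataset can be consumed as a stream of record batches, with the projection and filter pushed down to the file readers:

import pyarrow.dataset as ds

# Placeholder path and names; assumes a hive-partitioned Parquet dataset
# on the local filesystem.
dataset = ds.dataset("path/to/partitioned/dataset",
                     format="parquet", partitioning="hive")

# Stream record batches instead of materializing one big table; the column
# projection and the filter are pushed down to the individual file reads.
for batch in dataset.to_batches(columns=["col1"],
                                filter=ds.field("part_key") == "a"):
    print(batch.num_rows)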

Think: the current pyarrow.parquet.ParquetDataset functionality, but not specific to Parquet (Feather and CSV are currently also supported), not tied to Python (so, for example, the R bindings of Arrow also use this), and with more features (better schema normalization, more partitioning schemes, predicate pushdown / row group filtering, etc.).
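
For instance, a hedged sketch of the same API pointed at a non-Parquet source (the path is a placeholder for a directory of CSV files):

import pyarrow.dataset as ds

# Placeholder path; ds.dataset() accepts format="csv" (and "feather")
# in addition to the default "parquet".
csv_dataset = ds.dataset("path/to/csv_dir", format="csv")
table = csv_dataset.to_table(columns=["col1"])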

A small illustrative example of how this can look in Python using the new generic API:

import pyarrow.dataset as ds
dataset = ds.dataset("path/to/partitioned/dataset",
                     format="parquet", partitioning="hive", filesystem=...)
table = dataset.to_table(columns=['col1', 'col2'],
                         filter=(ds.field('key1') == 1) & (ds.field('key2') == 2))

In addition, we also support using it from the existing pyarrow.parquet API with a keyword:

import pyarrow.parquet as pq
pq.ParquetDataset("path/to/partitioned/dataset", use_legacy_dataset=False)
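
As a rough sketch of that shim in use (the path, columns, and filter below are placeholders), filters and column selection are accepted and handled by the new datasets backend:

import pyarrow.parquet as pq

# Placeholder path and filter; with use_legacy_dataset=False the filtering
# and column projection are delegated to the new datasets implementation.
dataset = pq.ParquetDataset("path/to/partitioned/dataset",
                            use_legacy_dataset=False,
                            filters=[("key2", "=", 1)])
table = dataset.read(columns=["col1", "col2"])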

This ParquetDataset “shim”, however, does not support the full existing API. For example, it does not support .pieces or ParquetDatasetPiece, which are exactly the APIs that dask is using.
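
A hedged sketch of the closest new-API counterpart (the path is a placeholder): fragments are the per-file units of a dataset and can be listed and scanned individually, although they are not a drop-in replacement for the .pieces API:

import pyarrow.dataset as ds

# Placeholder path; get_fragments() yields one fragment per data file,
# each carrying its file path and partition expression.
dataset = ds.dataset("path/to/partitioned/dataset",
                     format="parquet", partitioning="hive")
for fragment in dataset.get_fragments():
    print(fragment.path, fragment.partition_expression)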

Long term, we would like this new Datasets implementation to replace the Python implementation of pyarrow.parquet.ParquetDataset. The basic pyarrow.parquet.read_table/write_table and pyarrow.parquet.ParquetFile are there to stay, but hopefully not the Python ParquetDataset implementation, as it duplicates the new Datasets implementation.

Given that pyarrow.parquet.ParquetDataset might eventually go away in its current form (we probably want to keep something similar, possibly with the same name, but probably not with an exactly identical “pieces” API), such a change would certainly have an impact on petastorm, which seems to be a heavy user of the current Python ParquetDataset APIs.

For that reason, I wanted to bring this up: those changes will certainly impact petastorm, and at the same time feedback on the new APIs (whether they are useful / sufficient / … for petastorm's use cases) is very valuable.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 5
  • Comments: 21

Top GitHub Comments

2 reactions
aperiodic commented, Apr 15, 2022

I’ve got a PR pretty much ready to go to add the wrapper class, but I just learned today that there is a new review process for OSS submissions that I have to put this through first. That could add as much as a month, based on historical review times. Hopefully it’ll go much faster than that because this is pretty small and clearly unrelated to our IP, but I’ll keep y’all updated.

1 reaction
aperiodic commented, Apr 5, 2022

Oh good, I didn’t notice that there’s still functionality for reading metadata files. That clears up most of my remaining concerns about moving over to the new API. Thanks for the tips, @jorisvandenbossche!
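
(For reference, a hedged sketch of that route, assuming the dataset was written with a top-level "_metadata" summary file; the path is a placeholder:)

import pyarrow.dataset as ds

# Placeholder path; parquet_dataset() builds the dataset from the
# "_metadata" file instead of crawling the directory.
dataset = ds.parquet_dataset("path/to/partitioned/dataset/_metadata",
                             partitioning="hive")
table = dataset.to_table()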

I think it’s unlikely that there are many users using rowgroup_selector and shuffle_row_drop_partitions (though I don’t have solid proof that my assumption is correct). Once we start moving with your plan, perhaps we should start issuing a deprecation message in the code and see if there is any pushback from the community.

That sounds reasonable to me, at least initially. Once there is a way to use the new PyArrow API for make_batch_reader, we can get some feedback from users and decide whether it makes more sense to focus on moving everything else over to the new API or to backfill support for those two other arguments of make_batch_reader when the new API is being used.

The plan looks solid to me.

Great! I’ll get started on the wrapper class this week, and will hopefully have a PR up for that sometime next week.


