[Discussion] Improve Parquet-Metadata Processing in read_parquet
Current Metadata-Related Challenges in read_parquet
The current approach to parquet-metadata handling in Dask-Dataframe has been causing pain for many users recently. This is especially true for large-scale IO from cloud-based file systems. The origin of this pain seems to be the historical decision to rely on a shared _metadata file.
Why we use _metadata
The historical reason for our adoption of the _metadata file is simple: the usual convention in Dask is to only use the client process to construct a graph when a public collection-related API is used. There are certainly exceptions to this (e.g. set_index), but it is typically true that non-compute/persist API calls will only execute on the client. Therefore, in order to avoid the slow process of opening and processing footer metadata for every file in a parquet dataset on the client, we have encouraged the construction of a single/global _metadata file at write time. This is why Dask's to_parquet implementation will construct and write this file by default (see the write_metadata_file argument).
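For reference, here is a minimal (hypothetical) example of writing a dataset along with the shared metadata file; the path and data below are placeholders:

import pandas as pd
import dask.dataframe as dd

# Build a small example collection and write it as a parquet dataset.
# write_metadata_file=True asks Dask to also write a global _metadata file.
ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)
ddf.to_parquet("my-dataset/", engine="pyarrow", write_metadata_file=True)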
Why _metadata can be a problem
Although most parquet users will likely benefit from writing/reading a shared _metadata file, there are clearly cases where this approach breaks down. As illustrated in issues like #8031 and #8027, there are large-scale IO scenarios in which a single metadata file can be too large to write and/or read on a single process. Given that the entire purpose of Dask-Dataframe is to enable/accelerate large-scale tabular-data processing, I feel that we should treat these cases seriously.
As I will present below, my short-term suggestion for addressing the _metadata problem is two-fold:
- Make it possible to ignore a shared _metadata file when one is present (note that this step is already implemented in #8034)
- Refactor the Engine.read_metadata implementation internals (especially for "pyarrow-dataset" and "fastparquet") to allow parts/statistics to be collected in parallel, and to avoid the unnecessary intermediate step of constructing a proxy global-metadata object when the _metadata file is missing and/or needs to be ignored.
The Current Organization of read_parquet
Core-Engine Interface
To understand the proposed short-term _metadata solution, it is useful to have a rough understanding of the "core"-"engine" interface that is currently used within Dask-Dataframe's read_parquet API. When the user calls read_parquet, they are either explicitly or implicitly specifying a backend engine. That engine is then expected to produce three critical pieces of information when engine.read_metadata is called:
- meta: The metadata (an empty pandas.DataFrame object) for the output DataFrame collection
- parts: The list of information needed to produce the output-DataFrame partitions at IO time. After parts is finalized, each of its elements will be used for a distinct call to engine.read_partition to produce a single output pandas.DataFrame partition. For this reason, generating parts is effectively the same as materializing the task graph.
- statistics: A list of statistics for each element of parts. This list is required to calculate output divisions, aggregate files by chunksize, and/or apply filters (except for "pyarrow-dataset"). An illustrative sketch of one such entry is shown below.
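For illustration only, a single statistics entry roughly carries per-file (or per-row-group) row counts and per-column min/max values. The exact keys below are a hypothetical sketch, not the engine's internal format:

# Hypothetical shape of one statistics entry (one entry per element of parts)
single_stat = {
    "num-rows": 50_000,
    "columns": [
        {"name": "timestamp", "min": "2021-01-01", "max": "2021-01-31"},
        {"name": "id", "min": 0, "max": 9999},
    ],
}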
Outline of the existing read_parquet algorithm:
def read_parquet(paths, filters, engine, ...):
    ...
    # ENGINE logic to calculate meta, parts & statistics.
    #
    # This is where the engine is expected to produce the output
    # `meta`, and the `statistics` & `parts` lists.
    (
        meta,
        statistics,
        parts,
        ...
    ) = engine.read_metadata(paths, filters, ...)

    # CORE logic to apply filters and calculate divisions
    (
        parts,
        divisions,
        ...
    ) = process_statistics(parts, statistics, filters, ...)

    # CORE logic to define the Layer/DataFrame
    layer = DataFrameIOLayer(parts, ...)
    ...
    return new_dd_object(...)
Although the best long-term read_parquet API will likely need to break the single read_metadata API into 2-3 distinct functions, the short-term proposal will start by leaving the surface area of this function as is. Instead of modifying the "public" Engine API, I suggest that we focus on a smaller internal refactor of read_metadata.
Engine-Specific Logic: Collecting Metadata and parts/statistics
In order to understand how we must modify the existing Engine.read_metadata implementations, it is useful to describe how these functions are currently designed. The general approach comprises three steps:
- Use engine-specific logic to construct a global parquet-metadata object (or a global-metadata "proxy")
- Use the collected metadata/schema information to define the output DataFrame meta
- Use a mix of shared and engine-specific logic to convert the metadata to parts and statistics
Outline of the existing ArrowDatasetEngine.read_metadata implementation:
class ArrowDatasetEngine:

    @classmethod
    def read_metadata(cls, paths, filters, ...):

        # Collect a `metadata` structure. For "pyarrow-legacy",
        # this is a `pyarrow.parquet.FileMetaData` object. For
        # "pyarrow-dataset", this is a list of dataset fragments.
        schema, metadata, ... = cls._gather_metadata(paths, filters, ...)

        # Use the pyarrow.dataset schema to construct the `meta`
        # of the output DataFrame collection
        meta, ... = cls._generate_dd_meta(schema, ...)

        # Use the `metadata` object (list of fragments) to construct
        # coupled `parts` and `statistics` lists
        parts, statistics, ... = cls._construct_parts(metadata, ...)

        return meta, statistics, parts

    @classmethod
    def _gather_metadata(cls, paths, filters, ...):

        # Create pyarrow.dataset object
        ds = pa_dataset.dataset(paths, ...)
        schema = ds.schema

        # Collect filtered list of dataset "fragments".
        # Call this list of fragments the "metadata"
        # (this is NOT a formal "parquet-metadata" object)
        metadata = _collect_pyarrow_dataset_frags(ds, filters, ...)

        return schema, metadata, ...

    @classmethod
    def _construct_parts(cls, metadata, ...):

        # Here we use a combination of engine-specific and
        # shared logic to construct coupled `parts` and `statistics`
        parts, statistics = <messy-logic>(metadata, ...)

        return parts, statistics, ...
The exact details of these steps depend on the specific engine, but the general algorithm is pretty much the same for both "pyarrow" and "fastparquet". This general algorithm works well in many cases, but it has the following drawbacks:
- The metadata/metadata-proxy object construction does not yet allow the user to opt out of _metadata processing.
- The metadata/metadata-proxy object is only collected in parallel for the "pyarrow-legacy" API (not for "pyarrow" or "fastparquet").
- Even when the metadata object is collected in parallel for "pyarrow-legacy", it is still reduced into a single object and then processed serially on the client. This is clearly not the most efficient way to produce the final parts/statistics lists.
- (More of a future problem) The meta is not constructed until after a metadata or metadata-proxy object has been constructed. This is not a problem yet, but is likely to become one when it is time to implement an abstract-expression API for read_parquet.
Short-Term Proposal
Make _metadata Processing Optional
I propose that we allow the user to specify that a global _metadata file should be ignored by Dask. This first step is already implemented in #8034, where a new ignore_metadata_file= kwarg has been added to the public read_parquet API. Please feel free to provide specific feedback in that PR.
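As a minimal usage sketch (with a hypothetical dataset path), ignoring the shared metadata file would look like:

import dask.dataframe as dd

# Skip the shared _metadata file (if one exists) and plan partitions
# from the data files themselves
ddf = dd.read_parquet(
    "my-dataset/",
    engine="pyarrow",
    ignore_metadata_file=True,
)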
Refactor *Engine.read_metadata
Outline of the PROPOSED ArrowDatasetEngine.read_metadata implementation:
class ArrowDatasetEngine:

    @classmethod
    def read_metadata(cls, paths, filters, ignore_metadata_file, ...):

        # Stage 1: Use a combination of engine-specific logic and shared
        # fsspec utilities to construct an engine-specific `dataset_info` dictionary.
        dataset_info = cls._collect_dataset_info(paths, ignore_metadata_file, ...)

        # Stage 2: Use information in `dataset_info` (like the schema) to define
        # the `meta` for the output DataFrame collection.
        meta, ... = cls._generate_dd_meta(dataset_info, ...)

        # Stage 3: Use information in `dataset_info` to directly construct
        # `parts` and `statistics`
        parts, statistics = cls._make_partition_plan(dataset_info, meta, filters, ...)

        return meta, statistics, parts
Stage-1 Details
The purpose of this stage is to do as little work as possible to populate a dictionary of high-level information about the dataset in question. This "high-level" information will most likely include the schema, the paths, the file system, and the discovered hive partitions. Although our first pass at this should not try to do anything particularly clever here, we may eventually want to use this space to discover the paths/hive-partitions in parallel. We can also leverage #9051 (categorical_partitions) to avoid the need to discover all files up front (since we may not need full categorical dtypes for the schema).
Assuming that we should keep Stage 1 simple for now, the new _collect_dataset_info functions will effectively pull out the existing logic in *engine.read_metadata that is used to define the pyarrow.dataset/ParquetDataset/ParquetFile objects. These objects can be stored, along with other important information, in the output dataset_info dictionary. Note that the only critical detail in this stage is that we must support the ignore_metadata_file=True option.
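As a rough, hypothetical sketch of what Stage 1 might produce for the "pyarrow-dataset" engine (the key names here are illustrative, not the actual engine internals):

import fsspec
import pyarrow.dataset as pa_ds

# Hypothetical local dataset path
path = "my-dataset/"
fs = fsspec.filesystem("file")
ds = pa_ds.dataset(path, format="parquet")

dataset_info = {
    "ds": ds,                    # pyarrow.dataset.Dataset object
    "fs": fs,                    # fsspec file-system object
    "schema": ds.schema,         # dataset schema
    "base_path": path,           # root directory of the dataset
    "has_metadata_file": False,  # e.g. when ignore_metadata_file=True
}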
Stage-2 Details
There is not much (if any) work to be done here. The existing logic in the various _generate_dd_meta implementations can be reused.
Stage-3 Details
This stage will correspond to the lion's share of the pain required for this work. While the short-term plan for Stage-1 and Stage-2 is to effectively move existing code into simple functions with clear objectives, Stage-3 will require us to implement a new algorithm to (optionally) construct parts/statistics in parallel. I expect the exact details here to be pretty engine-specific. That is, "pyarrow-dataset" will (optionally) parallelize over the processing of file fragments into parts/statistics, while "fastparquet" will need to parallelize over paths a bit differently. However, I do expect all engines to follow a similar "parallelize over file-path/object" approach. In the case that the _metadata file exists (and the user has not asked to ignore it), we should avoid constructing parts/statistics on the workers.
ROUGH illustration of the PROPOSED ArrowDatasetEngine._make_partition_plan implementation:
@classmethod
def _make_partition_plan(cls, dataset_info, meta, filters, split_row_groups, ...):

    parts, statistics = [], []

    # (OPTIONALLY) DASK-PARALLELIZE THIS LOOP:
    for file_object in all_file_objects:
        part, part_stats = cls._collect_file_parts(file_object, ...)
        parts.append(part)
        statistics.append(part_stats)

    return parts, statistics
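To make the "parallelize over file-path/object" idea concrete, here is a minimal sketch (not the actual implementation) of how the per-file loop could be fanned out as Dask tasks. The helper _collect_file_parts and the paths below are hypothetical stand-ins:

import dask
from dask import delayed

def _collect_file_parts(path):
    # Hypothetical per-file helper: open one file's footer metadata and
    # return a (part, statistics) pair for that file (stand-in logic here)
    return {"piece": path}, {"num-rows": None}

def make_partition_plan(paths, parallel=True):
    # Fan the per-file metadata collection out as Dask tasks, then
    # gather the results to build the coupled parts/statistics lists
    if parallel:
        results = dask.compute(*[delayed(_collect_file_parts)(p) for p in paths])
    else:
        results = [_collect_file_parts(p) for p in paths]
    parts = [part for part, _ in results]
    statistics = [stats for _, stats in results]
    return parts, statistics

# Example usage with hypothetical file names:
# parts, statistics = make_partition_plan(["part.0.parquet", "part.1.parquet"])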
Thanks for engaging in this discussion @jorisvandenbossche and @MrPowers ! Sorry - I didn’t realize there were comments I missed here.
It would be super helpful to get your eyes on that PR if you have time 😃
You have this all correct, but I believe my focus is on slightly different "use cases" than you seem to be assuming here. I do not intend to improve the case where we have a proper _metadata file to use for partition/division planning. I am targeting the case where there is no _metadata file. This may happen when (1) the dataset was written by some engine other than Dask, (2) the dataset was too large to efficiently write a single _metadata file, or (3) the _metadata file needs to be ignored by the Dask client for whatever reason, e.g. it is too large to process in a single process or the data is invalid.
Therefore, the target of this plan (and of #8072 and #8092) is to redesign read_metadata to enable better performance for these non-_metadata cases (which I expect to become more and more common at scale). Overall, the proposal is to merge a pretty straightforward refactor to reduce the (currently dramatic) performance hit that we now see when the _metadata file is missing or ignored.
I am in complete agreement that Dask must throw away the assumption that the _metadata file will always be present. That is one motivation for this proposal. Also, it is a good point that alternative approaches like Apache Iceberg & Delta Lake may be the best answer for true "big data" use cases. I am very supportive of adding Dask support for other metadata-logging approaches, but I also think that it is important that Dask have reliable support for raw parquet datasets at relatively large scale.
@MrPowers - It is great that you are looking into Delta Lake! I am very excited to have your expertise involved here. My gut tells me that many Dask users will benefit from something like the Delta Lake approach to metadata scaling. With that said, my current preference would be to expose any new system-specific IO approach as distinct read_*/write_* APIs in Dask. We could certainly leverage the same Engine/parquet logic, but I'd like to avoid adding a new default global-metadata approach for raw read/write_parquet. I'm sure my preference could change here, and you may already have a separate API in mind, but I just wanted to share my thoughts.
Just a heads up that I submitted #8072 with a draft of some of the "pyarrow-dataset" changes I proposed. If anyone has a system that was previously struggling with up-front metadata processing, I'd be curious to know if the performance improves at all with that PR (cc @jrbourbeau).
Note that you can play with both ignore_metadata_file and files_per_metadata_tasks to compare the different algorithms.
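For example, a rough comparison might look like the following sketch. Here files_per_metadata_tasks is taken as the knob name mentioned above, from the draft PR; the name and default in the merged API may differ:

import dask.dataframe as dd

# Baseline: rely on the shared _metadata file if one exists
ddf_a = dd.read_parquet("my-dataset/", engine="pyarrow")

# Alternative: ignore _metadata and plan partitions from the data files,
# processing several files per metadata-collection task
# (files_per_metadata_tasks is assumed from the draft PR and may not
# match the final merged keyword)
ddf_b = dd.read_parquet(
    "my-dataset/",
    engine="pyarrow",
    ignore_metadata_file=True,
    files_per_metadata_tasks=8,
)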