
[Discussion] Improve Parquet-Metadata Processing in read_parquet


Current Metadata-Related Challenges in read_parquet

The current approach to parquet-metadata handling in Dask-Dataframe has been causing pain for many users recently, especially for large-scale IO from cloud-based file systems. The origin of this pain seems to be the historical decision to rely on a shared _metadata file.

Why we use _metadata

The historical reason for our adoption of the _metadata file is simple: The usual convention in Dask is to only use the client process to construct a graph when a public collection-related API is used. There are certainly exceptions to this (e.g. set_index), but it is typically true that non-compute/persist API calls will only execute on the client. Therefore, in order to avoid the slow process of opening and processing footer metadata for every file in a parquet dataset on the client, we have encouraged the construction of a single/global _metadata file at write time. This is why Dask’s to_parquet implementation will construct and write this file by default (see the write_metadata_file argument).
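
For illustration, here is roughly how that write-time behavior is controlled today (a minimal sketch; the path and column are hypothetical):

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=4)

# Write a global _metadata file alongside the data files (the default described above)
ddf.to_parquet("output-dataset/", write_metadata_file=True)

# Skip the global _metadata file, leaving only the per-file footer metadata
ddf.to_parquet("output-dataset/", write_metadata_file=False)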

Why _metadata can be a problem

Although most parquet users will likely benefit from writing/reading a shared _metadata file, there are clearly cases where this approach breaks down. As illustrated in issues like #8031 and #8027, there are large-scale IO scenarios in which a single metadata file can be too large to write and/or read on a single process. Given that the entire purpose of Dask-Dataframe is to enable/accelerate large-scale tabular-data processing, I feel that we should treat these cases seriously.

As I will present below, my short-term suggestion to the _metadata problem is two-fold:

  1. Make it possible to ignore the existence of a shared _metadata file when it is present (Note that this step is already implemented in #8034)
  2. Refactor the Engine.read_metadata implementation internals (especially for “pyarrow-dataset” and “fastparquet”) to allow parts/statistics to be collected in parallel, and to avoid the unnecessary/intermediate step of constructing a proxy global-metadata object when the _metadata file is missing and/or needs to be ignored.

The Current Organization of read_parquet

Core-Engine Interface

To understand the proposed short-term _metadata solution, it is useful to have a rough understanding of the “core”/“engine” interface that is currently used within Dask-Dataframe’s read_parquet API. When the user calls read_parquet, they are either explicitly or implicitly specifying a backend engine. That engine is then expected to produce three critical pieces of information when engine.read_metadata is called:

  1. meta: The metadata (an empty pandas.DataFrame object) for the output DataFrame collection
  2. parts: The list of information needed to produce the output-DataFrame partitions at IO time. After parts is finalized, each of its elements will be used for a distinct call to engine.read_partition to produce a single output-pandas.DataFrame partition. For this reason, generating parts is effectively the same as materializing the task graph.
  3. statistics: A list of statistics for each element of parts. This list is required to calculate output divisions, aggregate files by chunksize, and/or apply filters (except for “pyarrow-dataset”).
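
To make the role of statistics concrete, here is a hedged sketch of how per-part min/max values for the index column can be turned into sorted divisions (the helper and the statistics layout are simplified illustrations, not Dask’s actual internals):

def divisions_from_statistics(statistics, index_name):
    # Assume each entry looks like {"columns": {"<name>": {"min": ..., "max": ...}}}
    # (a hypothetical simplification of the real per-part statistics layout)
    mins = [s["columns"][index_name]["min"] for s in statistics]
    maxes = [s["columns"][index_name]["max"] for s in statistics]

    # Divisions are only known when parts are sorted and non-overlapping on the index
    if sorted(mins) == mins and all(hi <= lo for hi, lo in zip(maxes, mins[1:])):
        return tuple(mins) + (maxes[-1],)
    return (None,) * (len(statistics) + 1)

# e.g. two parts covering [0, 9] and [10, 19] on index "ts" -> divisions (0, 10, 19)
stats = [
    {"columns": {"ts": {"min": 0, "max": 9}}},
    {"columns": {"ts": {"min": 10, "max": 19}}},
]
assert divisions_from_statistics(stats, "ts") == (0, 10, 19)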

Outline of the existing read_parquet algorithm:

def read_parquet(paths, filters, engine, ...):
    ...
    # ENGINE logic to calculate meta, parts & statistics.
    #
    # This is where the engine is expected to produce output
    # `meta`, and the `parts` & `statistics` lists.
    (
        meta,
        parts,
        statistics,
        ...
    ) = engine.read_metadata(paths, filters, ...)

    # CORE logic to apply filters and calculate divisions
    (
        parts,
        divisions,
        ...
    ) = process_statistics(parts, statistics, filters, ...)

    # CORE logic to define Layer/DataFrame
    layer = DataFrameIOLayer(parts, ...)
    ...
    return new_dd_object(...)
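
For context, a typical public-API call that exercises this path looks roughly like the following (the path and filter values are illustrative, and the kwargs shown are those available at the time of this discussion):

import dask.dataframe as dd

ddf = dd.read_parquet(
    "s3://my-bucket/my-dataset/",     # hypothetical dataset location
    engine="pyarrow",                 # selects the Engine backend described above
    filters=[("year", ">=", 2020)],   # row-group filtering driven by statistics
    gather_statistics=True,           # collect statistics so divisions can be set
)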

Although the best long-term read_parquet API will likely need to break the single read_metadata API into 2-3 distinct functions, the short-term proposal will start by leaving the surface area of this function as is. Instead of modifying the “public” Engine API, I suggest that we focus on a smaller internal refactor of read_metadata.

Engine-Specific Logic: Collecting Metadata and parts/statistics

In order to understand how we must modify the existing Engine.read_metadata implementations, it is useful to describe how these functions are currently designed. The general approach comprises three steps:

  1. Use engine-specific logic to construct a global parquet-metadata object (or a global-metadata “proxy”)
  2. Use collected metadata/schema information to define the output DataFrame meta
  3. Use a mix of shared and engine-specific logic to convert metadata to parts and statistics

Outline of the existing ArrowDatasetEngine.read_metadata implementation:

class ArrowDatasetEngine:

    @classmethod
    def read_metadata(cls, paths, filters, ...):
	
        # Collect a `metadata` structure. For "pyarrow-legacy",
        # this is a `pyarrow.parquet.FileMetaData` object. For
        # "pyarrow-dataset", this is a list of dataset fragments.
        schema, metadata, ... = cls._gather_metadata(paths, filters, ...)

        # Use the pyarrow.dataset schema to construct the `meta`
        # of the output DataFrame collection
        meta, ... = cls._generate_dd_meta(schema, ...)
		
        # Use the `metadata` object (list of fragments) to construct
        # coupled `parts` and `statistics` lists
        parts, statistics, ... = cls._construct_parts(metadata, ...)
		
        return meta, statistics, parts
		
    @classmethod
    def _gather_metadata(cls, paths, ...):
	
        # Create pyarrow.dataset object
        ds = pa_dataset.dataset(paths, ...)
		
        # Collect filtered list of dataset "fragments".
        # Call this list of fragments the "metadata"
        # (this is NOT a formal "parquet-metadata" object)
        metadata = _collect_pyarrow_dataset_frags(ds, filters, ...)

        return schema, metadata, ...
		
    @classmethod
    def _construct_parts(cls, metadata, ...):
	
        # Here we use a combination of engine-specific and
        # shared logic to construct coupled `parts` and `statistics`
        parts, statistics = <messy-logic>(metadata, ...)

        return parts, statistics, ...
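
For concreteness, the pyarrow.dataset calls that _gather_metadata leans on look roughly like this (an illustrative sketch, not the engine’s actual code; the path is hypothetical):

import pyarrow.dataset as pa_dataset

# Build the dataset object; hive partitions are discovered from the directory layout
ds = pa_dataset.dataset("my-dataset/", format="parquet", partitioning="hive")

# The dataset schema is what feeds `_generate_dd_meta`
schema = ds.schema

# Each fragment typically maps to one parquet file; this list plays the role of
# the `metadata` object for the "pyarrow-dataset" engine
fragments = list(ds.get_fragments())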

The exact details of these steps depend on the specific engine, but the general algorithm is pretty much the same for both “pyarrow” and “fastparquet”. This general algorithm works well in many cases, but it has the following drawbacks:

  1. The metadata/metadata-proxy object construction does not yet allow the user to opt out of _metadata processing.
  2. The metadata/metadata-proxy object is only collected in parallel for the “pyarrow-legacy” API (not for “pyarrow” or “fastparquet”)
  3. Even when the metadata object is collected in parallel for “pyarrow-legacy”, it is still reduced into a single object and then processed in serial on the client. This is clearly not the most efficient way to produce the final parts/statistics lists.
  4. (more of a future problem) The meta is not constructed until after a metadata or metadata-proxy object has been constructed. This is not a problem yet, but is likely to become one when it is time to implement an abstract-expression API for read_parquet.

Short-Term Proposal

Make _metadata Processing Optional

I propose that we allow the user to specify that a global _metadata file should be ignored by Dask. This first step is already implemented in #8034, where a new ignore_metadata_file= kwarg has been added to the public read_parquet API. Please feel free to provide specific feedback in that PR.
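
For concreteness, with the ignore_metadata_file= kwarg from #8034, usage would presumably look something like the sketch below (the path is hypothetical):

import dask.dataframe as dd

# Skip any shared _metadata file and fall back to per-file footer metadata
ddf = dd.read_parquet(
    "s3://my-bucket/big-dataset/",
    engine="pyarrow",
    ignore_metadata_file=True,
)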

Refactor *Engine.read_metadata

Outline of the PROPOSED ArrowDatasetEngine.read_metadata implementation:

class ArrowDatasetEngine:
    @classmethod
    def read_metadata(cls, paths, filters, ignore_metadata_file, ...):

        # Stage 1: Use a combination of engine-specific logic and shared
        # fsspec utilities to construct an engine-specific `dataset_info` dictionary.
        dataset_info = cls._collect_dataset_info(paths, ignore_metadata_file, ...)

        # Stage 2: Use information in `dataset_info` (like schema) to define
        # the `meta` for the output DataFrame collection.
        meta, ... = cls._generate_dd_meta(dataset_info, ...)
		
        # Stage 3: Use information in `dataset_info` to directly construct
        # `parts` and `statistics`
        parts, statistics = cls._make_partition_plan(dataset_info, meta, filters, ...)
		
        return meta, statistics, parts

Stage-1 Details

The purpose of this stage is to do as little work as possible to populate a dictionary of high-level information about the dataset in question. This “high-level” information will most likely include the schema, the paths, the file-system, and the discovered hive partitions. Although our first pass at this should not try to do anything particularly clever here, we may eventually want to use this space to discover the paths/hive-partitions in parallel. We can also leverage #9051 (categorical_partitions) to avoid the need to discover all files up front (since we may not need full categorical dtypes for the schema).

Assuming that we should keep stage 1 simple for now, the new _collect_dataset_info functions will effectively pull out the existing logic in *Engine.read_metadata used to define the pyarrow.dataset/ParquetDataset/ParquetFile objects. These objects can be stored, along with other important information, in the output dataset_info dictionary. Note that the only critical detail in this stage is that we must support the ignore_metadata_file=True option.
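
As a rough sketch of Stage 1, the function might look something like the following (the key names and helper are illustrative, not a settled schema):

import fsspec
import pyarrow.dataset as pa_dataset

def _collect_dataset_info_sketch(paths, ignore_metadata_file=False):
    # Resolve the filesystem and expand the input path(s)
    fs, _, expanded_paths = fsspec.get_fs_token_paths(paths)

    # Build the pyarrow dataset object (hive partitions are discovered here)
    ds = pa_dataset.dataset(
        expanded_paths, filesystem=fs, format="parquet", partitioning="hive"
    )

    # Everything the later stages need, packed into one dictionary
    return {
        "schema": ds.schema,              # feeds Stage 2
        "fs": fs,                         # fsspec filesystem object
        "paths": expanded_paths,          # expanded list of data-file paths
        "ds": ds,                         # dataset object, used again in Stage 3
        "ignore_metadata_file": ignore_metadata_file,
    }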

Stage-2 Details

There is not much (if any) work to be done here. The existing logic in the various _generate_dd_meta implementations can be reused.
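
For reference, the essence of _generate_dd_meta is a schema-to-empty-DataFrame conversion along these lines (simplified; the real logic also handles index, categorical, and hive-partition columns):

import pyarrow as pa

def _generate_dd_meta_sketch(schema):
    # An empty table with the dataset schema converts directly into the empty
    # pandas DataFrame that Dask uses as collection metadata (`meta`)
    return schema.empty_table().to_pandas()

meta = _generate_dd_meta_sketch(pa.schema([("x", pa.int64()), ("y", pa.string())]))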

Stage-3 Details

This stage will correspond to the lion’s share of the pain required for this work. While the short-term plans for Stage-1 and Stage-2 are to effectively move existing code into simple functions with clear objectives, Stage-3 will require us to implement a new algorithm to (optionally) construct parts/statistics in parallel. I expect the exact details here to be pretty engine-specific. That is, “pyarrow-dataset” will (optionally) parallelize over the processing of file fragments into parts/statistics, while “fastparquet” will need to parallelize over paths a bit differently. However, I do expect all engines to follow a similar “parallelize over file-path/object” approach. In the case that the _metadata file exists (and the user has not asked to ignore it), we should avoid constructing parts/statistics on the workers.

ROUGH illustration of the PROPOSED ArrowDatasetEngine._make_partition_plan implementation:

@classmethod
def _make_partition_plan(cls, dataset_info, meta, filters, split_row_groups, ...):

    parts, statistics = [], []
	
    # (OPTIONALLY) DASK-PARALLELIZE THIS LOOP:
    # (`all_file_objects` would be pulled from `dataset_info`)
    for file_object in all_file_objects:
        part, part_stats = cls._collect_file_parts(file_object, ...)
        parts.append(part)
        statistics.append(part_stats)
	
    return parts, statistics
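
To make the “(OPTIONALLY) DASK-PARALLELIZE” comment concrete, one hedged way to dispatch that loop with dask.delayed is sketched below (the _collect_file_parts helper is the one named in the outline above, so its exact signature is an assumption):

import dask

def make_partition_plan_parallel(engine_cls, all_file_objects, parallelize=True, **kwargs):
    # One task per file object; each task returns a (part, part_stats) pair
    if parallelize:
        tasks = [
            dask.delayed(engine_cls._collect_file_parts)(fo, **kwargs)
            for fo in all_file_objects
        ]
        results = dask.compute(*tasks)
    else:
        results = [engine_cls._collect_file_parts(fo, **kwargs) for fo in all_file_objects]

    # Re-couple the per-file results into the flat `parts`/`statistics` lists
    parts, statistics = [], []
    for part, part_stats in results:
        parts.append(part)
        statistics.append(part_stats)
    return parts, statistics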


Top GitHub Comments

rjzamora commented, Sep 27, 2021

Thanks for engaging in this discussion @jorisvandenbossche and @MrPowers ! Sorry - I didn’t realize there were comments I missed here.

> Thanks @rjzamora for this thorough write-up and the proposal. Generally it certainly sounds like a good improvement of the current state. Will take a look at your open PR for this as well.

It would be super helpful to get your eyes on that PR if you have time 😃

> I am wondering to what extent this really solves the problems, though. A large part of the problem comes from the single, large binary blob of the _metadata file that needs to be downloaded and parsed up-front (if you want to make use of it to avoid file listing on S3). While the proposed changes certainly move part of the effort (eg inferring divisions from the statistics) to a later stage and potentially parallelize that stage (i.e. _make_partition_plan), you still have the initial step of having to download and parse that big _metadata file that you don’t avoid in this approach (now, to be fair, I don’t really know how expensive those different stages are, and thus how much (or little) benefit it gives to move part to the _make_partition_plan. That might be interesting to profile).

You have this all correct, but I believe my focus is on slightly different “use cases” than you seem to be assuming here. I do not intend to improve the case where we have a proper _metadata file to use for partition/division planning. I am targeting the case where there is no _metadata file. This may happen when (1) the dataset was written by some engine other than Dask, (2) the dataset was too large to efficiently write a single _metadata file, or (3) the _metadata file needs to be ignored by the Dask client for whatever reason; e.g. it is too large to process in a single process or the data is invalid.

Therefore, the target of this plan (and of #8072 and #8092) is to redesign read_metadata to enable better performance for these non-_metadata cases (which I expect to become more and more common at scale). Overall, the proposal is to merge a pretty straightforward refactor to reduce the (currently dramatic) performance hit that we now see when the _metadata file is missing or ignored.

> we should maybe also consider to “give up” on the _metadata approach for huge datasets, and look into more tightly integrating alternative approaches into dask.dataframe (eg Apache Iceberg, DeltaLake, … i.e. approaches that were designed to overcome some of the inherent limitations of _metadata)

I am in complete agreement that Dask must throw away the assumption that the _metadata file will always be present. That is one motivation for this proposal. Also, it is a good point that alternative approaches like Apache Iceberg & DeltaLake may be the best answer for true “big data” use cases. I am very supportive of adding Dask support for other metadata-logging approaches, but I also think that it is important that Dask have reliable support for raw parquet datasets at relatively large scale.

@MrPowers - It is great that you’re looking into Delta Lake! I am very excited to have your expertise involved here. My gut tells me that many Dask users will benefit from something like the Delta Lake approach to metadata scaling. With that said, my current preference would be to expose any new system-specific IO approach as distinct read_*/write_* APIs in Dask. We could certainly leverage the same Engine/parquet logic, but I’d like to avoid adding a new default global-metadata approach for raw read/write_parquet. I’m sure my preference could change here, and you may already have a separate API in mind, but I just wanted to share my thoughts.

rjzamora commented, Aug 20, 2021

Just a heads up that I submitted #8072 with a draft of some of the “pyarrow-dataset” changes I proposed. If anyone has a system that was previously struggling with up-front metadata-processing, I’d be curious to know if the performance improves at all with that PR (cc @jrbourbeau).

Note that you can play with both ignore_metadata_file and files_per_metadata_tasks to compare the different algorithms.
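
If you want to try that comparison, the call would presumably look something like this (hedged sketch; both kwarg names are taken from the comment above and the draft PR, so they may change):

import dask.dataframe as dd

ddf = dd.read_parquet(
    "s3://my-bucket/big-dataset/",   # hypothetical path
    ignore_metadata_file=True,       # skip any global _metadata file
    files_per_metadata_tasks=16,     # hypothetical value; kwarg name from the draft PR
)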

