
read_parquet with fastparquet does not support DNF filters

See original GitHub issue

What happened: Calling read_parquet with filters in DNF format (i.e., a list of lists of tuples), using the fastparquet backend, does not filter partitions; all partitions are returned instead.

What you expected to happen: I expected filters to be applied as stated in the documentation of read_parquet.

Minimal Complete Verifiable Example:

I expect the following to return parts that have dataset_name == 'option1' OR dataset_name == 'option2'. Instead, all the partitions are returned regardless of dataset_name.

from dask import dataframe as dd

# DNF: the outer list is OR-ed, each inner list of tuples is AND-ed,
# so each OR branch goes in its own inner list.
dd.read_parquet(
    '/some/path/to.parquet',
    engine='fastparquet',
    infer_divisions=True,
    filters=[
        [('dataset_name', '==', 'option1')],
        [('dataset_name', '==', 'option2')],
    ])

Anything else we need to know?: From a shallow analysis, it seems fastparquet.api.filter_out_cats does not handle the "list of lists of tuples" input and expects only a list of tuples. The nested input does not crash the code, but it results in a nonsensical comparison inside fastparquet, where a whole tuple is compared against the column name, so the filter is never applied.

Environment:

  • Dask version: 2.21.0
  • fastparquet version: 0.4.1
  • Python version: 3.8.4
  • Operating System: Win10
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction
jsignell commented, Mar 1, 2021

I think that’s enough to close it. Thanks for reporting back here @Madhu94

1 reaction
jsignell commented, Nov 16, 2020

My vote would be to upgrade the fastparquet logic to match the doc as filtering is kind of useless for multi-partitioned data otherwise.

You are welcome to open an issue or PR over on fastparquet to request this feature, but in the meantime I think the docs should be made to reflect reality 😃
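Until fastparquet accepts DNF input, one stopgap is to evaluate the DNF predicate yourself against each partition's key/value pairs and only read the matching paths. A minimal sketch, assuming the standard DNF semantics (outer list OR-ed, inner lists AND-ed); the helper name is hypothetical, not a dask or fastparquet API:

```python
import operator

# Map filter operator strings to the corresponding comparison functions.
_OPS = {'==': operator.eq, '!=': operator.ne, '<': operator.lt,
        '<=': operator.le, '>': operator.gt, '>=': operator.ge}

def dnf_matches(partition_values, dnf_filters):
    """True if a partition's column values satisfy a DNF filter:
    any inner list (AND group) that is fully satisfied makes a match."""
    return any(
        all(_OPS[op](partition_values[col], val) for col, op, val in group)
        for group in dnf_filters
    )

dnf = [[('dataset_name', '==', 'option1')],
       [('dataset_name', '==', 'option2')]]
print(dnf_matches({'dataset_name': 'option1'}, dnf))  # True
print(dnf_matches({'dataset_name': 'option3'}, dnf))  # False
```

One could use this to pre-select partition directories before calling read_parquet on them, or, when the DNF filter has only a single AND group, flatten it to the List[Tuple] form that fastparquet does accept.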


