
read_parquet with fastparquet does not support DNF filters

See original GitHub issue

What happened: Calling read_parquet with filters in DNF format (i.e., a list of lists of tuples), using the fastparquet backend, does not filter partitions; all partitions are returned instead.

What you expected to happen: I expected filters to be applied as stated in the documentation of read_parquet.

Minimal Complete Verifiable Example:

I expect the following to return parts that have dataset_name == 'option1' OR dataset_name == 'option2'. Instead, all the partitions are returned regardless of dataset_name.

from dask import dataframe as dd

# DNF: the outer list is OR-ed, each inner list of tuples is AND-ed,
# so each OR branch goes in its own inner list.
dd.read_parquet(
    '/some/path/to.parquet',
    engine='fastparquet',
    infer_divisions=True,
    filters=[
        [('dataset_name', '==', 'option1')],
        [('dataset_name', '==', 'option2')],
    ])

Anything else we need to know?: From a shallow analysis, it seems fastparquet.api.filter_out_cats does not handle the "list of lists of tuples" input and expects only a list of tuples. The nested input does not crash the code, but it results in a nonsensical comparison inside fastparquet, where a whole tuple is compared against the column name, so the filter is never applied.

Environment:

  • Dask version: 2.21.0
  • fastparquet version: 0.4.1
  • Python version: 3.8.4
  • Operating System: Win10
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction
jsignell commented, Mar 1, 2021

I think that’s enough to close it. Thanks for reporting back here @Madhu94

1 reaction
jsignell commented, Nov 16, 2020

My vote would be to upgrade the fastparquet logic to match the doc as filtering is kind of useless for multi-partitioned data otherwise.

You are welcome to open an issue or PR over on fastparquet to request this feature, but in the meantime I think the docs should be made to reflect reality 😃
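Until fastparquet accepts DNF input, one stopgap is to evaluate the DNF predicate yourself against each partition's key/value pairs and only read the matching paths. A minimal sketch, assuming the standard DNF semantics (outer list OR-ed, inner lists AND-ed); the helper name is hypothetical, not a dask or fastparquet API:

```python
import operator

# Map filter operator strings to the corresponding comparison functions.
_OPS = {'==': operator.eq, '!=': operator.ne, '<': operator.lt,
        '<=': operator.le, '>': operator.gt, '>=': operator.ge}

def dnf_matches(partition_values, dnf_filters):
    """True if a partition's column values satisfy a DNF filter:
    any inner list (AND group) that is fully satisfied makes a match."""
    return any(
        all(_OPS[op](partition_values[col], val) for col, op, val in group)
        for group in dnf_filters
    )

dnf = [[('dataset_name', '==', 'option1')],
       [('dataset_name', '==', 'option2')]]
print(dnf_matches({'dataset_name': 'option1'}, dnf))  # True
print(dnf_matches({'dataset_name': 'option3'}, dnf))  # False
```

One could use this to pre-select partition directories before calling read_parquet on them, or, when the DNF filter has only a single AND group, flatten it to the List[Tuple] form that fastparquet does accept.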


