read_parquet with fastparquet does not support DNF filters
What happened:
Calling read_parquet with filters in DNF format (i.e. a list of lists of tuples), using the fastparquet backend, does not filter partitions; all partitions are returned instead.
What you expected to happen:
I expected the filters to be applied as stated in the documentation of read_parquet.
Minimal Complete Verifiable Example:
I expect the following to return parts that have dataset_name == 'option1' OR dataset_name == 'option2'. Instead, all the partitions are returned regardless of dataset_name.
from dask import dataframe as dd

dd.read_parquet(
    '/some/path/to.parquet', infer_divisions=True,
    filters=[
        [
            ('dataset_name', '==', 'option1'),
            ('dataset_name', '==', 'option2')
        ]
    ])
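As a side note (an assumption on my part, not verified against these exact versions): fastparquet's flat "list of tuples" filter format does accept an 'in' predicate, and the pyarrow engine understands the DNF form, so something along these lines may serve as a workaround:

from dask import dataframe as dd

# Workaround sketch (assumed, untested on these exact versions): use the flat
# filter format with an 'in' predicate, which fastparquet does understand.
ddf = dd.read_parquet(
    '/some/path/to.parquet',
    engine='fastparquet',
    filters=[('dataset_name', 'in', ['option1', 'option2'])],
)

# Or switch to the pyarrow engine, which handles the DNF form. Note that in
# DNF the inner lists are AND-ed and the outer list is OR-ed, so the OR of
# two values needs two single-tuple inner lists.
ddf = dd.read_parquet(
    '/some/path/to.parquet',
    engine='pyarrow',
    filters=[
        [('dataset_name', '==', 'option1')],
        [('dataset_name', '==', 'option2')],
    ],
)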
Anything else we need to know?:
From a shallow analysis, it seems fastparquet.api.filter_out_cats does not handle the "list of lists of tuples" input and expects only a list of tuples. The DNF input does not crash the code, but it results in a nonsensical comparison inside fastparquet where a tuple is compared against the column name, so the filter is never applied.
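For illustration only, here is a minimal sketch of the kind of normalization the dask layer could apply before handing filters to fastparquet (the function name and behaviour are hypothetical, not actual dask or fastparquet internals):

def normalize_filters_for_fastparquet(filters):
    # Hypothetical helper: flatten a single-conjunction DNF filter
    # ([[tuple, ...]]) into the flat list-of-tuples form fastparquet expects.
    if not filters or isinstance(filters[0], tuple):
        # Empty, or already the flat form fastparquet understands.
        return filters
    if len(filters) == 1:
        # One inner conjunction can be flattened without changing its meaning.
        return list(filters[0])
    # A true disjunction of conjunctions cannot be expressed in the flat
    # format, so fail loudly rather than silently skipping the filter.
    raise NotImplementedError(
        "fastparquet's flat filter format cannot express OR of conjunctions"
    )

Failing loudly for the unsupported OR case would at least avoid the silent "no filtering happened" behaviour described above.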
Environment:
- Dask version: 2.21.0
- fastparquet version: 0.4.1
- Python version: 3.8.4
- Operating System: Win10
- Install method (conda, pip, source): pip
I think that’s enough to close it. Thanks for reporting back here @Madhu94
You are welcome to open an issue or PR over on fastparquet to request this feature, but in the meantime I think the docs should be made to reflect reality 😃