How about introducing a lightweight filter strategy based on the column used in Dataset#repartition?
Suppose the data was built like this: df.repartition(20, col("id")).write.parquet(path)
When filtering like this: filter(col("id") === 123), we can prune 19 of the 20 repartition files without any overhead.
It is very simple to implement: we don't need to create the index, just call the same hash function that Dataset#repartition used and pick the matching file in listFilesWithIndexSupport.
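A minimal sketch of the idea, assuming Spark's standard part-NNNNN output file naming and using the internal catalyst Murmur3HashFunction (an unstable API that may differ across Spark versions); selectBucketFile is a hypothetical helper, not part of parquet-index:

    import org.apache.hadoop.fs.FileStatus
    import org.apache.spark.sql.catalyst.expressions.Murmur3HashFunction
    import org.apache.spark.sql.types.IntegerType

    // Replay HashPartitioning for the literal 123: Dataset#repartition places a row in
    // partition pmod(murmur3(value, seed = 42), numPartitions).
    val numPartitions = 20
    val hash = Murmur3HashFunction.hash(123, IntegerType, 42L).toInt
    val targetPartition = ((hash % numPartitions) + numPartitions) % numPartitions

    // Spark writes one file per partition named "part-NNNNN-<uuid>...", so the single
    // candidate file can be selected by the partition id embedded in the file name.
    val PartFile = """part-(\d{5})-.*""".r
    def selectBucketFile(files: Seq[FileStatus]): Seq[FileStatus] =
      files.filter { f =>
        f.getPath.getName match {
          case PartFile(idx) => idx.toInt == targetPartition
          case _             => true // keep files we cannot classify
        }
      }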
I have almost finished implementing this, but I have a small concern about the entry point that enables it (right now we create the index when none is found; there seems to be no clean way to inject this, short of implementing a new MetastoreSupport).
If you are OK with this feature, I can submit a PR first. Looking forward to your advice.
Top GitHub Comments
First of all, the feature you are trying to add is similar to bucketing, so it might be worth researching that a little bit. And yes, it is supported (with managed tables).
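For reference, Spark's built-in bucketing works only with managed tables via saveAsTable; the table name below is illustrative:

    // bucketBy records the bucket spec in the metastore; Spark can then avoid shuffles
    // on "id" and, in newer versions, prune buckets for equality filters.
    df.write
      .bucketBy(20, "id")
      .sortBy("id")
      .saveAsTable("events_bucketed")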
Yes, you do need to keep track of the columns the dataset is repartitioned by. No one is going to provide that column for you, much less remember which column was used to repartition a dataset created months ago. So you would have to record the column(s) somehow and perform validation to tell the user whether or not a given column is one they can filter by (or fall back to normal filtering; either one will do).
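A sketch of what recording and validating might look like; RepartitionSpec and canPrune are hypothetical names, not part of parquet-index:

    // Hypothetical metadata persisted alongside the index at write/index time.
    case class RepartitionSpec(numPartitions: Int, columns: Seq[String])

    // Pruning is only valid when every recorded column is constrained by an equality
    // predicate; otherwise fall back to the normal filtering path.
    def canPrune(spec: RepartitionSpec, equalityCols: Set[String]): Boolean =
      spec.columns.nonEmpty && spec.columns.forall(equalityCols.contains)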
The name is probably too specific, but I don't see us adding anything besides Parquet, considering the feature sets other data sources have.
Not necessarily; it depends on what the predicate is. You can repartition by multiple columns, but then the bucketing filter should only be triggered when the predicate contains all of the required columns.
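The reason all columns are required: Spark's Murmur3Hash folds over every partitioning expression, seeding each value's hash with the previous one (internal catalyst API, version-dependent), so the final bucket is unknowable from a subset of the values. A sketch for repartition(20, col("a"), col("b")) with literals (123, "x"):

    import org.apache.spark.sql.catalyst.expressions.Murmur3HashFunction
    import org.apache.spark.sql.types.{IntegerType, StringType}
    import org.apache.spark.unsafe.types.UTF8String

    // Each value's hash seeds the next; missing either literal makes the bucket unknowable.
    val h1 = Murmur3HashFunction.hash(123, IntegerType, 42L)
    val h2 = Murmur3HashFunction.hash(UTF8String.fromString("x"), StringType, h1)
    val bucket = ((h2.toInt % 20) + 20) % 20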
The groupBy example is a variation of repartition: keys have to be grouped together in order to perform the aggregation. If such a DataFrame is saved, we can likewise look at the plan and decide to store an index for the groupBy columns.
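A sketch of that plan inspection, using the developer-facing queryExecution API (outputPartitioning is internal and may change between versions):

    import org.apache.spark.sql.catalyst.plans.physical.HashPartitioning

    // At save time, check whether the physical plan's output is hash partitioned,
    // whether by repartition(n, cols) or by an aggregation's shuffle.
    df.queryExecution.executedPlan.outputPartitioning match {
      case HashPartitioning(exprs, n) =>
        println(s"hash partitioned into $n partitions by: ${exprs.mkString(", ")}")
      case other =>
        println(s"no hash partitioning to index: $other")
    }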
We should support bucketing, IMHO.
What about non-equality filters? I did not see any assertions or validations for those. It looks like you can only apply equality filters on a bucketed file layout.
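One way to make that restriction explicit, sketched against Spark's public data source Filter API (candidateBuckets and bucketOf are hypothetical):

    import org.apache.spark.sql.sources.{EqualTo, Filter, In}

    // EqualTo maps to exactly one bucket and In to a small set of buckets; range filters
    // cannot be pruned at all because hashing destroys ordering.
    def candidateBuckets(filter: Filter, bucketOf: Any => Int): Option[Set[Int]] =
      filter match {
        case EqualTo(_, value) => Some(Set(bucketOf(value)))
        case In(_, values)     => Some(values.map(bucketOf).toSet)
        case _                 => None // e.g. GreaterThan: scan every file
      }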
IMHO, this feature should be automated: users should not have to provide any columns to filter by, since we can manage all of that ourselves.
I am curious: does the current version of the index show the same performance on a repartitioned dataset? Have you run any benchmarks? If there is no performance improvement, I don't see the point of investing more effort into this.
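An ad-hoc way to measure, assuming the spark.index read API from the parquet-index README and the spark/path values from the example above:

    import org.apache.spark.sql.functions.col
    import com.github.lightcopy.implicits._ // parquet-index implicits

    def time[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
      result
    }

    // Compare a plain filtered scan against an index-backed one on the same data.
    time("plain scan")(spark.read.parquet(path).filter(col("id") === 123).count())
    time("indexed scan")(spark.index.parquet(path).filter(col("id") === 123).count())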
Anyway, it is a good start, but it needs some design work, IMHO.
That is a really good idea. Please do submit a PR.