
How about introducing a lightweight filter strategy based on the column used in Dataset#repartition?

See original GitHub issue

The data is written like this: df.repartition(20, col("id")).write.parquet(path)

When we filter like this: filter(col("id") === 123), we can prune 19 of the 20 repartition files without any overhead.

It is very simple to implement: we do not need to create an index, we just call the same hash function that Dataset#repartition used and pick the matching file in listFilesWithIndexSupport.
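The pruning idea can be sketched in plain Scala. This is an illustration, not parquet-index code: the hash below is a stand-in, and a real implementation would have to call the exact Murmur3 hash that Spark's HashPartitioning applies (otherwise the wrong file would be selected). The part-NNNNN file naming is also an assumption about the writer's output layout.

```scala
// Sketch of hash-based file pruning for df.repartition(n, col("id")).
object RepartitionPruning {
  // Stand-in hash (assumption); Spark actually uses Murmur3 over the row.
  def hashOf(value: Long): Int = value.hashCode

  // Spark assigns a row to partition pmod(hash, numPartitions):
  // a modulo that is always non-negative.
  def targetPartition(value: Long, numPartitions: Int): Int = {
    val m = hashOf(value) % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Assuming one output file per partition named part-00000 .. part-NNNNN,
  // an equality filter on the repartition column keeps exactly one file.
  def pruneFiles(files: Seq[String], value: Long, numPartitions: Int): Seq[String] = {
    val idx = targetPartition(value, numPartitions)
    files.filter(_.contains(f"part-$idx%05d"))
  }
}
```

With 20 partitions, an equality filter on id = 123 would map to a single partition index, so 19 of the 20 files are skipped without reading any footers.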

I am almost done with the implementation, but I have a small concern about the entry point that enables this (currently we create the index when no index is found; there seems to be no clean way to inject this behaviour, short of implementing a new MetastoreSupport).

If you are OK with this feature, I can open a PR first. Looking forward to your advice.

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

1 reaction
sadikovi commented, Dec 5, 2018

First of all, the feature you are trying to add is similar to bucketing, so it might be worth researching that a little bit. And yes, bucketing is supported (with managed tables).

  • Yes, you do need to keep track of the columns that a dataset is repartitioned by. No one is going to provide those columns for you, much less remember which column was used to repartition a dataset created months ago. So you would have to record the column(s) somehow and perform validation to tell the user whether or not a given column is one they can filter by (or fall back to normal filtering; either one will do).

  • The name is probably too specific, but I don't see us adding anything besides Parquet, considering the number of features other datasources have.

  • Not necessarily; it depends on what the predicate is. You can repartition by multiple columns, but then the bucketing filter should only be triggered when the predicate contains all of the required columns.

  • The groupBy example is a variation of repartition: you have to group keys together in order to perform the aggregation. If such a DataFrame is saved, you can also look at the plan and decide to store an index for the groupBy columns.

  • We should support bucketing, IMHO.

  • What about non-equality filters? I did not see any assertions or validation for those. It looks like you can only apply equality filters on a bucketed file layout.

  • IMHO, this feature should be automated: users should not have to provide any columns to filter on, since we can manage all of that ourselves.

  • I am curious: does the current version of the index show the same performance on a repartitioned dataset? Have you run any benchmarks? If there is no performance improvement, I don't see the point of investing more effort into this.
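The column-tracking and all-columns-required points above can be sketched as follows. The names (IndexMetadata, canPrune) are hypothetical, not parquet-index API: the idea is simply to persist the repartition column(s) alongside the index and only enable hash pruning when equality predicates cover every one of them, falling back to a normal scan otherwise.

```scala
// Hypothetical metadata recorded at index-creation time.
final case class IndexMetadata(repartitionColumns: Seq[String], numPartitions: Int)

object BucketingFilter {
  // Pruning is safe only if every repartition column has an equality
  // predicate; a partial match must fall back to normal filtering.
  def canPrune(meta: IndexMetadata, equalityColumns: Set[String]): Boolean =
    meta.repartitionColumns.nonEmpty &&
      meta.repartitionColumns.forall(equalityColumns.contains)
}
```

For a dataset repartitioned by (id, region), a filter on id alone must not trigger pruning, because the file assignment was computed from the hash of both columns together.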

Anyway, it is a good start, but this feature needs some design work, IMHO.
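The equality-only limitation raised above can be illustrated with a small sketch (again using a stand-in hash rather than Spark's Murmur3): hashing destroys ordering, so a range predicate such as id &lt; 10 can land in every partition and nothing can be pruned.

```scala
// Why only equality predicates can prune a hash-partitioned layout.
object RangeFilterDemo {
  def partitionOf(id: Long, numPartitions: Int): Int = {
    val m = id.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Set of partitions touched by the range 0 <= id < upper.
  def touchedPartitions(upper: Long, numPartitions: Int): Set[Int] =
    (0L until upper).map(partitionOf(_, numPartitions)).toSet
}
```

Even this mild hash scatters a 10-id range across all four partitions of a 4-way layout, so a range filter would still have to read every file.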

1 reaction
sadikovi commented, Dec 1, 2018

That is a really good idea. Please do submit a PR.

