
How about introducing a lightweight filter strategy based on the column used in Dataset#repartition?

See original GitHub issue

The data is written like this: df.repartition(20, col("id")).write.parquet(path)

When we filter like this: filter(col("id") === 123), we can prune 19 of the 20 repartition files without any overhead.

It is very simple to implement: we do not need to create an index, we just call the same hash function that Dataset#repartition used and pick the matching file in listFilesWithIndexSupport.
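The pruning idea can be sketched in plain Scala. This is an illustration, not parquet-index code: the hash below is a stand-in, and a real implementation would have to call the exact Murmur3 hash that Spark's HashPartitioning applies (otherwise the wrong file would be selected). The part-NNNNN file naming is also an assumption about the writer's output layout.

```scala
// Sketch of hash-based file pruning for df.repartition(n, col("id")).
object RepartitionPruning {
  // Stand-in hash (assumption); Spark actually uses Murmur3 over the row.
  def hashOf(value: Long): Int = value.hashCode

  // Spark assigns a row to partition pmod(hash, numPartitions):
  // a modulo that is always non-negative.
  def targetPartition(value: Long, numPartitions: Int): Int = {
    val m = hashOf(value) % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Assuming one output file per partition named part-00000 .. part-NNNNN,
  // an equality filter on the repartition column keeps exactly one file.
  def pruneFiles(files: Seq[String], value: Long, numPartitions: Int): Seq[String] = {
    val idx = targetPartition(value, numPartitions)
    files.filter(_.contains(f"part-$idx%05d"))
  }
}
```

With 20 partitions, an equality filter on id = 123 would map to a single partition index, so 19 of the 20 files are skipped without reading any footers.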

I am almost done with the implementation, but I have a small concern about the entry point that enables this (currently we create the index when no index is found; there seems to be no clean way to inject this behaviour, short of implementing a new MetastoreSupport).

If you are OK with this feature, I can open a PR first. Looking forward to your advice.

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

1 reaction
sadikovi commented, Dec 5, 2018

First of all, the feature you are trying to add is similar to bucketing, so it might be worth researching that a little bit. And yes, bucketing is supported (with managed tables).

  • Yes, you do need to keep track of the columns that a dataset is repartitioned by. No one is going to provide those columns for you, much less remember which column was used to repartition a dataset created months ago. So you would have to record the column(s) somehow and perform validation to tell the user whether or not a given column is one they can filter by (or fall back to normal filtering; either one will do).

  • The name is probably too specific, but I don't see us adding anything besides Parquet, considering the number of features other datasources have.

  • Not necessarily; it depends on what the predicate is. You can repartition by multiple columns, but then the bucketing filter should only be triggered when the predicate contains all of the required columns.

  • The groupBy example is a variation of repartition: you have to group keys together in order to perform the aggregation. If such a DataFrame is saved, you can also look at the plan and decide to store an index for the groupBy columns.

  • We should support bucketing, IMHO.

  • What about non-equality filters? I did not see any assertions or validation for those. It looks like you can only apply equality filters on a bucketed file layout.

  • IMHO, this feature should be automated: users should not have to provide any columns to filter on, since we can manage all of that ourselves.

  • I am curious: does the current version of the index show the same performance on a repartitioned dataset? Have you run any benchmarks? If there is no performance improvement, I don't see the point of investing more effort into this.
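The column-tracking and all-columns-required points above can be sketched as follows. The names (IndexMetadata, canPrune) are hypothetical, not parquet-index API: the idea is simply to persist the repartition column(s) alongside the index and only enable hash pruning when equality predicates cover every one of them, falling back to a normal scan otherwise.

```scala
// Hypothetical metadata recorded at index-creation time.
final case class IndexMetadata(repartitionColumns: Seq[String], numPartitions: Int)

object BucketingFilter {
  // Pruning is safe only if every repartition column has an equality
  // predicate; a partial match must fall back to normal filtering.
  def canPrune(meta: IndexMetadata, equalityColumns: Set[String]): Boolean =
    meta.repartitionColumns.nonEmpty &&
      meta.repartitionColumns.forall(equalityColumns.contains)
}
```

For a dataset repartitioned by (id, region), a filter on id alone must not trigger pruning, because the file assignment was computed from the hash of both columns together.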

Anyway, it is a good start, but this feature needs some design work, IMHO.
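The equality-only limitation raised above can be illustrated with a small sketch (again using a stand-in hash rather than Spark's Murmur3): hashing destroys ordering, so a range predicate such as id &lt; 10 can land in every partition and nothing can be pruned.

```scala
// Why only equality predicates can prune a hash-partitioned layout.
object RangeFilterDemo {
  def partitionOf(id: Long, numPartitions: Int): Int = {
    val m = id.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Set of partitions touched by the range 0 <= id < upper.
  def touchedPartitions(upper: Long, numPartitions: Int): Set[Int] =
    (0L until upper).map(partitionOf(_, numPartitions)).toSet
}
```

Even this mild hash scatters a 10-id range across all four partitions of a 4-way layout, so a range filter would still have to read every file.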

1 reaction
sadikovi commented, Dec 1, 2018

That is a really good idea. Please do submit a PR.

