Skip empty partitions with optimize_metadata_queries

#14845 introduces a rule to rewrite select max(col) from T into select max(col) from "T$partitions" if col is a partition column. But as stated in #14845, the rewrite might not be correct when the returned max partition is empty. A probable fix is to list the contents of the partition and check whether it contains any non-empty files.
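
A minimal sketch of that check, written in Python with the local filesystem as a stand-in for the table's storage. `partition_has_data` is a hypothetical helper for illustration, not part of the engine; a real implementation would go through the engine's own file system abstraction. It also assumes that any file with non-zero size counts as data, which may not hold for formats that write a header or footer even when there are zero rows.

```python
from pathlib import Path


def partition_has_data(partition_dir: str) -> bool:
    """Return True if the partition directory holds at least one non-empty file."""
    root = Path(partition_dir)
    if not root.is_dir():
        # A missing directory is treated as an empty partition.
        return False
    # Walk recursively so bucketed or nested layouts are covered too.
    return any(p.is_file() and p.stat().st_size > 0 for p in root.rglob("*"))
```

The rewrite would then only trust the max partition value if this check passes for that partition, at the cost of one extra listing call per candidate partition.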

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
yuanzhanhku commented, Aug 2, 2021

@arunthirupathi Thanks for the clarification. To summarize it in my own words (please correct me if I am wrong):

  1. If the metastore has stats for a partition and they show that it is not empty, we can safely assume that the partition is not empty on disk. With this assumption, we can safely enable the rewrite by default for queries reading partitions that fall into this case.
  2. If the metastore has stats for a partition but they show that it is empty, it is still possible that the partition is not empty on disk (partition1 in your example). The chance of this happening is small.
  3. If the metastore has no stats for a partition, it might still be non-empty (partition2 and partition3 in your example).

These are still in line with my statements in the previous comment. So we can safely turn on the rewrite for case 1, which might benefit most queries.
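
The three cases can be sketched as a small decision function. This is only an illustration of the reasoning in this comment: `PartitionState`, `classify`, and `can_rewrite` are hypothetical names, and the sketch assumes the metastore row count is available as an integer, or `None` when there are no stats.

```python
from enum import Enum
from typing import Mapping, Optional


class PartitionState(Enum):
    KNOWN_NON_EMPTY = 1  # case 1: stats present and row count > 0
    REPORTED_EMPTY = 2   # case 2: stats present but row count is 0
    UNKNOWN = 3          # case 3: no stats at all


def classify(row_count: Optional[int]) -> PartitionState:
    """Map a metastore row count (None means no stats) to one of the three cases."""
    if row_count is None:
        return PartitionState.UNKNOWN
    if row_count > 0:
        return PartitionState.KNOWN_NON_EMPTY
    return PartitionState.REPORTED_EMPTY


def can_rewrite(row_counts: Mapping[str, Optional[int]]) -> bool:
    """Allow the metadata-only rewrite only if every partition read is case 1."""
    return all(
        classify(count) is PartitionState.KNOWN_NON_EMPTY
        for count in row_counts.values()
    )
```

For example, `can_rewrite({"ds=2021-08-01": 120, "ds=2021-08-02": None})` returns `False`, because the second partition has no stats and the query would have to fall back to scanning the table.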

0 reactions
arunthirupathi commented, Aug 2, 2021

It is complicated, at least given how stats are implemented at Facebook.

  1. Let us say we have 3 large partitions. Spark only had time to process partition1, so partition1 has stats, while partition2 and partition3 have none. Assume partition1 contains mostly empty files at the start and non-empty files towards the end. Spark only had time to process the empty files, so it wrote partition stats with 0 rows, even though the partition could actually contain data. This is purely hypothetical, so I am not sure we care about this use case. (It is most likely to happen if you bucket a table into a large number of buckets, say 100,000, and write very few rows, say one record.)

  2. The metastore has a background job that collects the number of files and the total size of the files. It had 99.9% accuracy when I last checked, but this is not partition stats.
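
The first scenario can be made concrete with a toy model. The sketch below is hypothetical: `sample_row_count` and the idea of a per-partition file budget only illustrate the comment and are not how Spark or the metastore actually compute stats. It shows how a stats job that reads only a prefix of a partition's files can record 0 rows even though later files hold data.

```python
from typing import List


def sample_row_count(file_row_counts: List[int], files_budget: int) -> int:
    """Row count a stats job would record after reading only the first files."""
    return sum(file_row_counts[:files_budget])


# partition1: many empty bucket files first, one record near the end.
partition1 = [0] * 9 + [1]

recorded = sample_row_count(partition1, files_budget=5)  # what the stats say
actual = sum(partition1)                                 # what is on disk

# The recorded stats claim the partition is empty, yet it contains a row.
stats_are_misleading = recorded == 0 and actual > 0
```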


Top Results From Across the Web

Optimizer properties — Trino 403 Documentation
Enable optimization of some aggregations by using values that are stored as metadata. This allows Trino to execute some simple queries in constant...
Managing partitioned tables | BigQuery - Google Cloud
Getting partition metadata using meta-tables. In legacy SQL, you can get metadata about table partitions by querying the __PARTITIONS_SUMMARY__ meta-table. Meta ...
Execute queries over $partitions at planning time · Issue #3027
The first query is automatically optimized to the second one if the optimize_metadata_queries session variable is set to true.
Improve query performance using AWS Glue partition indexes
AWS Glue partition indexes are an important configuration to reduce overall data transfers and processing, and reduce query processing time. In ...
Optimizing Table Data Structures - SingleStore Documentation
The partitioning of data into segments will likely be entirely controlled by insert_datetime6 and not use region_id . This means that queries ......