Skip empty partitions with optimize_metadata_queries
#14845 introduces a rule to rewrite select max(col) from T into select max(col) from "T$partitions" if col is a partition column. But as stated in #14845, the rewrite might not be correct when the returned max partition is empty. A probable fix is to list the contents of the partition and check whether it holds any non-empty file.
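The proposed check can be sketched roughly as follows. This is only an illustration in Python, not the engine's actual code: the DataFile type, the max_non_empty_partition helper and the sample listing are all hypothetical, and file size is used as a stand-in for "non-empty" (a file with a non-zero size could still hold zero rows in some formats).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DataFile:
    """Hypothetical record for one file found when listing a partition."""
    path: str
    size_bytes: int


def max_non_empty_partition(partitions: Dict[str, List[DataFile]]) -> Optional[str]:
    """Return the largest partition value holding at least one non-empty file.

    Assumption: partition values compare correctly as strings.
    Returns None when every partition is empty.
    """
    for value in sorted(partitions, reverse=True):
        if any(f.size_bytes > 0 for f in partitions[value]):
            return value
    return None


# Hypothetical listing: the two newest partitions contain no data.
listing = {
    "2021-01-01": [DataFile("ds=2021-01-01/part-0", 128)],
    "2021-01-02": [DataFile("ds=2021-01-02/part-0", 0)],
    "2021-01-03": [],
}

naive_answer = max(listing)                         # what a metadata-only rewrite would see
checked_answer = max_non_empty_partition(listing)   # skips the empty partitions
```

With this listing the metadata-only answer and the checked answer differ, which is the incorrect-result case the issue describes.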
Issue Analytics
- State:
- Created 2 years ago
- Comments:6 (6 by maintainers)
Top GitHub Comments
@arunthirupathi Thanks for the clarification. To summarize it in my own words (please correct me if I am wrong):
These are still in line with my statements in the previous comment. So we can safely turn on the rewrite for case 1, which might benefit most queries.
It is complicated, at least given how stats are implemented at Facebook.
Let us say we have 3 large partitions. Spark only had time to process partition 1, so partition 1 has stats while partition 2 and partition 3 have none. Assume partition 1 contains mostly empty files at the start but non-empty files towards the end. Spark only had time to process the empty files, so it wrote partition stats with 0 rows, even though the partition could actually contain data. This is purely hypothetical, so I am not sure we care about this use case. (It is most likely to happen if you bucket a table into a large number of buckets, say 100,000, and write only a few rows, for example one record.)
The metastore has a background job that collects the number of files and the total size of the files. This had 99.9% accuracy when I last checked, but it is not partition stats.
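One way to read the scenario above is that a recorded row count of 0 cannot be treated as proof that a partition is empty, and neither can missing stats. A minimal Python sketch of that decision, assuming (and this is only an assumption) that a positive row count can be trusted; the Emptiness enum and classify function are hypothetical names, not part of any metastore API:

```python
from enum import Enum
from typing import Optional


class Emptiness(Enum):
    NON_EMPTY = "non_empty"
    UNKNOWN = "unknown"   # needs a file listing or a real scan to decide


def classify(row_count: Optional[int]) -> Emptiness:
    """Classify a partition from its (possibly missing) row-count stat.

    Assumption: a positive row count is reliable. A count of 0 or a
    missing stat is not, because stats may have been written from only
    part of the partition's files.
    """
    if row_count is not None and row_count > 0:
        return Emptiness.NON_EMPTY
    return Emptiness.UNKNOWN


# Partition 1 has stats with 0 rows, partitions 2 and 3 have none.
results = [classify(0), classify(None), classify(None)]
```

Under this reading, none of the three partitions in the example could be skipped on stats alone.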