Skip empty partitions with optimize_metadata_queries

#14845 introduces a rule that rewrites select max(col) from T into select max(col) from "T$partitions" when col is a partition column. But as noted in #14845, the rewrite can return a wrong result when the partition holding the maximum value is empty. A probable fix is to list the contents of that partition and check whether it holds any non-empty file.
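The correctness problem can be illustrated with a toy model. The sketch below is not the engine's code: the `partitions` dictionary and both function names are hypothetical, and a non-zero file size is used as a stand-in for "the partition has rows", which is an assumption (a file can have a non-zero size and still hold no rows).

```python
from typing import Dict, List, Optional

# Hypothetical table layout: partition value -> sizes (in bytes) of its files.
# The newest partition is registered but contains only an empty file.
partitions: Dict[str, List[int]] = {
    "2021-07-31": [1024, 2048],
    "2021-08-01": [512],
    "2021-08-02": [0],
}


def max_from_metadata(parts: Dict[str, List[int]]) -> Optional[str]:
    """Mimic the rewritten query: take the max over partition names only."""
    return max(parts, default=None)


def max_skipping_empty(parts: Dict[str, List[int]]) -> Optional[str]:
    """Mimic the proposed fix: ignore partitions with no non-empty file."""
    non_empty = [
        value for value, sizes in parts.items() if any(size > 0 for size in sizes)
    ]
    return max(non_empty, default=None)
```

In this model the metadata-only answer is "2021-08-02", while a scan of the data would return "2021-08-01", which is what `max_skipping_empty` produces.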

Top GitHub Comments

yuanzhanhku commented, Aug 2, 2021

@arunthirupathi Thanks for the clarification. To summarize it in my own words (please correct me if I am wrong):

1. If the metastore has stats for a partition and they show that it is not empty, we can safely assume that the partition is not empty on disk. With this assumption, we can safely enable the rewrite by default for queries reading partitions in this case.
2. If the metastore has stats for a partition but they show that it is empty, it is still possible that the partition is not empty on disk (partition1 in your example). The chance of this happening is small.
3. If the metastore has no stats for a partition, it might still be non-empty (partition2 and partition3 in your example).

These are still in line with my statements in the previous comment. So we can safely turn on the rewrite for case 1, which might benefit most queries.
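The three cases above can be sketched as a small predicate. This is a minimal illustration, not the engine's implementation: `PartitionStats`, its `row_count` field, and `can_trust_metadata` are hypothetical names, and a missing statistic is modelled as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PartitionStats:
    # None means the metastore holds no row-count statistic for the partition.
    row_count: Optional[int]


def can_trust_metadata(stats: PartitionStats) -> bool:
    """Return True only for case 1: stats exist and report at least one row.

    Case 2 (stats report 0 rows) and case 3 (no stats) both return False,
    because the partition may still contain data on disk.
    """
    return stats.row_count is not None and stats.row_count > 0
```

Under this sketch, a partition that fails the check is not treated as empty; it would have to be checked another way, for example by listing its files.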

arunthirupathi commented, Aug 2, 2021

It is complicated, at least given how stats are implemented at Facebook.

Let us say we have 3 large partitions. Spark only had time to process partition1, so partition1 has stats while partition2 and partition3 have none. Assume partition1 contains mostly empty files at the start and non-empty files towards the end. Spark only had time to process the empty files, so it wrote partition stats with 0 rows, even though the partition actually contains data. This is purely hypothetical, so I am not sure we care about this use case. (It is most likely to happen if you bucket a table into a large number of buckets, say 100,000, and write only a single record.)
The metastore has a background job that collects the number of files and the total size of the files. It had 99.9% accuracy when I last checked, but this is not partition stats.
