[SUPPORT] Slow snapshot query performance
Environment

- Hudi version: 0.7
- Spark: 2.4.7
- DFS: GCS
Issue
The Hudi table is partitioned by day and has around 95 partitions. Each partition is between 5 GB and 15 GB, and the total size is around 930 GB. When I fire a query (count(*), count(distinct), select *) against a single day partition with default configurations in Hudi 0.8.0, it takes around 3 minutes. This was very slow, so I tried the two approaches below.
- Compression
- Using the `hoodie.file.index.enable` property
Sample query:

```sql
select count(distinct visit_nbr) from cntry_visit_hudi_tgt where cntry = 'US' and part_date = '2020-11-02'
```
```
bash-4.2$ gsutil du -sh gs://hudi-storage/hudi_table_test_v1
931.61 GiB  gs://hudi-storage/hudi_table_test_v1

gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-02/  7.07 GiB
gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-03/  6.93 GiB
gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-04/  6.14 GiB
gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-05/  6.61 GiB
gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-06/  7.42 GiB
gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-07/  8.58 GiB
gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-08/  8.11 GiB
gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-09/  6.78 GiB
gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-10/  6.63 GiB
```
Approach 1: Used Snappy compression with a compression ratio of 0.95 in 0.8.0 while writing the data. This reduced the table size to 540 GB, but there was only marginal improvement. From the Spark UI it appeared that file listing was taking most of the time, but the `hoodie.file.index.enable` property is not available in 0.8.0.
```
bash-4.2$ gsutil du -sh gs://hudi-storage/hudi_table_test_v1
540.21 GiB  gs://hudi-storage/hudi_table_test_v1
```
```scala
.option("hoodie.parquet.compression.codec", "SNAPPY")
.option("hoodie.parquet.compression.ratio", "0.95")
```
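For context, a minimal sketch of how these options fit into a full Hudi write. The record key, precombine field, key generator, and save mode below are illustrative assumptions, not taken from this thread:

```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.spark.sql.SaveMode

// df is the input DataFrame to be written
df.write.format("hudi")
  .option("hoodie.table.name", "cntry_visit_hudi_tgt")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "visit_nbr")   // assumed record key
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "part_date")  // assumed precombine field
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "cntry,part_date")
  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY,
          "org.apache.hudi.keygen.ComplexKeyGenerator")                  // multi-field partition paths need a complex key generator
  .option("hoodie.parquet.compression.codec", "SNAPPY")
  .option("hoodie.parquet.compression.ratio", "0.95")
  .mode(SaveMode.Append)
  .save("gs://hudi-storage/hudi_table_test_v1")
```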
Approach 2: Tried using `hudi-spark-bundle_2.12-0.9.0-SNAPSHOT.jar` and set `hoodie.file.index.enable = true` while reading:
```scala
import org.apache.hudi.DataSourceReadOptions

val df = spark.read.format("hudi")
  .option("hoodie.file.index.enable", true)
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .load("gs://hudi-storage/hudi_table_test_v1")

df.createOrReplaceTempView("cntry_visit_hudi_tgt")

// The SQL string needs quotes; .show() returns Unit, so no val is needed
spark.sql("select count(distinct visit_nbr) from cntry_visit_hudi_tgt where cntry = 'US' and part_date = '2020-11-02'").show(false)
```
The query still appears to spend most of its time in file listing; total execution time is around 1 min 40 secs.
Top GitHub Comments
It works without the file listing step and is reading from the metadata table!!! @xushiyan Thanks a lot for the solution.
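For readers wanting to reproduce this: the behaviour described here (file listing served by Hudi's metadata table) is typically toggled with `hoodie.metadata.enable`. A minimal sketch, assuming that key is the relevant switch for your Hudi version; the table must also have been written with it enabled so the metadata table exists:

```scala
import org.apache.hudi.DataSourceReadOptions

// hoodie.metadata.enable is assumed from HoodieMetadataConfig; verify for your version
val df = spark.read.format("hudi")
  .option("hoodie.metadata.enable", "true") // serve file listings from the metadata table, not GCS list calls
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .load("gs://hudi-storage/hudi_table_test_v1")
```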
@codejoyan After diving into the code, I saw the logic to enable HoodieFileIndex, which we should have clearly stated somewhere in the docs. Please try loading the dataset without a glob path, i.e. reading from `gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v3`. Also, to make sure partition pruning works, can you check whether your `hoodie.properties` file contains `hoodie.table.partition.fields=cntry,part_date`? If not, please add it manually there. Let us know how it works for you.
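A minimal sketch of this suggestion: load the table base path directly (no glob or partition suffix) so HoodieFileIndex can prune partitions from the query predicates. The path is the one from the comment above; the view name reuses the one from the issue:

```scala
val df = spark.read.format("hudi")
  .load("gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v3") // base path, no glob

df.createOrReplaceTempView("cntry_visit_hudi_tgt")

// With hoodie.table.partition.fields=cntry,part_date in hoodie.properties,
// these predicates should prune the scan down to a single partition
spark.sql(
  "select count(distinct visit_nbr) from cntry_visit_hudi_tgt " +
  "where cntry = 'US' and part_date = '2020-11-02'"
).show(false)
```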