[SUPPORT] Slow snapshot query performance
Environment

- Hudi version: 0.7
- Spark: 2.4.7
- DFS: GCS
Issue
The Hudi table is partitioned by day and has around 95 partitions. Each partition is between 5 GB and 15 GB, and the total size is around 930 GB. When I fire a query (count(*), count(distinct), select *) against a single day partition with default configurations in Hudi 0.8.0, it takes around 3 minutes. This was very slow, so I tried the two approaches below.
- Compression
- Using the `hoodie.file.index.enable` property
Sample query:

```sql
select count(distinct visit_nbr) from cntry_visit_hudi_tgt where cntry = 'US' and part_date = '2020-11-02'
```
```
bash-4.2$ gsutil du -sh gs://hudi-storage/hudi_table_test_v1
931.61 GiB  gs://hudi-storage/hudi_table_test_v1

gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-02/  7.07 GiB
gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-03/  6.93 GiB
gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-04/  6.14 GiB
gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-05/  6.61 GiB
gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-06/  7.42 GiB
gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-07/  8.58 GiB
gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-08/  8.11 GiB
gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-09/  6.78 GiB
gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-10/  6.63 GiB
```
Approach 1: Used Snappy compression with a compression ratio of 0.95 in 0.8.0 while writing the data. This reduced the table size to 540 GB, but there was only marginal improvement. From the Spark UI it appeared that file listing was taking most of the time, but the `hoodie.file.index.enable` property is not available in 0.8.0.
```
bash-4.2$ gsutil du -sh gs://hudi-storage/hudi_table_test_v1
540.21 GiB  gs://hudi-storage/hudi_table_test_v1
```
```scala
.option("hoodie.parquet.compression.codec", "SNAPPY")
.option("hoodie.parquet.compression.ratio", "0.95")
```
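For context, a minimal sketch of how these options fit into a full Hudi write. The record key, precombine field, key generator, and save mode below are illustrative assumptions, not taken from this thread:

```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.spark.sql.SaveMode

// df is the input DataFrame to be written
df.write.format("hudi")
  .option("hoodie.table.name", "cntry_visit_hudi_tgt")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "visit_nbr")   // assumed record key
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "part_date")  // assumed precombine field
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "cntry,part_date")
  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY,
          "org.apache.hudi.keygen.ComplexKeyGenerator")                  // multi-field partition paths need a complex key generator
  .option("hoodie.parquet.compression.codec", "SNAPPY")
  .option("hoodie.parquet.compression.ratio", "0.95")
  .mode(SaveMode.Append)
  .save("gs://hudi-storage/hudi_table_test_v1")
```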
Approach 2: Tried using `hudi-spark-bundle_2.12-0.9.0-SNAPSHOT.jar` and set `hoodie.file.index.enable = true` while reading:
```scala
import org.apache.hudi.DataSourceReadOptions

val df = spark.read.format("hudi")
  .option("hoodie.file.index.enable", true)
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .load("gs://hudi-storage/hudi_table_test_v1")

df.createOrReplaceTempView("cntry_visit_hudi_tgt")

// The SQL string needs quotes; .show() returns Unit, so no val is needed
spark.sql("select count(distinct visit_nbr) from cntry_visit_hudi_tgt where cntry = 'US' and part_date = '2020-11-02'").show(false)
```
The query still appears to spend most of its time in file listing; total execution time is around 1 min 40 secs.
Top GitHub Comments
It works without the file listing step and is reading from the metadata table!!! @xushiyan Thanks a lot for the solution.
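For readers wanting to reproduce this: the behaviour described here (file listing served by Hudi's metadata table) is typically toggled with `hoodie.metadata.enable`. A minimal sketch, assuming that key is the relevant switch for your Hudi version; the table must also have been written with it enabled so the metadata table exists:

```scala
import org.apache.hudi.DataSourceReadOptions

// hoodie.metadata.enable is assumed from HoodieMetadataConfig; verify for your version
val df = spark.read.format("hudi")
  .option("hoodie.metadata.enable", "true") // serve file listings from the metadata table, not GCS list calls
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .load("gs://hudi-storage/hudi_table_test_v1")
```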
@codejoyan After diving into the code, I saw the logic to enable HoodieFileIndex, which we should have clearly stated somewhere in the docs. Please try loading the dataset without a glob path, i.e. reading from `gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v3`. Also, to make sure partition pruning works, can you check whether your `hoodie.properties` file contains `hoodie.table.partition.fields=cntry,part_date`? If not, please add it manually there. Let us know how it works for you.
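A minimal sketch of this suggestion: load the table base path directly (no glob or partition suffix) so HoodieFileIndex can prune partitions from the query predicates. The path is the one from the comment above; the view name reuses the one from the issue:

```scala
val df = spark.read.format("hudi")
  .load("gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v3") // base path, no glob

df.createOrReplaceTempView("cntry_visit_hudi_tgt")

// With hoodie.table.partition.fields=cntry,part_date in hoodie.properties,
// these predicates should prune the scan down to a single partition
spark.sql(
  "select count(distinct visit_nbr) from cntry_visit_hudi_tgt " +
  "where cntry = 'US' and part_date = '2020-11-02'"
).show(false)
```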