[SUPPORT] Slow snapshot query performance

Environment

  • Hudi Version - 0.7
  • Spark - 2.4.7
  • DFS - GCS

Issue

The Hudi table is partitioned by day and has around 95 partitions. Each partition is between 5 GB and 15 GB, and the total size is around 930 GB. When I run a query (count(*), count(distinct), select *) on a single day partition with the default configuration in Hudi 0.8.0, it takes around 3 minutes. This was very slow, so I tried the two approaches below.

  1. Compression
  2. Use hoodie.file.index.enable property

Sample Query: select count(distinct visit_nbr) from cntry_visit_hudi_tgt where cntry = 'US' and part_date = '2020-11-02'

        bash-4.2$ gsutil du -sh gs://hudi-storage/hudi_table_test_v1
        931.61 GiB gs://hudi-storage/hudi_table_test_v1

        gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-02/ 7.07 GiB
        gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-03/ 6.93 GiB
        gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-04/ 6.14 GiB
        gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-05/ 6.61 GiB
        gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-06/ 7.42 GiB
        gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-07/ 8.58 GiB
        gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-08/ 8.11 GiB
        gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-09/ 6.78 GiB
        gs://hudi-storage/hudi_table_test_v1/cntry=US/part_date=2020-11-10/ 6.63 GiB

Approach 1: Used snappy compression with a compression ratio of 0.95 in 0.8.0 while writing the data. This reduced the table size to 540 GB, but the improvement was marginal. From the Spark UI, it seemed that file listing was what took the time. However, the "hoodie.file.index.enable" property is not available in 0.8.0.

        bash-4.2$ gsutil du -sh gs://hudi-storage/hudi_table_test_v1
        540.21 GiB gs://hudi-storage/hudi_table_test_v1

        .option("hoodie.parquet.compression.codec", "SNAPPY")
        .option("hoodie.parquet.compression.ratio", "0.95")

Approach 2: Tried using hudi-spark-bundle_2.12-0.9.0-SNAPSHOT.jar and set "hoodie.file.index.enable" = true while reading.

        import org.apache.hudi.DataSourceReadOptions

        // Snapshot read of the table base path with the Hudi file index enabled
        val df = spark.read.format("hudi")
          .option("hoodie.file.index.enable", true)
          .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
          .load("gs://hudi-storage/hudi_table_test_v1")
        df.createOrReplaceTempView("cntry_visit_hudi_tgt")
        // show() returns Unit, so the result is not assigned to a val
        spark.sql("select count(distinct visit_nbr) from cntry_visit_hudi_tgt where cntry = 'US' and part_date = '2020-11-02'").show(false)

File listing still appears to take time, and the total query execution time is around 1 min 40 secs.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
codejoyan commented, Sep 19, 2021

It works! The query now runs without the file listing step and reads from the metadata table. @xushiyan Thanks a lot for the solution.

0 reactions
xushiyan commented, Sep 19, 2021

@codejoyan After diving into the code, I saw the logic that enables HoodieFileIndex, which we should have clearly stated somewhere in the docs. Please try loading the dataset without a glob path, for example by reading from "gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v3". Also, to make sure partition pruning works, can you check whether your hoodie.properties file contains hoodie.table.partition.fields= cntry,part_date ? If not, please add it manually there. Let us know how it works for you.
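
The two checks suggested above can be sketched in plain Scala. This is only an illustration, not part of Hudi: the object HudiReadChecks and the helpers hasGlob and partitionFields are hypothetical names, and the glob character set is an assumption about what counts as a glob path. The sketch takes the text of hoodie.properties as a string and reports which partition fields, if any, are configured.

```scala
import java.io.StringReader
import java.util.Properties

// Hypothetical helpers for illustration; they are not part of the Hudi API.
object HudiReadChecks {
  // True if the load path contains a common glob character,
  // e.g. "gs://bucket/table/*/*". The character set is an assumption.
  def hasGlob(path: String): Boolean =
    path.exists(c => "*?[]{}".indexOf(c) >= 0)

  // Parses the text of hoodie.properties and returns the fields listed under
  // hoodie.table.partition.fields, or an empty Seq if the key is absent.
  def partitionFields(propsText: String): Seq[String] = {
    val props = new Properties()
    props.load(new StringReader(propsText))
    Option(props.getProperty("hoodie.table.partition.fields"))
      .map(_.split(",").map(_.trim).filter(_.nonEmpty).toSeq)
      .getOrElse(Seq.empty)
  }
}
```

In practice the properties text would come from the hoodie.properties file under the table's .hoodie directory. An empty result from partitionFields would suggest the key needs to be added, as described in the comment above.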
