[SUPPORT] Slow file listing after update to Hudi 0.10.0

Description

I have two tables with a large number of partitions (~300k). Both contain almost the same data, but they were created and updated with slightly different configurations and versions of Hudi. For some reason I see a significant difference in file listing time when reading the two tables. The new table spends much more time reading instants and replaced file groups while listing files, whereas the other table has none to read (please see the logs below).

The first table is managed by Hudi 0.8.0. It was created with a few INSERT commits and then updated daily with the UPSERT operation. The table is auto-cleaned after each commit. Hudi configuration:

 HoodieWriteConfig.TABLE_NAME                           -> "table_v1",  
 DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY         -> "event_id",  
 DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY        -> "generated_at",  
 DataSourceWriteOptions.OPERATION_OPT_KEY               -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,  
 DataSourceWriteOptions.TABLE_TYPE_OPT_KEY              -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,  
 DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY     -> "date,source,type",  
 DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY      -> classOf[ComplexKeyGeneratorWithLowerCasePartitionPath].getName,  
 DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",  
 HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES  -> 200.mb.toBytes.toLong.toString,  
 HoodieStorageConfig.PARQUET_FILE_MAX_BYTES             -> 1.gb.toBytes.toLong.toString,  
 DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY       -> "false",  
 HoodieMetadataConfig.METADATA_ENABLE_PROP              -> "true",
 HoodieWriteConfig.UPSERT_PARALLELISM                   -> "15000"

The second table was created after the application pipeline was migrated to Hudi 0.10.0. It was created with a few INSERT_OVERWRITE commits and then updated daily with the UPSERT operation. Auto clean is disabled because the cleaning operation suffered from long file listing times (it always took ~3 hours). Instead, the table is cleaned later with org.apache.hudi.utilities.HoodieCleaner, which takes about 30 minutes.

Hudi configuration:

 HoodieWriteConfig.TBL_NAME.key                         -> "table_v2",  
 KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key           -> "event_id",  
 HoodieWriteConfig.PRECOMBINE_FIELD_NAME.key            -> "generated_at",  
 DataSourceWriteOptions.OPERATION.key                   -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,  
 DataSourceWriteOptions.TABLE_TYPE.key                  -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,  
 KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key       -> "date,source,type",  
 HoodieWriteConfig.KEYGENERATOR_CLASS_NAME.key          -> classOf[ComplexKeyGenerator].getName,  
 KeyGeneratorOptions.HIVE_STYLE_PARTITIONING_ENABLE.key -> "true",  
 HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT.key    -> 192.mb.toBytes.toLong.toString,  
 HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key          -> 256.mb.toBytes.toLong.toString,  
 DataSourceWriteOptions.HIVE_SYNC_ENABLED.key           -> "false",  
 HoodieMetadataConfig.ENABLE.key                        -> "true",  
 HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key         -> "15000",  
 HoodieWriteConfig.COMBINE_BEFORE_UPSERT.key            -> "false",  
 HoodieCompactionConfig.AUTO_CLEAN.key                  -> "false"  

The only differences I can see between the two tables are:

  • use of different Hudi versions (0.8.0 vs 0.10.0)
  • different output parquet file sizes
  • disabled auto clean
  • use of INSERT_OVERWRITE instead of INSERT during initial backfill

I would appreciate help answering a few questions:

  • Why is the clean operation so much slower in Hudi 0.10.0 than in 0.8.0 (hours vs. minutes)? I know it’s because of the number of partitions, but is it possible to restore the old performance with some configuration changes?
  • Why are the file listing times for the two tables so different? How could this be fixed?

Thanks!

How tables are read

I cleaned both tables and read a few partitions from each using Hudi 0.10.0. I disabled the metadata table and provided paths to specific partitions in READ_PATHS.

Example:

spark.read.format("org.apache.hudi").
  option("hoodie.metadata.enable", "false").
  option("hoodie.datasource.read.paths", "s3://bucket/table_v1/date=2021-12-30/source=test/type=test,s3://bucket/table_v1/date=2021-12-31/source=test/type=test").
  load()
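When many partitions are read this way, the comma-separated value for hoodie.datasource.read.paths can be assembled programmatically. The following is a minimal sketch in plain Scala, assuming the Hive-style date/source/type layout shown above. ReadPaths, Partition, and readPaths are hypothetical names used for illustration and are not part of Hudi.

```scala
// Hypothetical helper (not part of Hudi): builds the comma-separated value
// passed to the "hoodie.datasource.read.paths" option.
object ReadPaths {
  // One partition of the table; `kind` maps to the "type" partition column.
  final case class Partition(date: String, source: String, kind: String)

  // Hive-style partition path, matching the date/source/type layout above.
  def partitionPath(basePath: String, p: Partition): String =
    s"${basePath.stripSuffix("/")}/date=${p.date}/source=${p.source}/type=${p.kind}"

  // Joins all partition paths with commas, as the read option expects.
  def readPaths(basePath: String, partitions: Seq[Partition]): String =
    partitions.map(partitionPath(basePath, _)).mkString(",")
}
```

Calling ReadPaths.readPaths("s3://bucket/table_v1", Seq(ReadPaths.Partition("2021-12-30", "test", "test"), ReadPaths.Partition("2021-12-31", "test", "test"))) produces the same string that is passed to option("hoodie.datasource.read.paths", ...) in the example above.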

Expected behavior

Listing files in the provided partitions used to take a few seconds, but now it takes minutes.

Environment Description

  • Hudi version : 0.10.0
  • Spark version : 3.1.1
  • Hadoop version : 3.2.1
  • Storage : S3
  • Running on Docker? : no

Stacktrace

Logs from reading table 1 (fast):

INFO AbstractTableFileSystemView: Took 0 ms to read  0 instants, 0 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v1
INFO AbstractTableFileSystemView: Took 0 ms to read  0 instants, 0 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO AbstractTableFileSystemView: Building file system view for partition (date=2021-12-30/source=test/type=test)
INFO AbstractTableFileSystemView: addFilesToView: NumFiles=28, NumFileGroups=27, FileGroupsCreationTime=1, StoreTimeTaken=0
INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v1, caching 27 files under s3://bucket/table_v1/date=2021-12-30/source=test/type=test
INFO AbstractTableFileSystemView: Took 0 ms to read  0 instants, 0 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v1
INFO AbstractTableFileSystemView: Took 0 ms to read  0 instants, 0 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO AbstractTableFileSystemView: Building file system view for partition (date=2021-12-31/source=test/type=test)
INFO AbstractTableFileSystemView: addFilesToView: NumFiles=21, NumFileGroups=20, FileGroupsCreationTime=1, StoreTimeTaken=0
INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v1, caching 20 files under s3://bucket/table_v1/date=2021-12-31/source=test/type=test

Logs from reading table 2 (slow):

INFO AbstractTableFileSystemView: Took 8508 ms to read  17 instants, 15201 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v2
INFO AbstractTableFileSystemView: Took 8468 ms to read  17 instants, 15201 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO AbstractTableFileSystemView: Building file system view for partition (date=2021-12-19/source=test/type=test)
INFO AbstractTableFileSystemView: addFilesToView: NumFiles=47, NumFileGroups=46, FileGroupsCreationTime=3, StoreTimeTaken=0
INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v2, caching 46 files under s3://bucket/table_v2/date=2021-12-19/source=test/type=test
INFO AbstractTableFileSystemView: Took 8513 ms to read  17 instants, 15201 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v2
INFO AbstractTableFileSystemView: Took 9192 ms to read  17 instants, 15201 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO AbstractTableFileSystemView: Building file system view for partition (date=2021-12-21/source=test/type=test)
INFO AbstractTableFileSystemView: addFilesToView: NumFiles=71, NumFileGroups=70, FileGroupsCreationTime=5, StoreTimeTaken=0
INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v2, caching 70 files under s3://bucket/table_v2/date=2021-12-21/source=test/type=test
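To compare the two logs numerically, the "Took N ms to read" lines can be totalled. The following is a minimal sketch in plain Scala that assumes the log line format shown above. InstantReadTimes, Summary, and summarize are hypothetical names for illustration, not Hudi APIs.

```scala
import scala.util.matching.Regex

// Hypothetical log summarizer (not part of Hudi).
object InstantReadTimes {
  // Matches lines such as:
  // "Took 8508 ms to read  17 instants, 15201 replaced file groups"
  private val Took: Regex =
    """Took (\d+) ms to read\s+(\d+) instants, (\d+) replaced file groups""".r

  final case class Summary(matchedLines: Int, totalMs: Long, maxReplacedFileGroups: Int)

  // Sums the reported read times and tracks the largest replaced-file-group count.
  def summarize(log: Iterable[String]): Summary = {
    val hits = log.flatMap { line =>
      Took.findFirstMatchIn(line).map(m => (m.group(1).toLong, m.group(3).toInt))
    }.toSeq
    val maxReplaced = if (hits.isEmpty) 0 else hits.map(_._2).max
    Summary(hits.size, hits.map(_._1).sum, maxReplaced)
  }
}
```

Applied to the logs above, the four matching lines for table 2 add up to 34681 ms (8508 + 8468 + 8513 + 9192) with 15201 replaced file groups each, while the four matching lines for table 1 add up to 0 ms with no replaced file groups.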

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
ganczarek commented, Sep 15, 2022

@nsivabalan I’m sorry, but I no longer have a setup to test it the same way I did back in January. I will update to Hudi v0.12 and enable the metadata table to see whether performance improves, but I won’t test it on a table with 300k partitions.

I think we can close this ticket. If I see any performance issues with the latest version of Hudi, then I will create a new ticket.

Thank you for your help and assistance.

0 reactions
nsivabalan commented, Sep 13, 2022

@ganczarek: have you tried the latest version of Hudi (0.12), which has a few critical perf fixes? Also, we stabilized our metadata table in 0.11 and above. So it would be nice if you could give it a try and let us know how it goes.
