[SUPPORT] Slow file listing after update to Hudi 0.10.0
Description
I have two tables with a large number of partitions (~300k). Both contain almost the same data, but they were created and updated with slightly different configurations and versions of Hudi. For some reason I see a significant difference in file listing time when reading the two tables. Reading the new table spends much more time on file listing: it reads many instants and replaced file groups, whereas for the other table there are none (please see the logs below).
The first table is managed by Hudi 0.8.0. It was created with a few INSERT commits and then updated daily with the UPSERT operation. The table is auto cleaned after each commit. Hudi configuration:
HoodieWriteConfig.TABLE_NAME -> "table_v1",
DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "event_id",
DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "generated_at",
DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "date,source,type",
DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY -> classOf[ComplexKeyGeneratorWithLowerCasePartitionPath].getName,
DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",
HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES -> 200.mb.toBytes.toLong.toString,
HoodieStorageConfig.PARQUET_FILE_MAX_BYTES -> 1.gb.toBytes.toLong.toString,
DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "false",
HoodieMetadataConfig.METADATA_ENABLE_PROP -> "true",
HoodieWriteConfig.UPSERT_PARALLELISM -> "15000"
The second table was created after the application pipeline was migrated to Hudi 0.10.0. It was created with a few INSERT_OVERWRITE commits and then updated daily with the UPSERT operation. Auto clean is disabled because the cleaning operation suffered from long file listing times (it always took ~3 hours). Instead, the table is cleaned later with org.apache.hudi.utilities.HoodieCleaner, which takes about 30 minutes.
Hudi configuration:
HoodieWriteConfig.TBL_NAME.key -> "table_v2",
KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key -> "event_id",
HoodieWriteConfig.PRECOMBINE_FIELD_NAME.key -> "generated_at",
DataSourceWriteOptions.OPERATION.key -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
DataSourceWriteOptions.TABLE_TYPE.key -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key -> "date,source,type",
HoodieWriteConfig.KEYGENERATOR_CLASS_NAME.key -> classOf[ComplexKeyGenerator].getName,
KeyGeneratorOptions.HIVE_STYLE_PARTITIONING_ENABLE.key -> "true",
HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT.key -> 192.mb.toBytes.toLong.toString,
HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key -> 256.mb.toBytes.toLong.toString,
DataSourceWriteOptions.HIVE_SYNC_ENABLED.key -> "false",
HoodieMetadataConfig.ENABLE.key -> "true",
HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key -> "15000",
HoodieWriteConfig.COMBINE_BEFORE_UPSERT.key -> "false",
HoodieCompactionConfig.AUTO_CLEAN.key -> "false"
The only differences I can see between the two tables are:
- use of different Hudi versions (0.8.0 vs 0.10.0)
- different output parquet file sizes
- disabled auto clean
- use of INSERT_OVERWRITE instead of INSERT during initial backfill
I would appreciate help answering a few questions:
- Why is the clean operation so much slower in Hudi 0.10.0 than in 0.8.0 (hours vs. minutes)? I know it is related to the number of partitions, but is it possible to restore the old performance with some configuration changes?
- Why are the file listing times for the two tables so different? How can this be fixed?
Thanks!
How tables are read
I cleaned both tables and read a few partitions from each using Hudi 0.10.0. I disabled the metadata table and provided paths to specific partitions in READ_PATHS.
Example:
spark.read.format("org.apache.hudi").
option("hoodie.metadata.enable", "false").
option("hoodie.datasource.read.paths", "s3://bucket/table_v1/date=2021-12-30/source=test/type=test,s3://bucket/table_v1/date=2021-12-31/source=test/type=test").
load()
Expected behavior
It used to take a few seconds to list files in provided partitions, but now it takes minutes.
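To put numbers on "seconds vs. minutes", the read can be wrapped in a small timing helper. The sketch below is only an illustration: `Timing` and `timed` are hypothetical names, not part of Hudi or Spark, and the measured time covers whatever runs inside the block (file listing plus any scan the block triggers), not file listing alone.

```scala
// Minimal timing helper (hypothetical names; not a Hudi or Spark API).
object Timing {
  // Runs `block`, prints the elapsed wall-clock time, and returns the
  // block's result together with the elapsed milliseconds.
  def timed[A](label: String)(block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    println(s"$label took $elapsedMs ms")
    (result, elapsedMs)
  }
}

// Usage with the read from the example above (assumes a SparkSession named `spark`):
//
//   val (df, ms) = Timing.timed("load table_v1 partitions") {
//     spark.read.format("org.apache.hudi").
//       option("hoodie.metadata.enable", "false").
//       option("hoodie.datasource.read.paths", "s3://bucket/table_v1/date=2021-12-30/source=test/type=test").
//       load()
//   }
```

Running the same block against both tables with the same number of partitions gives a like-for-like comparison.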
Environment Description
- Hudi version : 0.10.0
- Spark version : 3.1.1
- Hadoop version : 3.2.1
- Storage : S3
- Running on Docker? : no
Stacktrace
Logs from reading table 1 (fast):
INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v1
INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO AbstractTableFileSystemView: Building file system view for partition (date=2021-12-30/source=test/type=test)
INFO AbstractTableFileSystemView: addFilesToView: NumFiles=28, NumFileGroups=27, FileGroupsCreationTime=1, StoreTimeTaken=0
INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v1, caching 27 files under s3://bucket/table_v1/date=2021-12-30/source=test/type=test
INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v1
INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO AbstractTableFileSystemView: Building file system view for partition (date=2021-12-31/source=test/type=test)
INFO AbstractTableFileSystemView: addFilesToView: NumFiles=21, NumFileGroups=20, FileGroupsCreationTime=1, StoreTimeTaken=0
INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v1, caching 20 files under s3://bucket/table_v1/date=2021-12-31/source=test/type=test
Logs from reading table 2 (slow):
INFO AbstractTableFileSystemView: Took 8508 ms to read 17 instants, 15201 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v2
INFO AbstractTableFileSystemView: Took 8468 ms to read 17 instants, 15201 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO AbstractTableFileSystemView: Building file system view for partition (date=2021-12-19/source=test/type=test)
INFO AbstractTableFileSystemView: addFilesToView: NumFiles=47, NumFileGroups=46, FileGroupsCreationTime=3, StoreTimeTaken=0
INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v2, caching 46 files under s3://bucket/table_v2/date=2021-12-19/source=test/type=test
INFO AbstractTableFileSystemView: Took 8513 ms to read 17 instants, 15201 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v2
INFO AbstractTableFileSystemView: Took 9192 ms to read 17 instants, 15201 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO AbstractTableFileSystemView: Building file system view for partition (date=2021-12-21/source=test/type=test)
INFO AbstractTableFileSystemView: addFilesToView: NumFiles=71, NumFileGroups=70, FileGroupsCreationTime=5, StoreTimeTaken=0
INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v2, caching 70 files under s3://bucket/table_v2/date=2021-12-21/source=test/type=test
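The slow part in the table 2 logs is the repeated "Took N ms to read 17 instants, 15201 replaced file groups" message, which appears twice per partition. The following sketch totals those lines from a captured log. `InstantReadTimes`, `Summary`, and `summarize` are hypothetical names introduced here, and the regular expression assumes the message format stays exactly as shown in the logs above.

```scala
// Sketch: summarize the "Took N ms to read M instants, K replaced file groups"
// lines from a Hudi reader log. All names here are hypothetical.
object InstantReadTimes {
  // Matches the AbstractTableFileSystemView message anywhere in a log line.
  private val Pattern =
    """Took (\d+) ms to read (\d+) instants, (\d+) replaced file groups""".r.unanchored

  final case class Summary(occurrences: Int, totalMs: Long, maxReplacedFileGroups: Int)

  def summarize(lines: Seq[String]): Summary = {
    // Keep only matching lines, extracting (milliseconds, replaced file groups).
    val hits = lines.collect {
      case Pattern(ms, _, replaced) => (ms.toLong, replaced.toInt)
    }
    Summary(
      occurrences = hits.size,
      totalMs = hits.map(_._1).sum,
      maxReplacedFileGroups = if (hits.isEmpty) 0 else hits.map(_._2).max
    )
  }

  def main(args: Array[String]): Unit = {
    // The four matching lines from the table 2 logs above.
    val table2 = Seq(
      "INFO AbstractTableFileSystemView: Took 8508 ms to read 17 instants, 15201 replaced file groups",
      "INFO AbstractTableFileSystemView: Took 8468 ms to read 17 instants, 15201 replaced file groups",
      "INFO AbstractTableFileSystemView: Took 8513 ms to read 17 instants, 15201 replaced file groups",
      "INFO AbstractTableFileSystemView: Took 9192 ms to read 17 instants, 15201 replaced file groups"
    )
    println(summarize(table2))
  }
}
```

For the two partitions logged above, the four lines add up to 34681 ms, so the cost grows with the number of partitions passed in the read paths.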
Top GitHub Comments
@nsivabalan I’m sorry, but I no longer have a setup to test this the same way I did back in January. I will update to Hudi v0.12 and enable the metadata table to see if there are performance improvements, but I won’t test it on a table with 300k partitions.
I think we can close this ticket. If I see any performance issues with the latest version of Hudi, then I will create a new ticket.
Thank you for your help and assistance.
@ganczarek: Have you tried the latest version of Hudi (0.12), which has a few critical perf fixes? Also, we stabilized our metadata table in 0.11 and above. So if you can give it a try and let us know how it goes, that would be nice.