[SUPPORT] Hudi creates duplicate, redundant file during clustering
Summary
During clustering, Hudi creates a duplicate parquet file with the same file group ID and identical content. One of the two files is later marked as a duplicate and deleted. I’m using inline clustering with a single writer, so there are no concurrency issues at play.
Details
The two Spark jobs above are triggered during inline clustering. As can be seen, both of these Spark jobs invoke the MultipleSparkJobExecutionStrategy.performClustering method, which ends up creating and storing two identical clustered files.
When HoodieTable.reconcileAgainstMarkers runs, the newer file is identified as a duplicate and deleted.
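For illustration (a generic Spark sketch, not Hudi's actual clustering code): one way a single logical write can land on disk twice is when an uncached RDD whose tasks have write side effects is evaluated by two separate actions. Everything below is a hypothetical stand-in: the fg-0001 file group ID, the /tmp paths, and the loose mimicry of Hudi's file naming.

import org.apache.spark.sql.SparkSession

object DoubleWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("double-write-sketch").getOrCreate()
    val sc = spark.sparkContext
    val fileId = "fg-0001" // hypothetical stand-in for a Hudi file group ID

    val writes = sc.parallelize(Seq(1), numSlices = 1).map { _ =>
      // Side effect inside a transformation: each evaluation writes a file whose
      // name pairs the fixed "file group ID" with a fresh per-attempt token,
      // loosely mimicking Hudi's <fileId>_<writeToken>_<instantTime> naming.
      val token = java.util.UUID.randomUUID().toString.take(8)
      val path = java.nio.file.Paths.get(s"/tmp/${fileId}_${token}.parquet")
      java.nio.file.Files.write(path, Array[Byte](1))
      path.toString
    }

    // Two actions on the same uncached RDD -> two Spark jobs -> the map (and
    // its write) runs twice, leaving two files for the same "file group".
    println(writes.count())
    writes.collect().foreach(println)

    spark.stop()
  }
}

Persisting the RDD before the first action would make the second action reuse the computed partitions instead of re-running the write.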
Expected behavior
Hudi should store each clustered file only once. Writing a duplicate copy unnecessarily increases the duration of the clustering run and the amount of data written to S3.
Environment Description
- Hudi version : 0.11.0
- Spark version : 3.1.2
- Storage (HDFS/S3/GCS…) : S3
- Running on Docker? (yes/no) : No
Additional context
I’m using Copy on Write and inline clustering. My write config is:
.write
.format(HUDI_WRITE_FORMAT)
.option(TBL_NAME.key(), tableName)
.option(TABLE_TYPE.key(), COW_TABLE_TYPE_OPT_VAL)
.option(PARTITIONPATH_FIELD.key(), ...)
.option(PRECOMBINE_FIELD.key(), ...)
.option(COMBINE_BEFORE_INSERT.key(), "true")
.option(KEYGENERATOR_CLASS_NAME.key(), CUSTOM_KEY_GENERATOR)
.option(URL_ENCODE_PARTITIONING.key(), "true")
.option(HIVE_SYNC_ENABLED.key(), "true")
.option(HIVE_DATABASE.key(), ...)
.option(HIVE_PARTITION_FIELDS.key(), ...)
.option(HIVE_TABLE.key(), tableName)
.option(HIVE_TABLE_PROPERTIES.key(), tableName)
.option(HIVE_PARTITION_EXTRACTOR_CLASS.key(), MULTI_PART_KEYS_VALUE_EXTRACTOR)
.option(HIVE_USE_JDBC.key(), "false")
.option(HIVE_SUPPORT_TIMESTAMP_TYPE.key(), "true")
.option(HIVE_STYLE_PARTITIONING.key(), "true")
.option(KeyGeneratorOptions.Config.TIMESTAMP_TYPE_FIELD_PROP, INPUT_TIMESTAMP_TYPE)
.option(KeyGeneratorOptions.Config.INPUT_TIME_UNIT, INPUT_TIMESTAMP_UNIT)
.option(KeyGeneratorOptions.Config.TIMESTAMP_OUTPUT_DATE_FORMAT_PROP, OUTPUT_TIMESTAMP_FORMAT)
.option(OPERATION.key(), UPSERT_OPERATION_OPT_VAL)
.option(INLINE_CLUSTERING.key(), "true")
.option(INLINE_CLUSTERING_MAX_COMMITS.key(), "2")
.option(PLAN_STRATEGY_SMALL_FILE_LIMIT.key(), "73400320") // 70MB
.option(PLAN_STRATEGY_TARGET_FILE_MAX_BYTES.key(), "209715200") // 200MB
.option(COPY_ON_WRITE_RECORD_SIZE_ESTIMATE.key(), ...)
.option(PARQUET_MAX_FILE_SIZE.key(), "104857600") // 100MB
.option(PARQUET_SMALL_FILE_LIMIT.key(), "104857600") // 100MB
.option(PARALLELISM_VALUE.key(), getParallelism().toString)
.option(FILE_LISTING_PARALLELISM_VALUE.key(), getParallelism().toString)
.option(FINALIZE_WRITE_PARALLELISM_VALUE.key(), getParallelism().toString)
.option(DELETE_PARALLELISM_VALUE.key(), getParallelism().toString)
.option(INSERT_PARALLELISM_VALUE.key(), getParallelism().toString)
.option(ROLLBACK_PARALLELISM_VALUE.key(), getParallelism().toString)
.option(UPSERT_PARALLELISM_VALUE.key(), getParallelism().toString)
.option(INDEX_TYPE.key(), indexType)
.option(SIMPLE_INDEX_PARALLELISM.key(), getParallelism().toString)
.option(AUTO_CLEAN.key(), "true")
.option(CLEANER_PARALLELISM_VALUE.key(), getParallelism().toString)
.option(CLEANER_COMMITS_RETAINED.key(), "10")
.option(PRESERVE_COMMIT_METADATA.key(), "true")
.option(HoodieMetadataConfig.ENABLE.key(), "true")
.option(META_SYNC_CONDITIONAL_SYNC.key(), "false")
.option(ROLLBACK_PENDING_CLUSTERING_ON_CONFLICT.key(), "true")
.option(UPDATES_STRATEGY.key(), "org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy")
.option(MARKERS_TYPE.key(), MarkerType.DIRECT.toString)
.mode(SaveMode.Append)
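To confirm the duplicates on S3 before reconcileAgainstMarkers removes them, a listing like the one below can help. This is a hedged sketch: the bucket and partition paths are placeholders, and it relies on Hudi's base-file naming convention (<fileId>_<writeToken>_<instantTime>.parquet), under which two files sharing the same fileId and instantTime are copies of the same file group.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object DuplicateFileGroupCheck {
  def main(args: Array[String]): Unit = {
    // Placeholder path: point this at one partition of your Hudi table.
    val partitionPath = new Path("s3a://my-bucket/my-table/partition=2022-01-01")
    val fs = FileSystem.get(partitionPath.toUri, new Configuration())

    val parquetFiles = fs.listStatus(partitionPath)
      .map(_.getPath.getName)
      .filter(_.endsWith(".parquet"))

    // Group by (fileId, instantTime); the write token in the middle is what
    // differs between the two copies of a duplicated file group.
    val byGroupAndInstant = parquetFiles.groupBy { name =>
      val parts = name.stripSuffix(".parquet").split("_")
      (parts.head, parts.last)
    }

    byGroupAndInstant.filter(_._2.length > 1).foreach { case ((fileId, instant), files) =>
      println(s"Duplicate file group $fileId at instant $instant:\n  ${files.mkString("\n  ")}")
    }
  }
}

Since reconcileAgainstMarkers deletes the extra copy during commit finalization, the duplicate pair may only be visible while the clustering commit is in flight.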
Top GitHub Comments
Here is the fix: https://github.com/apache/hudi/pull/6561. Can you verify that, with the patch, you don’t see such duplicates?
Thanks!