[SUPPORT] Hudi creates duplicate, redundant file during clustering
Summary
During clustering, Hudi creates a duplicate parquet file with the same file group ID and identical content. One of the two files is later marked as a duplicate and deleted. I’m using inline clustering with a single writer, so there are no concurrency issues at play.
Details
The two Spark jobs above are triggered during inline clustering. As can be seen, both of these Spark jobs invoke the MultipleSparkJobExecutionStrategy.performClustering method, which ends up creating and storing two identical clustered files.
When HoodieTable.reconcileAgainstMarkers runs, the newer file is identified as a duplicate and deleted.
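For illustration (a generic Spark sketch, not Hudi's actual clustering code): one way a single logical write can land on disk twice is when an uncached RDD whose tasks have write side effects is evaluated by two separate actions. Everything below is a hypothetical stand-in: the fg-0001 file group ID, the /tmp paths, and the loose mimicry of Hudi's file naming.

import org.apache.spark.sql.SparkSession

object DoubleWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("double-write-sketch").getOrCreate()
    val sc = spark.sparkContext
    val fileId = "fg-0001" // hypothetical stand-in for a Hudi file group ID

    val writes = sc.parallelize(Seq(1), numSlices = 1).map { _ =>
      // Side effect inside a transformation: each evaluation writes a file whose
      // name pairs the fixed "file group ID" with a fresh per-attempt token,
      // loosely mimicking Hudi's <fileId>_<writeToken>_<instantTime> naming.
      val token = java.util.UUID.randomUUID().toString.take(8)
      val path = java.nio.file.Paths.get(s"/tmp/${fileId}_${token}.parquet")
      java.nio.file.Files.write(path, Array[Byte](1))
      path.toString
    }

    // Two actions on the same uncached RDD -> two Spark jobs -> the map (and
    // its write) runs twice, leaving two files for the same "file group".
    println(writes.count())
    writes.collect().foreach(println)

    spark.stop()
  }
}

Persisting the RDD before the first action would make the second action reuse the computed partitions instead of re-running the write.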
Expected behavior
Hudi should store each clustered file only once. Writing a duplicate copy unnecessarily increases the duration of the clustering run and the amount of data written to S3.
Environment Description
- Hudi version : 0.11.0
- Spark version : 3.1.2
- Storage (HDFS/S3/GCS…) : S3
- Running on Docker? (yes/no) : No
Additional context
I’m using Copy on Write and inline clustering. My write config is:
.write
.format(HUDI_WRITE_FORMAT)
.option(TBL_NAME.key(), tableName)
.option(TABLE_TYPE.key(), COW_TABLE_TYPE_OPT_VAL)
.option(PARTITIONPATH_FIELD.key(), ...)
.option(PRECOMBINE_FIELD.key(), ...)
.option(COMBINE_BEFORE_INSERT.key(), "true")
.option(KEYGENERATOR_CLASS_NAME.key(), CUSTOM_KEY_GENERATOR)
.option(URL_ENCODE_PARTITIONING.key(), "true")
.option(HIVE_SYNC_ENABLED.key(), "true")
.option(HIVE_DATABASE.key(), ...)
.option(HIVE_PARTITION_FIELDS.key(), ...)
.option(HIVE_TABLE.key(), tableName)
.option(HIVE_TABLE_PROPERTIES.key(), tableName)
.option(HIVE_PARTITION_EXTRACTOR_CLASS.key(), MULTI_PART_KEYS_VALUE_EXTRACTOR)
.option(HIVE_USE_JDBC.key(), "false")
.option(HIVE_SUPPORT_TIMESTAMP_TYPE.key(), "true")
.option(HIVE_STYLE_PARTITIONING.key(), "true")
.option(KeyGeneratorOptions.Config.TIMESTAMP_TYPE_FIELD_PROP, INPUT_TIMESTAMP_TYPE)
.option(KeyGeneratorOptions.Config.INPUT_TIME_UNIT, INPUT_TIMESTAMP_UNIT)
.option(KeyGeneratorOptions.Config.TIMESTAMP_OUTPUT_DATE_FORMAT_PROP, OUTPUT_TIMESTAMP_FORMAT)
.option(OPERATION.key(), UPSERT_OPERATION_OPT_VAL)
.option(INLINE_CLUSTERING.key(), "true")
.option(INLINE_CLUSTERING_MAX_COMMITS.key(), "2")
.option(PLAN_STRATEGY_SMALL_FILE_LIMIT.key(), "73400320") // 70MB
.option(PLAN_STRATEGY_TARGET_FILE_MAX_BYTES.key(), "209715200") // 200MB
.option(COPY_ON_WRITE_RECORD_SIZE_ESTIMATE.key(), ...)
.option(PARQUET_MAX_FILE_SIZE.key(), "104857600") // 100MB
.option(PARQUET_SMALL_FILE_LIMIT.key(), "104857600") // 100MB
.option(PARALLELISM_VALUE.key(), getParallelism().toString)
.option(FILE_LISTING_PARALLELISM_VALUE.key(), getParallelism().toString)
.option(FINALIZE_WRITE_PARALLELISM_VALUE.key(), getParallelism().toString)
.option(DELETE_PARALLELISM_VALUE.key(), getParallelism().toString)
.option(INSERT_PARALLELISM_VALUE.key(), getParallelism().toString)
.option(ROLLBACK_PARALLELISM_VALUE.key(), getParallelism().toString)
.option(UPSERT_PARALLELISM_VALUE.key(), getParallelism().toString)
.option(INDEX_TYPE.key(), indexType)
.option(SIMPLE_INDEX_PARALLELISM.key(), getParallelism().toString)
.option(AUTO_CLEAN.key(), "true")
.option(CLEANER_PARALLELISM_VALUE.key(), getParallelism().toString)
.option(CLEANER_COMMITS_RETAINED.key(), "10")
.option(PRESERVE_COMMIT_METADATA.key(), "true")
.option(HoodieMetadataConfig.ENABLE.key(), "true")
.option(META_SYNC_CONDITIONAL_SYNC.key(), "false")
.option(ROLLBACK_PENDING_CLUSTERING_ON_CONFLICT.key(), "true")
.option(UPDATES_STRATEGY.key(), "org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy")
.option(MARKERS_TYPE.key(), MarkerType.DIRECT.toString)
.mode(SaveMode.Append)
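To confirm the duplicates on S3 before reconcileAgainstMarkers removes them, a listing like the one below can help. This is a hedged sketch: the bucket and partition paths are placeholders, and it relies on Hudi's base-file naming convention (<fileId>_<writeToken>_<instantTime>.parquet), under which two files sharing the same fileId and instantTime are copies of the same file group.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object DuplicateFileGroupCheck {
  def main(args: Array[String]): Unit = {
    // Placeholder path: point this at one partition of your Hudi table.
    val partitionPath = new Path("s3a://my-bucket/my-table/partition=2022-01-01")
    val fs = FileSystem.get(partitionPath.toUri, new Configuration())

    val parquetFiles = fs.listStatus(partitionPath)
      .map(_.getPath.getName)
      .filter(_.endsWith(".parquet"))

    // Group by (fileId, instantTime); the write token in the middle is what
    // differs between the two copies of a duplicated file group.
    val byGroupAndInstant = parquetFiles.groupBy { name =>
      val parts = name.stripSuffix(".parquet").split("_")
      (parts.head, parts.last)
    }

    byGroupAndInstant.filter(_._2.length > 1).foreach { case ((fileId, instant), files) =>
      println(s"Duplicate file group $fileId at instant $instant:\n  ${files.mkString("\n  ")}")
    }
  }
}

Since reconcileAgainstMarkers deletes the extra copy during commit finalization, the duplicate pair may only be visible while the clustering commit is in flight.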
Top GitHub Comments
Here is the fix: https://github.com/apache/hudi/pull/6561. Can you verify that, with the patch, you don’t see such duplicates?
Thanks!