
[SUPPORT] Hudi creates duplicate, redundant file during clustering


Summary

During clustering, Hudi creates a duplicate parquet file with the same file group ID and identical content. One of the two files is later marked as a duplicate and deleted. I’m using inline clustering with a single writer, so there are no concurrency issues at play.

Details

[Screenshots, 2022-07-26: Spark UI showing the two clustering jobs]

The two Spark jobs above are triggered during inline clustering.

[Screenshots, 2022-07-26: stage details for both jobs]

As can be seen above, both Spark jobs invoke the MultipleSparkJobExecutionStrategy.performClustering method, which ends up creating and storing two identical clustered files.
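
For context, one common way a single logical write turns into two identical Spark jobs is an unpersisted Dataset/RDD being consumed by two actions: without .cache()/.persist(), each action re-runs the entire lineage, including any write side effects. The sketch below only illustrates that general pattern (the object name and toy side effect are mine, not Hudi's code); it is not a confirmed root-cause analysis of this issue:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: a side-effecting stage evaluated by two actions runs twice.
    object DoubleEvaluationSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("double-eval-sketch")
          .master("local[*]")
          .getOrCreate()

        // Stand-in for the stage that writes the clustered file.
        val written = spark.sparkContext
          .parallelize(Seq(1, 2, 3))
          .map { n =>
            println(s"writing file for record $n") // side effect
            n
          }

        // Without written.persist(), each action below re-evaluates the map
        // stage, so the "write" happens twice -- one Spark job per action.
        written.count()
        written.collect()

        spark.stop()
      }
    }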

When HoodieTable.reconcileAgainstMarkers runs, the newer file is identified as a duplicate and deleted.
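
For readers unfamiliar with marker-based reconciliation: during a write, Hudi drops a marker for every data file it creates, and before finalizing the commit it deletes data files that have markers but are not accounted for in the commit metadata (e.g. leftovers from retried tasks). Below is a heavily simplified sketch of that idea; the helper and file names are hypothetical, not Hudi's actual API:

    // Hypothetical simplification of the reconcile step; not Hudi's real API.
    object ReconcileSketch {
      /** @param markedFiles    data files for which write markers exist
        * @param committedFiles data files the commit metadata accounts for */
      def findDuplicates(markedFiles: Set[String], committedFiles: Set[String]): Set[String] =
        markedFiles -- committedFiles

      def main(args: Array[String]): Unit = {
        // Two base files for the same file group (illustrative names).
        val marked    = Set("fg1_0-21-34_20220726.parquet", "fg1_0-58-72_20220726.parquet")
        val committed = Set("fg1_0-21-34_20220726.parquet")
        findDuplicates(marked, committed)
          .foreach(f => println(s"delete duplicate: $f")) // the newer copy is removed
      }
    }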

Expected behavior

Hudi should store the clustered file only once. Writing the file twice unnecessarily increases clustering duration and storage I/O.

Environment Description

  • Hudi version: 0.11.0
  • Spark version: 3.1.2
  • Storage (HDFS/S3/GCS…): S3
  • Running on Docker? (yes/no): No

Additional context

I’m using Copy on Write and inline clustering. My write config is:

// Imports for the Hudi option-key constants (DataSourceWriteOptions,
// HoodieWriteConfig, clustering and key-generator configs) are elided
// here, as in the original snippet.
import org.apache.spark.sql.SaveMode

df.write // df: the upstream DataFrame (name assumed; not shown in the original)
      .format(HUDI_WRITE_FORMAT)
      .option(TBL_NAME.key(), tableName)
      .option(TABLE_TYPE.key(), COW_TABLE_TYPE_OPT_VAL)
      .option(PARTITIONPATH_FIELD.key(), ...)
      .option(PRECOMBINE_FIELD.key(), ...)
      .option(COMBINE_BEFORE_INSERT.key(), "true")
      .option(KEYGENERATOR_CLASS_NAME.key(), CUSTOM_KEY_GENERATOR)
      .option(URL_ENCODE_PARTITIONING.key(), "true")
      .option(HIVE_SYNC_ENABLED.key(), "true")
      .option(HIVE_DATABASE.key(), ...)
      .option(HIVE_PARTITION_FIELDS.key(), ...)
      .option(HIVE_TABLE.key(), tableName)
      .option(HIVE_TABLE_PROPERTIES.key(), tableName)
      .option(HIVE_PARTITION_EXTRACTOR_CLASS.key(), MULTI_PART_KEYS_VALUE_EXTRACTOR)
      .option(HIVE_USE_JDBC.key(), "false")
      .option(HIVE_SUPPORT_TIMESTAMP_TYPE.key(), "true")
      .option(HIVE_STYLE_PARTITIONING.key(), "true")
      .option(KeyGeneratorOptions.Config.TIMESTAMP_TYPE_FIELD_PROP, INPUT_TIMESTAMP_TYPE)
      .option(KeyGeneratorOptions.Config.INPUT_TIME_UNIT, INPUT_TIMESTAMP_UNIT)
      .option(KeyGeneratorOptions.Config.TIMESTAMP_OUTPUT_DATE_FORMAT_PROP, OUTPUT_TIMESTAMP_FORMAT)
      .option(OPERATION.key(), UPSERT_OPERATION_OPT_VAL)
      .option(INLINE_CLUSTERING.key(), "true")
      .option(INLINE_CLUSTERING_MAX_COMMITS.key(), "2")
      .option(PLAN_STRATEGY_SMALL_FILE_LIMIT.key(), "73400320") // 70MB
      .option(PLAN_STRATEGY_TARGET_FILE_MAX_BYTES.key(), "209715200") // 200MB
      .option(COPY_ON_WRITE_RECORD_SIZE_ESTIMATE.key(), ...)
      .option(PARQUET_MAX_FILE_SIZE.key(), "104857600") // 100MB
      .option(PARQUET_SMALL_FILE_LIMIT.key(), "104857600") // 100MB
      .option(PARALLELISM_VALUE.key(), getParallelism().toString)
      .option(FILE_LISTING_PARALLELISM_VALUE.key(), getParallelism().toString)
      .option(FINALIZE_WRITE_PARALLELISM_VALUE.key(), getParallelism().toString)
      .option(DELETE_PARALLELISM_VALUE.key(), getParallelism().toString)
      .option(INSERT_PARALLELISM_VALUE.key(), getParallelism().toString)
      .option(ROLLBACK_PARALLELISM_VALUE.key(), getParallelism().toString)
      .option(UPSERT_PARALLELISM_VALUE.key(), getParallelism().toString)
      .option(INDEX_TYPE.key(), indexType)
      .option(SIMPLE_INDEX_PARALLELISM.key(), getParallelism().toString)
      .option(AUTO_CLEAN.key(), "true")
      .option(CLEANER_PARALLELISM_VALUE.key(), getParallelism().toString)
      .option(CLEANER_COMMITS_RETAINED.key(), "10")
      .option(PRESERVE_COMMIT_METADATA.key(), "true")
      .option(HoodieMetadataConfig.ENABLE.key(), "true")
      .option(META_SYNC_CONDITIONAL_SYNC.key(), "false")
      .option(ROLLBACK_PENDING_CLUSTERING_ON_CONFLICT.key(), "true")
      .option(UPDATES_STRATEGY.key(), "org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy")
      .option(MARKERS_TYPE.key(), MarkerType.DIRECT.toString)
      .mode(SaveMode.Append)
      .save(basePath) // basePath: the table's base path (assumed; the original snippet ends at .mode)
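
To confirm whether a clustering run produced duplicates, one rough check is to group a partition's base files by file group ID and instant time, since Hudi names base files <fileId>_<writeToken>_<instantTime>.parquet. The sketch below assumes a locally mounted partition path passed as the first argument; for S3 you would list files via the Hadoop FileSystem API instead:

    import java.io.File

    // Rough duplicate check: more than one base file per (fileId, instant).
    object DuplicateFileGroupCheck {
      def main(args: Array[String]): Unit = {
        val partitionDir = new File(args(0)) // path to one table partition
        val parquet = partitionDir.listFiles().filter(_.getName.endsWith(".parquet"))

        parquet
          .groupBy { f =>
            val parts = f.getName.stripSuffix(".parquet").split("_")
            (parts(0), parts(2)) // (fileId, instantTime)
          }
          .collect { case (key, files) if files.length > 1 => (key, files) }
          .foreach { case ((fileId, instant), files) =>
            println(s"file group $fileId, instant $instant has ${files.length} base files:")
            files.foreach(f => println(s"  ${f.getName}"))
          }
      }
    }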

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

1 reaction
nsivabalan commented, Sep 1, 2022

Here is the fix: https://github.com/apache/hudi/pull/6561. Can you verify that with the patch you no longer see such duplicates?

0 reactions
nsivabalan commented, Sep 6, 2022

thanks!


