[SUPPORT] Auto-clean doesn't work

I’m trying to use Hudi with Spark on EMR. Everything works when I run a batch job on S3 data, but when I run it on a Kinesis stream, it creates tens of versions of the output file and never removes them.

To Reproduce

This is my code

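  import org.apache.hudi.DataSourceWriteOptions
  import org.apache.hudi.config.{HoodieCompactionConfig, HoodieStorageConfig, HoodieWriteConfig}
  import org.apache.spark.sql.SaveMode

  val hudiOptions = Map[String,String](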
    DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
    DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
    HoodieStorageConfig.PARQUET_COMPRESSION_CODEC -> "snappy",
    HoodieCompactionConfig.AUTO_CLEAN_PROP -> "true",
    HoodieCompactionConfig.CLEANER_FILE_VERSIONS_RETAINED_PROP -> "1",
    HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP -> "1",
    DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",
    "hoodie.upsert.shuffle.parallelism" -> "5",
    HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES -> (512 * 1024 * 1024).toString,
    "hoodie.combine.before.insert" -> "true",
    DataSourceWriteOptions.INSERT_DROP_DUPS_OPT_KEY -> "true"
  )

  dataframe
    .write
    .format("org.apache.hudi")
    .options(hudiOptions)
    .option(HoodieWriteConfig.TABLE_NAME, tableName)
    .option(DataSourceWriteOptions.TABLE_NAME_OPT_KEY, tableName)
    .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionPathField)
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKeyField)
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, precombineFieldKey)
    .mode(SaveMode.Append)
    .save(destinationPath)
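
A note on the cleaner settings above, as a minimal sketch rather than a confirmed fix. The key strings below are what I believe the constants in the post resolve to in Hudi 0.6.0; treat them as assumptions and check them against your version. As far as I know, `hoodie.cleaner.fileversions.retained` is only consulted under the `KEEP_LATEST_FILE_VERSIONS` policy, while the default policy, `KEEP_LATEST_COMMITS`, uses `hoodie.cleaner.commits.retained`. Setting the policy explicitly makes the intent unambiguous. The name `cleanerOptions` is hypothetical.

```scala
// Cleaner-related options written as plain strings (no Hudi imports needed).
// Assumption: these keys match the constants used in the post for Hudi 0.6.0.
val cleanerOptions: Map[String, String] = Map(
  "hoodie.clean.automatic" -> "true",                      // run the cleaner after each commit
  "hoodie.cleaner.policy" -> "KEEP_LATEST_FILE_VERSIONS",  // choose the policy explicitly
  "hoodie.cleaner.fileversions.retained" -> "1"            // keep one version per file group
)
```

These could be merged into the existing options with `hudiOptions ++ cleanerOptions`. This only spells out the configuration; it does not address concurrent writers, which the comments further down point to.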

Environment Description

  • Hudi version: 0.6.0

  • Spark version: 2.4.4

  • Hive version: 2.3.7

  • Hadoop version: 2.10

  • Storage (HDFS/S3/GCS…): S3

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

bvaradar commented, Oct 16, 2020

halkar commented, Oct 16, 2020

@bvaradar thanks for confirming. Are there any plans to support concurrent writes? I’ll try to change the logic so that it doesn’t do concurrent writes.
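
One way to avoid concurrent writes, sketched under assumptions: nothing in the post shows how the concurrency arises, but if the job issues writes from several threads, every write could be funneled through a single-threaded executor so that only one commit is in flight at a time. `SerializedWriter` and `submit` are hypothetical names, and the `write` argument stands in for the blocking `dataframe.write ... .save(destinationPath)` call from the post.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical helper: serializes writes issued from multiple threads.
object SerializedWriter {
  // One worker thread, so submitted writes run one at a time, in submission order.
  private val pool = Executors.newSingleThreadExecutor()
  private implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(pool)

  // Queue a write; `write` is evaluated on the single worker thread.
  def submit[T](write: => T): Future[T] = Future(write)

  // Stop accepting new writes; already queued writes still finish.
  def shutdown(): Unit = pool.shutdown()
}
```

A caller would wrap each save, for example `SerializedWriter.submit { df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(destinationPath) }`, and wait on the returned `Future`. This only serializes writes inside one JVM; separate jobs writing to the same table would still need external coordination.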
