[SUPPORT] Auto-clean doesn't work
I’m trying to use Hudi with Spark on EMR. Everything works when I run a batch job over S3 data. But when I run it against a Kinesis stream, it creates tens of versions of the output file and never removes them.
To Reproduce
This is my code
val hudiOptions = Map[String,String](
  DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
  DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
  HoodieStorageConfig.PARQUET_COMPRESSION_CODEC -> "snappy",
  HoodieCompactionConfig.AUTO_CLEAN_PROP -> "true",
  HoodieCompactionConfig.CLEANER_FILE_VERSIONS_RETAINED_PROP -> "1",
  HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP -> "1",
  DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",
  "hoodie.upsert.shuffle.parallelism" -> "5",
  HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES -> (512 * 1024 * 1024).toString,
  "hoodie.combine.before.insert" -> "true",
  DataSourceWriteOptions.INSERT_DROP_DUPS_OPT_KEY -> "true"
)
dataframe
  .write
  .format("org.apache.hudi")
  .options(hudiOptions)
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.TABLE_NAME_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionPathField)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKeyField)
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, precombineFieldKey)
  .mode(SaveMode.Append)
  .save(destinationPath)
Environment Description
- Hudi version: 0.6.0
- Spark version: 2.4.4
- Hive version: 2.3.7
- Hadoop version: 2.10
- Storage (HDFS/S3/GCS…): S3
Issue Analytics
- State:
- Created 3 years ago
- Comments:10 (4 by maintainers)
Top GitHub Comments
@halkar : Yes, https://issues.apache.org/jira/browse/HUDI-845 tracks it
@bvaradar thanks for confirming. Are there any plans to support concurrent writes? I’ll try to change the logic to avoid concurrent writes.
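For anyone hitting the same problem, here is a minimal sketch of one way to avoid overlapping writes when they all originate from the same Spark driver. It assumes the concurrent writes come from multiple threads in one JVM; `SerializedWriter` and `writeExclusively` are hypothetical names, not part of Hudi. It will not help if several separate Spark applications write to the same table.

```scala
import java.util.concurrent.locks.ReentrantLock

// Hypothetical helper (not a Hudi API): lets only one write run at a time
// within a single driver JVM.
object SerializedWriter {
  private val lock = new ReentrantLock()

  // Runs `write` while holding the lock, so a second caller blocks until
  // the first write (including any cleaning it triggers) has returned.
  def writeExclusively[A](write: => A): A = {
    lock.lock()
    try write
    finally lock.unlock()
  }
}
```

Usage would be to wrap the existing write call, for example `SerializedWriter.writeExclusively { dataframe.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(destinationPath) }`.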