Setting "hoodie.parquet.max.file.size" to a value >= 2 GiB leads to no data being generated
Environment: EMR cluster version 5.27 using hudi-spark-bundle-0.5.1-SNAPSHOT.jar.
“spark.executor.memory” is configured as 6018M.
Running the following code, with “hoodie.parquet.max.file.size” set to 2 * 1024 * 1024 * 1024 (2 GiB), generates no data:
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// Read the public ELB sample logs, then write them out as a Hudi copy-on-write table.
val inputDF = spark.read.format("parquet").load("s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1/")
inputDF.write
  .format("org.apache.hudi")
  // The setting under test; the product is evaluated before being converted to a String.
  .option("hoodie.parquet.max.file.size", String.valueOf(2 * 1024 * 1024 * 1024))
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "request_ip")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "request_verb")
  .option(HoodieWriteConfig.TABLE_NAME, "elb_logs_hudi_cow")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "request_timestamp")
  .mode(SaveMode.Overwrite)
  .save("s3://my-bucket/prefix")
All the stages and tasks complete successfully, with no OOM or error messages printed in the log. There is a commit file in the .hoodie folder, but it is also empty.
Changing the option “hoodie.parquet.max.file.size” to 1024 * 1024 * 1024 (1GiB) makes everything work as expected.
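One detail of the reproduction may be relevant here, although the report itself does not state a root cause. The expression 2 * 1024 * 1024 * 1024 is evaluated in 32-bit Int arithmetic in Scala, and 2 GiB is exactly one more than Int.MaxValue, so the product wraps to a negative number before String.valueOf ever sees it. The sketch below only demonstrates that arithmetic; the object name MaxFileSizeOverflow is hypothetical, and how Hudi reacts to a negative max file size is an assumption, not something confirmed in this thread.

```scala
// Hypothetical helper object, used only to illustrate the Int overflow.
object MaxFileSizeOverflow {
  // All operands are Int literals, so the product wraps around at 2^31.
  val asInt: Int = 2 * 1024 * 1024 * 1024

  // Starting from a Long literal promotes the whole product to 64 bits.
  val asLong: Long = 2L * 1024 * 1024 * 1024

  // 1 GiB still fits in an Int, which matches the working configuration.
  val oneGiB: Int = 1024 * 1024 * 1024

  def main(args: Array[String]): Unit = {
    println(String.valueOf(asInt))  // prints -2147483648
    println(String.valueOf(asLong)) // prints 2147483648
    println(String.valueOf(oneGiB)) // prints 1073741824
  }
}
```

If this is what happened, the option value sent to Hudi was "-2147483648" rather than "2147483648". Writing the product as 2L * 1024 * 1024 * 1024 would pass the intended number; whether this Hudi version then accepts a value that large is not confirmed by the report.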
Issue Analytics
- Created 4 years ago
- Comments: 8 (5 by maintainers)
Top GitHub Comments
Nonetheless, bounds checking on configs needs to be improved still 😃
Np. I did not anticipate that either 😃. Added to https://issues.apache.org/jira/browse/HUDI-89. Closing this.