Setting "hoodie.parquet.max.file.size" to a value >= 2 GiB leads to no data being generated

Environment: EMR cluster version 5.27 using hudi-spark-bundle-0.5.1-SNAPSHOT.jar.
“spark.executor.memory” is configured as 6018M.

Running the following code with “hoodie.parquet.max.file.size” set to 2 * 1024 * 1024 * 1024 (2 GiB) generates no data:

import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// Read the sample ELB logs (Parquet) from S3.
val inputDF = spark.read.format("parquet").load("s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1/")

// Insert into a Hudi table, overwriting the target path, with the max file size option set.
inputDF.write
  .format("org.apache.hudi")
  .option("hoodie.parquet.max.file.size", String.valueOf(2 * 1024 * 1024 * 1024))
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "request_ip")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "request_verb")
  .option(HoodieWriteConfig.TABLE_NAME, "elb_logs_hudi_cow")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "request_timestamp")
  .mode(SaveMode.Overwrite)
  .save("s3://my-bucket/prefix")

All the stages and tasks complete successfully, with no OOM or error messages printed in the log. There is a commit file in the .hoodie folder, but it is also empty.

Changing the option “hoodie.parquet.max.file.size” to 1024 * 1024 * 1024 (1GiB) makes everything work as expected.
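
One possible factor, independent of Hudi itself: in Scala (as in Java), the expression 2 * 1024 * 1024 * 1024 is evaluated in 32-bit Int arithmetic and wraps around to a negative number, whereas 1024 * 1024 * 1024 still fits. The sketch below only demonstrates that arithmetic. Whether this is what caused the empty output here is an assumption, not something the report confirms, and the object and value names are purely illustrative.

```scala
object MaxFileSizeOverflow {
  // Same expression as in the report: all operands are Int, so 2^31 wraps around.
  val twoGiBAsInt: Int = 2 * 1024 * 1024 * 1024

  // 2^30 still fits in a signed 32-bit Int.
  val oneGiBAsInt: Int = 1024 * 1024 * 1024

  // A Long literal forces 64-bit arithmetic for the whole product.
  val twoGiBAsLong: Long = 2L * 1024 * 1024 * 1024

  def main(args: Array[String]): Unit = {
    println(String.valueOf(twoGiBAsInt))  // -2147483648
    println(String.valueOf(oneGiBAsInt))  // 1073741824
    println(String.valueOf(twoGiBAsLong)) // 2147483648
  }
}
```

Writing the value as String.valueOf(2L * 1024 * 1024 * 1024) would pass 2147483648 to the option instead of a negative number. Whether Hudi then handles a value of that size correctly is not established in this thread.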

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
vinothchandar commented, Oct 25, 2019

Nonetheless, bounds checking on configs needs to be improved still 😃

0 reactions
vinothchandar commented, Oct 26, 2019

Np. I did not anticipate that either 😃. Added to https://issues.apache.org/jira/browse/HUDI-89 and closing this.
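
As a rough illustration of the kind of bounds checking mentioned in the comments, a size option could be validated before use so that a non-numeric or non-positive value fails loudly instead of silently producing no data. This is a minimal sketch only: MaxFileSizeCheck and parseMaxFileSize are hypothetical names, not part of Hudi, and the rule "must be a positive number" is an assumption about what a sensible bound would be.

```scala
object MaxFileSizeCheck {
  // Hypothetical validator (not a Hudi API): parses the raw option string
  // and rejects anything that is not a positive 64-bit integer.
  def parseMaxFileSize(raw: String): Either[String, Long] =
    scala.util.Try(raw.trim.toLong).toOption match {
      case None              => Left(s"not a valid number: '$raw'")
      case Some(n) if n <= 0 => Left(s"must be positive, got $n")
      case Some(n)           => Right(n)
    }
}
```

With this sketch, parseMaxFileSize("-2147483648") returns a Left carrying an error message, while parseMaxFileSize("1073741824") returns Right(1073741824).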
