Setting "hoodie.parquet.max.file.size" to a value >= 2 GiB leads to no data being generated

Environment: EMR cluster version 5.27 using hudi-spark-bundle-0.5.1-SNAPSHOT.jar.
“spark.executor.memory” is configured as 6018M.

Running the following code with “hoodie.parquet.max.file.size” set to 2 * 1024 * 1024 * 1024 (2 GiB) generates no data:

import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// Read one day of the public ELB sample logs
val inputDF = spark.read.format("parquet").load("s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1/")

// Write as a Hudi table; the max file size below is the setting that produces no data
inputDF.write
  .format("org.apache.hudi")
  .option("hoodie.parquet.max.file.size", String.valueOf(2 * 1024 * 1024 * 1024))
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "request_ip")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "request_verb")
  .option(HoodieWriteConfig.TABLE_NAME, "elb_logs_hudi_cow")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "request_timestamp")
  .mode(SaveMode.Overwrite)
  .save("s3://my-bucket/prefix")

All the stages and tasks complete successfully, with no OOM or error messages printed in the log. There is a commit file in the .hoodie folder, but it is also empty.

Changing the option “hoodie.parquet.max.file.size” to 1024 * 1024 * 1024 (1GiB) makes everything work as expected.
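
One thing worth checking when comparing the two settings, offered as an observation about Scala arithmetic rather than a diagnosis confirmed in this thread: an expression made only of Int literals is evaluated as an Int, so 2 * 1024 * 1024 * 1024 wraps around instead of yielding 2147483648. The sketch below uses plain Scala with no Spark or Hudi; the object name MaxFileSizeCheck is made up for illustration.

```scala
// Sketch only: shows how the size expression evaluates, independent of Hudi.
object MaxFileSizeCheck {
  // All operands are Int, so the product wraps around at 2^31.
  val asInt: Int = 2 * 1024 * 1024 * 1024

  // A Long first operand promotes the whole expression to Long.
  val asLong: Long = 2L * 1024 * 1024 * 1024

  def main(args: Array[String]): Unit = {
    println(String.valueOf(asInt))              // -2147483648
    println(String.valueOf(asLong))             // 2147483648
    println(String.valueOf(1024 * 1024 * 1024)) // 1073741824, still fits in an Int
  }
}
```

If this is what is happening, the string passed to the option in the failing run would be a negative number, while the 1 GiB expression stays within Int range. Writing the first operand as 2L is one way to pass the intended value.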

Issue Analytics

Created 4 years ago
Comments: 8 (5 by maintainers)

Top GitHub Comments

vinothchandar commented, Oct 25, 2019

Nonetheless, bounds checking on configs still needs to be improved 😃

vinothchandar commented, Oct 26, 2019

Np. I did not anticipate that either 😃. Added to https://issues.apache.org/jira/browse/HUDI-89. Closing this.
