[SUPPORT] S3 throttling while loading a table written with "hoodie.metadata.enable" = true
Describe the problem you faced
Our Hudi data lake is heavily partitioned by datasource, year, and month.
We have 1000 datasources currently loaded into the lake and are looking to load 1000 more over 2 bulk_insert batches. Our Hudi data lake is maintained by a Java application with custom schema validation logic, which calls spark.read.format("hudi").load(existingTable)
to retrieve the existing table's schema for comparison with the incoming data.
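For context, a minimal sketch of that schema check (Scala; variable and method names are illustrative, not our exact code):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

def validateSchema(spark: SparkSession, existingTablePath: String, incoming: DataFrame): Unit = {
  // This load() on the existing table is what triggers the S3 listing in question.
  val existingSchema = spark.read.format("hudi").load(existingTablePath).schema

  // Simple structural comparison; our real validation logic is more involved.
  if (existingSchema != incoming.schema) {
    throw new IllegalStateException(
      s"Incoming schema does not match existing table schema: $existingSchema vs ${incoming.schema}")
  }
}
```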
When trying to create our Hudi data lake on 0.7 with no metadata table, we observed S3 slow down errors during the stage "Listing leaf files and directories for xxx paths". We upgraded to Hudi 0.9, enabled the metadata table, and restarted our experiment. However, we still observe intermittent S3 throttling when calling load() on the existing table.
Hardware:
- Master: 1 x m6g.8xlarge
- Core: 48 x r5.8xlarge
To Reproduce
Steps to reproduce the behavior:
- Create a large, heavily partitioned, COW Hudi table with metadata enabled
- Our table currently contains 7k+ paths, 216k objects, and 3.5 TB of compressed parquet
- Load the table with the hardware above
- Observe S3 throttling issue
Expected behavior
I expected the S3 throttling not to occur, as it is documented that the purpose of the metadata table is to prevent these file-listing issues on reads.
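For clarity, this is what I would expect to opt the reader into metadata-based listing, assuming hoodie.metadata.enable is also honored as a read option (please correct me if this is not the intended usage):

```scala
// Sketch only: assumes "hoodie.metadata.enable" is honored on the read path,
// so partition/file listing goes through the metadata table instead of the
// recursive S3 listing (the InMemoryFileIndex path visible in the stacktrace below).
val df = spark.read
  .format("hudi")
  .option("hoodie.metadata.enable", "true")
  .load(existingTablePath)
```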
Environment Description
- Hudi version : 0.9.0-amzn-0 (EMR 5.34.0)
- Spark version : 2.4.8
- Hive version : 2.3.8
- Hadoop version : 2.10.1
- Storage (HDFS/S3/GCS…) : S3
- Running on Docker? (yes/no) : no
Additional context
- Write operation type for all records so far: bulk_insert
- Metadata table enabled: hoodie.metadata.enable = true
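For reference, a sketch of our write path (Scala; the record key, partition fields, table name, and DataFrame below are placeholders, not our exact configuration — the settings relevant to this issue are the operation type and hoodie.metadata.enable):

```scala
// Illustrative bulk_insert write; "incomingDf" is a hypothetical incoming DataFrame.
incomingDf.write
  .format("hudi")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.datasource.write.recordkey.field", "id")                        // placeholder key field
  .option("hoodie.datasource.write.partitionpath.field", "datasource,year,month") // matches our partitioning
  .option("hoodie.table.name", "my_table")                                         // placeholder table name
  .option("hoodie.metadata.enable", "true")
  .mode("append")
  .save(existingTablePath)
```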
Stacktrace
Job aborted due to stage failure: Task 1272 in stage 63.0 failed 4 times, most recent failure: Lost task 1272.3 in stage 63.0 (TID 185468, ip-xx-xx-5-238.ec2.internal, executor 11): java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: xx; S3 Extended Request ID: xx; Proxy: null), S3 Extended Request ID: xx
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:421)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:654)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:625)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:473)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$listLeafFiles(InMemoryFileIndex.scala:320)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$3$$anonfun$apply$2.apply(InMemoryFileIndex.scala:218)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$3$$anonfun$apply$2.apply(InMemoryFileIndex.scala:217)
at scala.collection.immutable.Stream.map(Stream.scala:418)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$3.apply(InMemoryFileIndex.scala:217)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$3.apply(InMemoryFileIndex.scala:215)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:417)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: xx; S3 Extended Request ID: xx; Proxy: null), S3 Extended Request ID: xx
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1862)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1415)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1384)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:559)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:539)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5445)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5392)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5386)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.listObjectsV2(AmazonS3Client.java:971)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.ListObjectsV2Call.perform(ListObjectsV2Call.java:26)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.ListObjectsV2Call.perform(ListObjectsV2Call.java:12)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor$CallPerformer.call(GlobalS3Executor.java:108)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:135)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:191)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:186)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.listObjectsV2(AmazonS3LiteClient.java:75)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:412)
... 25 more
Note: This does not happen every time, and it appears to occur more frequently when more executors are provided. However, if I lower the number of executors, the queries take far too long for data ingestion.
I’m willing to provide additional information and configuration as required. We are trying to understand the root cause of this S3 throttling, as it intermittently breaks the pipeline.
Top GitHub Comments
Hi @nsivabalan - we had to table this one until EMR released Hudi 0.11. We now have access to EMR 6.7, which supports Hudi 0.11, but no EMR 5.x version (Spark 2) does. We’re porting our code to Spark 3 and will test the metadata table performance then. Apologies for the delay, and thank you for the support.
@noahtaite : going ahead and closing this one for now. Feel free to raise a new issue if you are looking for further assistance.