
[SUPPORT] S3 throttling while loading a table written with "hoodie.metadata.enable" = true

See original GitHub issue

Describe the problem you faced

Our Hudi data lake is heavily partitioned by datasource, year, and month.

We have 1000 datasources currently loaded into the lake and are looking to load 1000 more over two bulk_insert batches. The lake is populated by a Java application with custom schema-validation logic, which calls spark.read.format("hudi").load(existingTable) to retrieve the existing table's schema for comparison with the incoming data.
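
For reference, a minimal sketch of the kind of schema check described above (the helper and names are hypothetical, not the actual application code):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

object SchemaCheckSketch {
  def validateSchema(spark: SparkSession, existingTablePath: String, incoming: DataFrame): Unit = {
    // This read triggers file listing for the table's partitions on S3,
    // which is where the SlowDown errors described below are observed.
    val existingSchema: StructType =
      spark.read.format("hudi").load(existingTablePath).schema

    require(
      existingSchema.fieldNames.sorted.sameElements(incoming.schema.fieldNames.sorted),
      s"Incoming schema does not match the existing table schema at $existingTablePath"
    )
  }
}
```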

When we first tried to create our Hudi data lake on 0.7 with no metadata table, we observed S3 SlowDown errors during the stage "Listing leaf files and directories for xxx paths:". We upgraded to Hudi 0.9, enabled the metadata table, and restarted the experiment. However, we still observe intermittent S3 throttling when calling load() on the existing table.


Hardware:

  • Master: 1 x m6g.8xlarge
  • Core: 48 x r5.8xlarge

To Reproduce

Steps to reproduce the behavior:

  1. Create a large, heavily partitioned COW Hudi table with the metadata table enabled (ours is currently 7k+ paths, 216k objects, and 3.5 TB of compressed Parquet)
  2. Load the table with the hardware above
  3. Observe the S3 throttling issue

Expected behavior

I expected the S3 throttling not to occur, since the documented purpose of the metadata table is to eliminate these file listings on reads.
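
Not something stated in the issue, but for context: on the read side, metadata-table-based listing is driven by the same hoodie.metadata.enable flag, passed as a datasource read option. A hedged sketch follows (the path is a placeholder; whether this option is honoured, or the read falls back to Spark's InMemoryFileIndex as in the stacktrace below, depends on the Hudi version and on whether a glob path is used):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("metadata-read-sketch").getOrCreate()
val existingTablePath = "s3://my-bucket/my-hudi-table" // placeholder; use the table base path (no glob)

// Ask the Hudi datasource to list files via the metadata table instead of S3.
val df = spark.read
  .format("hudi")
  .option("hoodie.metadata.enable", "true")
  .load(existingTablePath)
```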

Environment Description

  • Hudi version : 0.9.0-amzn-0 (EMR 5.34.0)

  • Spark version : 2.4.8

  • Hive version : 2.3.8

  • Hadoop version : 2.10.1

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : no

Additional context

  • Write operation type for all records so far: bulk_insert
  • Metadata table enabled: hoodie.metadata.enable = true
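
For reference, a minimal sketch of the write configuration implied by the context above. The option keys are standard Hudi datasource options; the table name, record key, and output path are placeholders, and the partition fields are inferred from the issue text rather than taken from the actual job configuration:

```scala
// `incomingDf` stands for the DataFrame holding the batch to ingest.
incomingDf.write
  .format("hudi")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.metadata.enable", "true")
  .option("hoodie.table.name", "my_table")                            // placeholder
  .option("hoodie.datasource.write.recordkey.field", "id")            // placeholder
  .option("hoodie.datasource.write.partitionpath.field", "datasource,year,month")
  .mode("append")
  .save("s3://my-bucket/my-hudi-table")                               // placeholder
```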

Stacktrace

Job aborted due to stage failure: Task 1272 in stage 63.0 failed 4 times, most recent failure: Lost task 1272.3 in stage 63.0 (TID 185468, ip-xx-xx-5-238.ec2.internal, executor 11): java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: xx; S3 Extended Request ID: xx; Proxy: null), S3 Extended Request ID: xx
	at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:421)
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:654)
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:625)
	at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:473)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$listLeafFiles(InMemoryFileIndex.scala:320)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$3$$anonfun$apply$2.apply(InMemoryFileIndex.scala:218)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$3$$anonfun$apply$2.apply(InMemoryFileIndex.scala:217)
	at scala.collection.immutable.Stream.map(Stream.scala:418)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$3.apply(InMemoryFileIndex.scala:217)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$3.apply(InMemoryFileIndex.scala:215)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:411)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:417)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: xx; S3 Extended Request ID: xx; Proxy: null), S3 Extended Request ID: xx
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1862)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1415)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1384)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:559)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:539)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5445)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5392)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5386)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.listObjectsV2(AmazonS3Client.java:971)
	at com.amazon.ws.emr.hadoop.fs.s3.lite.call.ListObjectsV2Call.perform(ListObjectsV2Call.java:26)
	at com.amazon.ws.emr.hadoop.fs.s3.lite.call.ListObjectsV2Call.perform(ListObjectsV2Call.java:12)
	at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor$CallPerformer.call(GlobalS3Executor.java:108)
	at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:135)
	at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:191)
	at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:186)
	at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.listObjectsV2(AmazonS3LiteClient.java:75)
	at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:412)
	... 25 more

Note: this does not happen every time, and it appears to occur more frequently when more executors are provided. However, if I lower the number of executors, the queries take far too long for data ingestion.
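
Not something tried in the issue thread, but for context: the failing stage is Spark's own parallel leaf-file listing, whose fan-out grows with the cluster, so more executors mean more concurrent S3 LIST calls. A hedged sketch of the Spark settings that bound that fan-out (values are illustrative only):

```scala
// `spark` is the active SparkSession. These Spark SQL settings control how
// aggressively InMemoryFileIndex distributes the "Listing leaf files and
// directories" job across the cluster.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.parallelism", "100")
```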

I’m willing to provide additional information and configuration as required. We are trying to understand the root cause of this S3 throttling issue, as it breaks the pipeline intermittently.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

1 reaction
noahtaite commented, Sep 29, 2022

Hi @nsivabalan - we had to table this one until EMR released Hudi 0.11. We now have access to EMR 6.7, which supports Hudi 0.11, but no EMR 5.x version (Spark 2) does. We’re porting our code to Spark 3 and will test the metadata table performance then. Apologies for the delay, and thank you for the support.

0 reactions
nsivabalan commented, Nov 4, 2022

@noahtaite : going ahead and closing this one for now. Feel free to raise a new issue if you are looking for further assistance.
