Too Many Requests with AWS Batch backend

  • Backend: AWS
  • Cromwell: 36

When running certain highly parallel WDL workflows, I’m getting the error cromwell.core.CromwellFatalException: software.amazon.awssdk.services.batch.model.BatchException: Too Many Requests (Service: null; Status Code: 429; Request ID: cffe6e45-d66c-11e8-a1df-05402551b0ba). The specific case where this happens is in the gatk3-data-processing workflow, when running the ApplyBQSR task, which is run in parallel over some calculated intervals.

The full error trace I get is:

2018-10-23 02:39:07,631 cromwell-system-akka.dispatchers.backend-dispatcher-53345 ERROR - AwsBatchAsyncBackendJobExecutionActor [UUID(6d97fef4)GPPW.ApplyBQSR:15:1]: Error attempting to Execute
software.amazon.awssdk.services.batch.model.BatchException: Too Many Requests (Service: null; Status Code: 429; Request ID: cfc6e34e-d66c-11e8-be0b-dd778498cf15)
        at software.amazon.awssdk.core.http.pipeline.stages.HandleResponseStage.handleErrorResponse(HandleResponseStage.java:114)
        at software.amazon.awssdk.core.http.pipeline.stages.HandleResponseStage.handleResponse(HandleResponseStage.java:72)
        at software.amazon.awssdk.core.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:57)
        at software.amazon.awssdk.core.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:40)
        at software.amazon.awssdk.core.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:239)
        at software.amazon.awssdk.core.http.pipeline.stages.TimerExceptionHandlingStage.execute(TimerExceptionHandlingStage.java:40)
        at software.amazon.awssdk.core.http.pipeline.stages.TimerExceptionHandlingStage.execute(TimerExceptionHandlingStage.java:30)
        at software.amazon.awssdk.core.http.pipeline.stages.RetryableStage$RetryExecutor.doExecute(RetryableStage.java:139)
        at software.amazon.awssdk.core.http.pipeline.stages.RetryableStage$RetryExecutor.execute(RetryableStage.java:105)
        at software.amazon.awssdk.core.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:66)
        at software.amazon.awssdk.core.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:47)
        at software.amazon.awssdk.core.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:239)
        at software.amazon.awssdk.core.http.StreamManagingStage.execute(StreamManagingStage.java:56)
        at software.amazon.awssdk.core.http.StreamManagingStage.execute(StreamManagingStage.java:42)
        at software.amazon.awssdk.core.http.pipeline.stages.ClientExecutionTimedStage.executeWithTimer(ClientExecutionTimedStage.java:71)
        at software.amazon.awssdk.core.http.pipeline.stages.ClientExecutionTimedStage.execute(ClientExecutionTimedStage.java:55)
        at software.amazon.awssdk.core.http.pipeline.stages.ClientExecutionTimedStage.execute(ClientExecutionTimedStage.java:39)
        at software.amazon.awssdk.core.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:239)
        at software.amazon.awssdk.core.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:239)
        at software.amazon.awssdk.core.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:35)
        at software.amazon.awssdk.core.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:24)
        at software.amazon.awssdk.core.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:281)
        at software.amazon.awssdk.core.client.SyncClientHandlerImpl.doInvoke(SyncClientHandlerImpl.java:149)
        at software.amazon.awssdk.core.client.SyncClientHandlerImpl.invoke(SyncClientHandlerImpl.java:131)
        at software.amazon.awssdk.core.client.SyncClientHandlerImpl.execute(SyncClientHandlerImpl.java:100)
        at software.amazon.awssdk.core.client.SyncClientHandlerImpl.execute(SyncClientHandlerImpl.java:76)
        at software.amazon.awssdk.core.client.SdkClientHandler.execute(SdkClientHandler.java:45)
        at software.amazon.awssdk.services.batch.DefaultBatchClient.registerJobDefinition(DefaultBatchClient.java:644)
        at cromwell.backend.impl.aws.AwsBatchJob.$anonfun$createDefinition$2(AwsBatchJob.scala:198)
        at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:85)
        at cats.effect.internals.IORunLoop$.startCancelable(IORunLoop.scala:41)
        at cats.effect.internals.IOBracket$BracketStart.run(IOBracket.scala:74)
        at cats.effect.internals.Trampoline.cats$effect$internals$Trampoline$$immediateLoop(Trampoline.scala:70)
        at cats.effect.internals.Trampoline.startLoop(Trampoline.scala:36)
        at cats.effect.internals.TrampolineEC$JVMTrampoline.super$startLoop(TrampolineEC.scala:93)
        at cats.effect.internals.TrampolineEC$JVMTrampoline.$anonfun$startLoop$1(TrampolineEC.scala:93)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
        at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
        at cats.effect.internals.TrampolineEC$JVMTrampoline.startLoop(TrampolineEC.scala:93)
        at cats.effect.internals.Trampoline.execute(Trampoline.scala:43) 
        at cats.effect.internals.TrampolineEC.execute(TrampolineEC.scala:44)
        at cats.effect.internals.IOBracket$BracketStart.apply(IOBracket.scala:60)
        at cats.effect.internals.IOBracket$BracketStart.apply(IOBracket.scala:41)
        at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:134)
        at cats.effect.internals.IORunLoop$.start(IORunLoop.scala:34)
        at cats.effect.internals.IOBracket$.$anonfun$apply$1(IOBracket.scala:36)
        at cats.effect.internals.IOBracket$.$anonfun$apply$1$adapted(IOBracket.scala:33)
        at cats.effect.internals.IORunLoop$RestartCallback.start(IORunLoop.scala:328)
        at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:117)
        at cats.effect.internals.IORunLoop$.start(IORunLoop.scala:34)
        at cats.effect.IO.unsafeRunAsync(IO.scala:258)
        at cats.effect.IO.unsafeToFuture(IO.scala:345)
        at cromwell.backend.impl.aws.AwsBatchAsyncBackendJobExecutionActor.executeAsync(AwsBatchAsyncBackendJobExecutionActor.scala:342)
        at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover(StandardAsyncExecutionActor.scala:943)
        at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover$(StandardAsyncExecutionActor.scala:935)
        at cromwell.backend.impl.aws.AwsBatchAsyncBackendJobExecutionActor.executeOrRecover(AwsBatchAsyncBackendJobExecutionActor.scala:74)
        at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustExecuteOrRecover$1(AsyncBackendJobExecutionActor.scala:65)
        at cromwell.core.retry.Retry$.withRetry(Retry.scala:38)
        at cromwell.backend.async.AsyncBackendJobExecutionActor.withRetry(AsyncBackendJobExecutionActor.scala:61)
        at cromwell.backend.async.AsyncBackendJobExecutionActor.cromwell$backend$async$AsyncBackendJobExecutionActor$$robustExecuteOrRecover(AsyncBackendJobExecutionActor.scala:65)
        at cromwell.backend.async.AsyncBackendJobExecutionActor$$anonfun$receive$1.applyOrElse(AsyncBackendJobExecutionActor.scala:88)
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
        at akka.actor.Actor.aroundReceive(Actor.scala:517)
        at akka.actor.Actor.aroundReceive$(Actor.scala:515)
        at cromwell.backend.impl.aws.AwsBatchAsyncBackendJobExecutionActor.aroundReceive(AwsBatchAsyncBackendJobExecutionActor.scala:74)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588)
        at akka.actor.ActorCell.invoke(ActorCell.scala:557)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
        at akka.dispatch.Mailbox.run(Mailbox.scala:225)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I understand why this is happening: Cromwell is simply sending too many requests to AWS Batch over a short space of time. However, the limit is apparently 500 per second (https://forums.aws.amazon.com/thread.jspa?messageID=708581), so perhaps Cromwell is doing something unusual.

In the short term, I think the best solution is to catch this exception, sleep for a while, and then continue sending requests.
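
To make the "catch, sleep, retry" idea concrete, here is a minimal sketch in Python. It is not Cromwell's code and does not use the AWS SDK: `TooManyRequestsError` and `call_with_backoff` are hypothetical names introduced only for illustration, and the delay values are arbitrary assumptions.

```python
import random
import time


class TooManyRequestsError(Exception):
    """Hypothetical stand-in for a throttling error (HTTP 429) from a remote API."""


def call_with_backoff(request, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call request(), sleeping and retrying whenever it raises TooManyRequestsError."""
    for attempt in range(max_attempts):
        try:
            return request()
        except TooManyRequestsError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttling error
            # Exponential backoff capped at max_delay, with random jitter so
            # that many parallel shards do not all retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The jitter matters in a case like this one, where many scattered shards hit the limit together: without it they would likely all retry at the same moment and be throttled again.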

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 41 (16 by maintainers)

Top GitHub Comments

5 reactions
TimurIs commented, Nov 28, 2018

Simple workaround - add the following block into the configuration file:

system {
  job-rate-control {
    jobs = 1
    per = 1 second
  }
}

Drastically improved the situation for me
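
As an illustration of what a setting like "1 job per 1 second" does to submission timing, here is a small Python sketch of a pacing limiter. It is only an assumption about the effect and says nothing about how Cromwell implements `job-rate-control` internally; `JobRateLimiter` and its parameters are hypothetical names.

```python
import time


class JobRateLimiter:
    """Allow at most `jobs` job starts per `per_seconds` by spacing them evenly."""

    def __init__(self, jobs, per_seconds, clock=time.monotonic, sleep=time.sleep):
        self.interval = per_seconds / jobs  # minimum spacing between job starts
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next job may start

    def acquire(self):
        """Block until the next job is allowed to start."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        # Schedule the following slot one interval after this one.
        self.next_allowed = now + self.interval
```

Calling `acquire()` before each submission caps the request rate up front, at the cost of slower start-up for wide scatters, which is the trade-off discussed in the next comment.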

2 reactions
multimeric commented, Nov 3, 2018

So, setting concurrent-job-limit to 8 did resolve my issue (and I hit another, unrelated issue instead).

However, I don’t believe this is the ideal way to resolve this. I (and I assume most other users) want maximum concurrency for our jobs; we just want to avoid this error. If we caught this Too Many Requests error and waited a few seconds before retrying the requests, it would surely resolve this issue in a cleaner way.
