Too Many Requests with AWS Batch backend
See original GitHub issue
- Backend: AWS
- Cromwell: 36
When running certain highly parallel WDL workflows, I get the error cromwell.core.CromwellFatalException: software.amazon.awssdk.services.batch.model.BatchException: Too Many Requests (Service: null; Status Code: 429; Request ID: cffe6e45-d66c-11e8-a1df-05402551b0ba). The specific case where this happens is the gatk3-data-processing workflow, when running the ApplyBQSR task, which is scattered in parallel over a set of calculated intervals.
The full error trace I get is:
2018-10-23 02:39:07,631 cromwell-system-akka.dispatchers.backend-dispatcher-53345 ERROR - AwsBatchAsyncBackendJobExecutionActor [UUID(6d97fef4)GPPW.ApplyBQSR:15:1]: Error attempting to Execute
software.amazon.awssdk.services.batch.model.BatchException: Too Many Requests (Service: null; Status Code: 429; Request ID: cfc6e34e-d66c-11e8-be0b-dd778498cf15)
at software.amazon.awssdk.core.http.pipeline.stages.HandleResponseStage.handleErrorResponse(HandleResponseStage.java:114)
at software.amazon.awssdk.core.http.pipeline.stages.HandleResponseStage.handleResponse(HandleResponseStage.java:72)
at software.amazon.awssdk.core.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:57)
at software.amazon.awssdk.core.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:40)
at software.amazon.awssdk.core.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:239)
at software.amazon.awssdk.core.http.pipeline.stages.TimerExceptionHandlingStage.execute(TimerExceptionHandlingStage.java:40)
at software.amazon.awssdk.core.http.pipeline.stages.TimerExceptionHandlingStage.execute(TimerExceptionHandlingStage.java:30)
at software.amazon.awssdk.core.http.pipeline.stages.RetryableStage$RetryExecutor.doExecute(RetryableStage.java:139)
at software.amazon.awssdk.core.http.pipeline.stages.RetryableStage$RetryExecutor.execute(RetryableStage.java:105)
at software.amazon.awssdk.core.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:66)
at software.amazon.awssdk.core.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:47)
at software.amazon.awssdk.core.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:239)
at software.amazon.awssdk.core.http.StreamManagingStage.execute(StreamManagingStage.java:56)
at software.amazon.awssdk.core.http.StreamManagingStage.execute(StreamManagingStage.java:42)
at software.amazon.awssdk.core.http.pipeline.stages.ClientExecutionTimedStage.executeWithTimer(ClientExecutionTimedStage.java:71)
at software.amazon.awssdk.core.http.pipeline.stages.ClientExecutionTimedStage.execute(ClientExecutionTimedStage.java:55)
at software.amazon.awssdk.core.http.pipeline.stages.ClientExecutionTimedStage.execute(ClientExecutionTimedStage.java:39)
at software.amazon.awssdk.core.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:239)
at software.amazon.awssdk.core.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:239)
at software.amazon.awssdk.core.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:35)
at software.amazon.awssdk.core.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:24)
at software.amazon.awssdk.core.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:281)
at software.amazon.awssdk.core.client.SyncClientHandlerImpl.doInvoke(SyncClientHandlerImpl.java:149)
at software.amazon.awssdk.core.client.SyncClientHandlerImpl.invoke(SyncClientHandlerImpl.java:131)
at software.amazon.awssdk.core.client.SyncClientHandlerImpl.execute(SyncClientHandlerImpl.java:100)
at software.amazon.awssdk.core.client.SyncClientHandlerImpl.execute(SyncClientHandlerImpl.java:76)
at software.amazon.awssdk.core.client.SdkClientHandler.execute(SdkClientHandler.java:45)
at software.amazon.awssdk.services.batch.DefaultBatchClient.registerJobDefinition(DefaultBatchClient.java:644)
at cromwell.backend.impl.aws.AwsBatchJob.$anonfun$createDefinition$2(AwsBatchJob.scala:198)
at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:85)
at cats.effect.internals.IORunLoop$.startCancelable(IORunLoop.scala:41)
at cats.effect.internals.IOBracket$BracketStart.run(IOBracket.scala:74)
at cats.effect.internals.Trampoline.cats$effect$internals$Trampoline$$immediateLoop(Trampoline.scala:70)
at cats.effect.internals.Trampoline.startLoop(Trampoline.scala:36)
at cats.effect.internals.TrampolineEC$JVMTrampoline.super$startLoop(TrampolineEC.scala:93)
at cats.effect.internals.TrampolineEC$JVMTrampoline.$anonfun$startLoop$1(TrampolineEC.scala:93)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
at cats.effect.internals.TrampolineEC$JVMTrampoline.startLoop(TrampolineEC.scala:93)
at cats.effect.internals.Trampoline.execute(Trampoline.scala:43)
at cats.effect.internals.TrampolineEC.execute(TrampolineEC.scala:44)
at cats.effect.internals.IOBracket$BracketStart.apply(IOBracket.scala:60)
at cats.effect.internals.IOBracket$BracketStart.apply(IOBracket.scala:41)
at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:134)
at cats.effect.internals.IORunLoop$.start(IORunLoop.scala:34)
at cats.effect.internals.IOBracket$.$anonfun$apply$1(IOBracket.scala:36)
at cats.effect.internals.IOBracket$.$anonfun$apply$1$adapted(IOBracket.scala:33)
at cats.effect.internals.IORunLoop$RestartCallback.start(IORunLoop.scala:328)
at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:117)
at cats.effect.internals.IORunLoop$.start(IORunLoop.scala:34)
at cats.effect.IO.unsafeRunAsync(IO.scala:258)
at cats.effect.IO.unsafeToFuture(IO.scala:345)
at cromwell.backend.impl.aws.AwsBatchAsyncBackendJobExecutionActor.executeAsync(AwsBatchAsyncBackendJobExecutionActor.scala:342)
at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover(StandardAsyncExecutionActor.scala:943)
at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover$(StandardAsyncExecutionActor.scala:935)
at cromwell.backend.impl.aws.AwsBatchAsyncBackendJobExecutionActor.executeOrRecover(AwsBatchAsyncBackendJobExecutionActor.scala:74)
at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustExecuteOrRecover$1(AsyncBackendJobExecutionActor.scala:65)
at cromwell.core.retry.Retry$.withRetry(Retry.scala:38)
at cromwell.backend.async.AsyncBackendJobExecutionActor.withRetry(AsyncBackendJobExecutionActor.scala:61)
at cromwell.backend.async.AsyncBackendJobExecutionActor.cromwell$backend$async$AsyncBackendJobExecutionActor$$robustExecuteOrRecover(AsyncBackendJobExecutionActor.scala:65)
at cromwell.backend.async.AsyncBackendJobExecutionActor$$anonfun$receive$1.applyOrElse(AsyncBackendJobExecutionActor.scala:88)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at cromwell.backend.impl.aws.AwsBatchAsyncBackendJobExecutionActor.aroundReceive(AwsBatchAsyncBackendJobExecutionActor.scala:74)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588)
at akka.actor.ActorCell.invoke(ActorCell.scala:557)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I understand why this is happening: Cromwell is simply sending too many requests to AWS Batch in a short space of time. However, the limit is apparently 500 requests per second (https://forums.aws.amazon.com/thread.jspa?messageID=708581), so perhaps Cromwell is doing something unusual.
In the short term, I think the best solution is to catch this exception, sleep for a while, and then continue sending requests.
Issue Analytics
- State:
- Created: 5 years ago
- Comments: 41 (16 by maintainers)
Top GitHub Comments
Simple workaround: add the following block to the configuration file. It drastically improved the situation for me.
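The configuration block itself was not reproduced in the comment above. Based on the follow-up discussion of concurrent-job-limit, it was presumably something along these lines. Treat this as a hedged sketch: the provider name (AWSBatch here) and the exact nesting are assumptions that must match whatever your own Cromwell configuration declares, and the value 8 is the one reported later in this thread.

```hocon
backend {
  providers {
    # Assumption: the AWS Batch provider is declared under this name;
    # substitute the provider name used in your own configuration.
    AWSBatch {
      config {
        # Cap how many jobs Cromwell runs at once, which indirectly
        # throttles its calls to the AWS Batch API.
        concurrent-job-limit = 8
      }
    }
  }
}
```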
So, setting concurrent-job-limit to 8 did resolve my issue (though I then hit another, unrelated issue instead). However, I don't believe this is the ideal way to resolve it. I (and, I assume, most other users) want maximum concurrency for our jobs; we just want to avoid this error. If Cromwell caught the Too Many Requests error and simply waited a few seconds before retrying the failed requests, it would resolve this issue in a cleaner way.
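The catch-and-retry approach suggested here can be sketched generically. This is only an illustration of the idea, not Cromwell's actual code: retryWithBackoff, the isThrottle message check, and the backoff policy are all invented for this example.

```scala
import scala.concurrent.duration._
import scala.util.{Try, Success, Failure}

// Hedged sketch of catch-sleep-retry with exponential backoff for
// 429 "Too Many Requests" responses. Not Cromwell's implementation.
object Backoff {
  // Assumption for this sketch: treat any exception whose message
  // mentions a 429 status as a throttling error worth retrying.
  private def isThrottle(t: Throwable): Boolean =
    Option(t.getMessage).exists(_.contains("429"))

  def retryWithBackoff[A](maxAttempts: Int, delay: FiniteDuration)(op: => A): A =
    Try(op) match {
      case Success(a) => a
      case Failure(t) if isThrottle(t) && maxAttempts > 1 =>
        Thread.sleep(delay.toMillis)                      // back off before retrying
        retryWithBackoff(maxAttempts - 1, delay * 2)(op)  // double the delay each time
      case Failure(t) => throw t                          // non-throttle errors propagate
    }
}
```

A submission call would then be wrapped as, say, Backoff.retryWithBackoff(5, 1.second) { /* hypothetical registerJobDefinition/submit call */ }, so that transient 429s are absorbed instead of failing the job outright.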