Too Many Requests with AWS Batch backend
See original GitHub issue
- Backend: AWS
- Cromwell: 36
When running certain highly parallel WDL workflows, I get the error cromwell.core.CromwellFatalException: software.amazon.awssdk.services.batch.model.BatchException: Too Many Requests (Service: null; Status Code: 429; Request ID: cffe6e45-d66c-11e8-a1df-05402551b0ba). The specific case where this happens is the gatk3-data-processing workflow, when running the ApplyBQSR task, which is scattered in parallel over a set of calculated intervals.
The full error trace I get is:
2018-10-23 02:39:07,631 cromwell-system-akka.dispatchers.backend-dispatcher-53345 ERROR - AwsBatchAsyncBackendJobExecutionActor [UUID(6d97fef4)GPPW.ApplyBQSR:15:1]: Error attempting to Execute
software.amazon.awssdk.services.batch.model.BatchException: Too Many Requests (Service: null; Status Code: 429; Request ID: cfc6e34e-d66c-11e8-be0b-dd778498cf15)
at software.amazon.awssdk.core.http.pipeline.stages.HandleResponseStage.handleErrorResponse(HandleResponseStage.java:114)
at software.amazon.awssdk.core.http.pipeline.stages.HandleResponseStage.handleResponse(HandleResponseStage.java:72)
at software.amazon.awssdk.core.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:57)
at software.amazon.awssdk.core.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:40)
at software.amazon.awssdk.core.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:239)
at software.amazon.awssdk.core.http.pipeline.stages.TimerExceptionHandlingStage.execute(TimerExceptionHandlingStage.java:40)
at software.amazon.awssdk.core.http.pipeline.stages.TimerExceptionHandlingStage.execute(TimerExceptionHandlingStage.java:30)
at software.amazon.awssdk.core.http.pipeline.stages.RetryableStage$RetryExecutor.doExecute(RetryableStage.java:139)
at software.amazon.awssdk.core.http.pipeline.stages.RetryableStage$RetryExecutor.execute(RetryableStage.java:105)
at software.amazon.awssdk.core.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:66)
at software.amazon.awssdk.core.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:47)
at software.amazon.awssdk.core.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:239)
at software.amazon.awssdk.core.http.StreamManagingStage.execute(StreamManagingStage.java:56)
at software.amazon.awssdk.core.http.StreamManagingStage.execute(StreamManagingStage.java:42)
at software.amazon.awssdk.core.http.pipeline.stages.ClientExecutionTimedStage.executeWithTimer(ClientExecutionTimedStage.java:71)
at software.amazon.awssdk.core.http.pipeline.stages.ClientExecutionTimedStage.execute(ClientExecutionTimedStage.java:55)
at software.amazon.awssdk.core.http.pipeline.stages.ClientExecutionTimedStage.execute(ClientExecutionTimedStage.java:39)
at software.amazon.awssdk.core.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:239)
at software.amazon.awssdk.core.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:239)
at software.amazon.awssdk.core.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:35)
at software.amazon.awssdk.core.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:24)
at software.amazon.awssdk.core.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:281)
at software.amazon.awssdk.core.client.SyncClientHandlerImpl.doInvoke(SyncClientHandlerImpl.java:149)
at software.amazon.awssdk.core.client.SyncClientHandlerImpl.invoke(SyncClientHandlerImpl.java:131)
at software.amazon.awssdk.core.client.SyncClientHandlerImpl.execute(SyncClientHandlerImpl.java:100)
at software.amazon.awssdk.core.client.SyncClientHandlerImpl.execute(SyncClientHandlerImpl.java:76)
at software.amazon.awssdk.core.client.SdkClientHandler.execute(SdkClientHandler.java:45)
at software.amazon.awssdk.services.batch.DefaultBatchClient.registerJobDefinition(DefaultBatchClient.java:644)
at cromwell.backend.impl.aws.AwsBatchJob.$anonfun$createDefinition$2(AwsBatchJob.scala:198)
at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:85)
at cats.effect.internals.IORunLoop$.startCancelable(IORunLoop.scala:41)
at cats.effect.internals.IOBracket$BracketStart.run(IOBracket.scala:74)
at cats.effect.internals.Trampoline.cats$effect$internals$Trampoline$$immediateLoop(Trampoline.scala:70)
at cats.effect.internals.Trampoline.startLoop(Trampoline.scala:36)
at cats.effect.internals.TrampolineEC$JVMTrampoline.super$startLoop(TrampolineEC.scala:93)
at cats.effect.internals.TrampolineEC$JVMTrampoline.$anonfun$startLoop$1(TrampolineEC.scala:93)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
at cats.effect.internals.TrampolineEC$JVMTrampoline.startLoop(TrampolineEC.scala:93)
at cats.effect.internals.Trampoline.execute(Trampoline.scala:43)
at cats.effect.internals.TrampolineEC.execute(TrampolineEC.scala:44)
at cats.effect.internals.IOBracket$BracketStart.apply(IOBracket.scala:60)
at cats.effect.internals.IOBracket$BracketStart.apply(IOBracket.scala:41)
at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:134)
at cats.effect.internals.IORunLoop$.start(IORunLoop.scala:34)
at cats.effect.internals.IOBracket$.$anonfun$apply$1(IOBracket.scala:36)
at cats.effect.internals.IOBracket$.$anonfun$apply$1$adapted(IOBracket.scala:33)
at cats.effect.internals.IORunLoop$RestartCallback.start(IORunLoop.scala:328)
at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:117)
at cats.effect.internals.IORunLoop$.start(IORunLoop.scala:34)
at cats.effect.IO.unsafeRunAsync(IO.scala:258)
at cats.effect.IO.unsafeToFuture(IO.scala:345)
at cromwell.backend.impl.aws.AwsBatchAsyncBackendJobExecutionActor.executeAsync(AwsBatchAsyncBackendJobExecutionActor.scala:342)
at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover(StandardAsyncExecutionActor.scala:943)
at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover$(StandardAsyncExecutionActor.scala:935)
at cromwell.backend.impl.aws.AwsBatchAsyncBackendJobExecutionActor.executeOrRecover(AwsBatchAsyncBackendJobExecutionActor.scala:74)
at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustExecuteOrRecover$1(AsyncBackendJobExecutionActor.scala:65)
at cromwell.core.retry.Retry$.withRetry(Retry.scala:38)
at cromwell.backend.async.AsyncBackendJobExecutionActor.withRetry(AsyncBackendJobExecutionActor.scala:61)
at cromwell.backend.async.AsyncBackendJobExecutionActor.cromwell$backend$async$AsyncBackendJobExecutionActor$$robustExecuteOrRecover(AsyncBackendJobExecutionActor.scala:65)
at cromwell.backend.async.AsyncBackendJobExecutionActor$$anonfun$receive$1.applyOrElse(AsyncBackendJobExecutionActor.scala:88)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at cromwell.backend.impl.aws.AwsBatchAsyncBackendJobExecutionActor.aroundReceive(AwsBatchAsyncBackendJobExecutionActor.scala:74)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588)
at akka.actor.ActorCell.invoke(ActorCell.scala:557)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I understand why this is happening: Cromwell is simply sending too many requests to AWS Batch in a short space of time. However, the limit is apparently 500 requests per second (https://forums.aws.amazon.com/thread.jspa?messageID=708581), so perhaps Cromwell is doing something unusual.
In the short term, I think the best solution is to catch this exception, sleep for a while, and then continue sending requests.
Issue Analytics
- State:
- Created: 5 years ago
- Comments: 41 (16 by maintainers)
Top GitHub Comments
Simple workaround: add the following block to the configuration file. It drastically improved the situation for me.
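The configuration block itself was not reproduced in the comment above. Based on the follow-up discussion of concurrent-job-limit, it was presumably something along these lines. Treat this as a hedged sketch: the provider name (AWSBatch here) and the exact nesting are assumptions that must match whatever your own Cromwell configuration declares, and the value 8 is the one reported later in this thread.

```hocon
backend {
  providers {
    # Assumption: the AWS Batch provider is declared under this name;
    # substitute the provider name used in your own configuration.
    AWSBatch {
      config {
        # Cap how many jobs Cromwell runs at once, which indirectly
        # throttles its calls to the AWS Batch API.
        concurrent-job-limit = 8
      }
    }
  }
}
```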
So, setting concurrent-job-limit to 8 did resolve my issue (though I then hit another, unrelated issue instead). However, I don't believe this is the ideal way to resolve it. I (and, I assume, most other users) want maximum concurrency for our jobs; we just want to avoid this error. If Cromwell caught the Too Many Requests error and simply waited a few seconds before retrying the failed requests, it would resolve this issue in a cleaner way.
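The catch-and-retry approach suggested here can be sketched generically. This is only an illustration of the idea, not Cromwell's actual code: retryWithBackoff, the isThrottle message check, and the backoff policy are all invented for this example.

```scala
import scala.concurrent.duration._
import scala.util.{Try, Success, Failure}

// Hedged sketch of catch-sleep-retry with exponential backoff for
// 429 "Too Many Requests" responses. Not Cromwell's implementation.
object Backoff {
  // Assumption for this sketch: treat any exception whose message
  // mentions a 429 status as a throttling error worth retrying.
  private def isThrottle(t: Throwable): Boolean =
    Option(t.getMessage).exists(_.contains("429"))

  def retryWithBackoff[A](maxAttempts: Int, delay: FiniteDuration)(op: => A): A =
    Try(op) match {
      case Success(a) => a
      case Failure(t) if isThrottle(t) && maxAttempts > 1 =>
        Thread.sleep(delay.toMillis)                      // back off before retrying
        retryWithBackoff(maxAttempts - 1, delay * 2)(op)  // double the delay each time
      case Failure(t) => throw t                          // non-throttle errors propagate
    }
}
```

A submission call would then be wrapped as, say, Backoff.retryWithBackoff(5, 1.second) { /* hypothetical registerJobDefinition/submit call */ }, so that transient 429s are absorbed instead of failing the job outright.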