Workflows occasionally fail due to timeouts on read_* functions
A recent review of Travis test failures revealed that some workflows were failing because functions like read_lines() or read_int() timed out:
```
Bad output 'int_reader.int': Failed to read_int("gs://cloud-cromwell-dev/cromwell_execution/travis/globs/57f6e677-c2aa-4d96-bf33-9591fce20da7/call-int_reader/shard-3/stdout") (reason 1 of 1): Futures timed out after [10 seconds]
at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$handleExecutionSuccess$1(StandardAsyncExecutionActor.scala:851)
```
One possible cause is that the read request sits queued in the I/O actor for longer than the 10-second timeout. The timeout may need to be raised, or output evaluation may need to be retried. Either way this needs a fix: the outputs being evaluated already exist, so this is a bad failure mode.
AC: Depending on the potential causes of this behavior, either retry the evaluation, raise the timeout, or explore another solution to ensure that jobs don't fail because of this timeout.
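To make the "retry or raise the timeout" options concrete, here is a minimal sketch of a per-attempt timeout with bounded retries and exponential backoff. It is written in Python for illustration only and is not Cromwell's implementation (Cromwell is written in Scala); `read_with_retry`, `read_fn`, and all parameter names and defaults are hypothetical.

```python
import concurrent.futures
import time


def read_with_retry(read_fn, timeout_s=10.0, max_attempts=3, backoff_s=1.0):
    """Call read_fn with a per-attempt timeout, retrying only on timeout.

    Hypothetical helper: read_fn stands in for an output read such as
    read_int(); none of these names come from Cromwell.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        # One worker per attempt so a stuck read cannot block the next try.
        executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = executor.submit(read_fn)
        try:
            # Errors other than a timeout propagate to the caller unchanged.
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError as err:
            last_error = err
            if attempt < max_attempts:
                # Exponential backoff: backoff_s, 2*backoff_s, 4*backoff_s, ...
                time.sleep(backoff_s * 2 ** (attempt - 1))
        finally:
            # Do not wait for a timed-out read; it finishes in the background.
            executor.shutdown(wait=False)
    raise TimeoutError(
        f"read timed out after {max_attempts} attempts"
    ) from last_error
```

Retrying only on timeouts keeps genuine read errors visible, and the attempt cap bounds the total wait. If the underlying cause is resource exhaustion rather than a transient queue delay, retries alone may only postpone the failure.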
Issue Analytics
- State:
- Created 5 years ago
- Comments:13 (4 by maintainers)
Top GitHub Comments
We just reduced our use of read functions; removing globs usually helped a lot.
Oh right, I forgot about this comment… sorry 😅 The issue turned out to be with the call-caching strategy we were using. Because a lot of files were being created, Cromwell needed to do a large amount of hashing, which used up all of the available CPUs and eventually led to the timeouts. We changed the call-caching strategy and are now no longer running into this error.