Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Workflows occasionally fail due to timeouts on read_* functions

See original GitHub issue

A recent review of Travis test failures revealed that some workflows were failing due to functions like read_lines() or read_int() timing out:

```
Bad output 'int_reader.int': Failed to read_int("gs://cloud-cromwell-dev/cromwell_execution/travis/globs/57f6e677-c2aa-4d96-bf33-9591fce20da7/call-int_reader/shard-3/stdout") (reason 1 of 1): Futures timed out after [10 seconds]
        at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$handleExecutionSuccess$1(StandardAsyncExecutionActor.scala:851)
```

It's possible that requests queued in the I/O actor can wait longer than the 10-second timeout, and that this is the cause. The timeout may need to be raised, or output evaluation retried, but either way this needs a fix: the outputs being evaluated already exist, so this is a bad failure mode.
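
As a rough illustration of that hypothesis (not Cromwell's actual code), the Scala sketch below models the I/O subsystem as a single-threaded queue: the third read would succeed on its own, but because it spends most of the 10-second window waiting behind earlier requests, a fixed-timeout await gives up first. The QueuedReadTimeout object, the readInt helper, the pool size, and the sleep durations are all invented for the example.

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future, TimeoutException}

object QueuedReadTimeout extends App {
  // Hypothetical single-threaded "I/O queue": each read request waits for the ones before it.
  private val executor = Executors.newSingleThreadExecutor()
  implicit val ioPool: ExecutionContext = ExecutionContext.fromExecutor(executor)

  // Stand-in for a read_int() request handled by the I/O subsystem: ~4 seconds of work each.
  def readInt(path: String): Future[Int] = Future { Thread.sleep(4000); 42 }

  val reads = Seq("shard-1", "shard-2", "shard-3").map(s => readInt(s"$s/stdout"))

  try {
    // The third read only starts after ~8 seconds of queueing, so a fixed
    // 10-second await times out even though the read itself would eventually succeed.
    println(Await.result(reads.last, 10.seconds))
  } catch {
    case _: TimeoutException =>
      println("Futures timed out after [10 seconds]") // the failure mode from the report
  } finally {
    executor.shutdownNow()
  }
}
```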

AC: Depending on the potential causes of this behavior, either retry the evaluation, raise the timeout, or explore another solution to ensure that jobs don't fail because of this timeout.
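
One way to read the "retry the evaluation" option is sketched below, assuming some eval: () => Future[A] that re-issues the read on each call. This is a hypothetical helper, not an existing Cromwell API; the names awaitWithRetries, timeout, and attempts are made up for illustration.

```scala
import scala.concurrent.duration._
import scala.concurrent.{Await, Future, TimeoutException}

object RetryOnTimeout {
  // Hypothetical helper (not Cromwell code): re-run a future-producing evaluation a few
  // times before surfacing the timeout as a real failure, instead of failing the job
  // on the first 10-second miss.
  def awaitWithRetries[A](eval: () => Future[A],
                          timeout: FiniteDuration = 10.seconds,
                          attempts: Int = 3): A = {
    def loop(remaining: Int): A =
      try Await.result(eval(), timeout)
      catch {
        case _: TimeoutException if remaining > 1 => loop(remaining - 1)
      }
    loop(attempts)
  }
}
```

Raising the timeout would be the simpler change; a retry instead keeps one slow spell in the I/O queue from failing a job whose outputs already exist.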

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 13 (4 by maintainers)

Top GitHub Comments

1 reaction
illusional commented, Apr 21, 2021

We just reduced our use of read functions; removing globs usually helped a lot.

1 reaction
DavyCats commented, Oct 23, 2018

Oh right, I forgot about this comment… sorry 😅 The issue turned out to be with the call-caching strategy we were using. Because there were a lot of files being created, Cromwell needed to do a large amount of hashing, which used up all of the available CPUs, eventually leading to the timeouts. We changed the call-caching strategy and are no longer running into this error.

Read more comments on GitHub >

Top Results From Across the Web

Increase Timeout for running workflows? - VertiGIS Community
Hi Kevin, We were running into similar timeout problems with a workflow that calls another workflow which then buffers points. In our scenario...
Read more >
Workflow fails "Error occurred" randomly - Nintex Community
I have a client where workflow fails for some items sometimes. Sometimes it goes through ... I suppose it might be timeout at...
Read more >
The 4 Types of Activity timeouts - Temporal
Step 1 - Workflow Worker. An activity SimpleActivity is first invoked inside a Workflow Worker on Task Queue sampleTaskQueue . The precise ...
Read more >
Troubleshoot and diagnose workflow failures - Azure Logic Apps
How to troubleshoot and diagnose problems, errors, and failures in your workflows in Azure Logic Apps.
Read more >
Timeout property if the SAP session has hang or disconnected
However, many other users don't expect the timeout to be enforced at the expense of failing a workflow that could succeed, given a...
Read more >
