
FireCloud - cromwell losing jobs?


We are seeing infrequent cases where JES reports a call as done but Cromwell thinks it is still running.

I see the call starting in the logs:

2017-02-08 18:55:58,500 cromwell-system-akka.dispatchers.engine-dispatcher-963 INFO - WorkflowExecutionActor-fa7e25a2-f51f-4763-9f8a-5a2e5cd1c954 [UUID(fa7e25a2)]: Starting calls: pon_gatk_workflow.PadTargets:NA:1

The only suspicious things I see later in the logs are these messages. They could be completely unrelated; however, they do start to appear 2.5 minutes after the job completes on JES:

2017-02-08 19:17:07,588 cromwell-system-akka.dispatchers.backend-dispatcher-424 INFO - The JES polling actor Actor[akka://cromwell-system/user/cromwell-service/$b/$a/$Enn#762671444] unexpectedly terminated while conducting 100 polls. Making a new one...
java.lang.NullPointerException: null
2017-02-08 19:17:07,588 cromwell-system-akka.dispatchers.backend-dispatcher-246 ERROR - null
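
When checking whether these messages line up with a particular call, it can help to pull the timestamps out of the log mechanically. The following is a minimal sketch, not part of Cromwell: it assumes every log line follows the `timestamp thread LEVEL - message` layout seen in the excerpts above, and the helper names `parse_line` and `summarize` are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpts above:
#   "<date> <time,millis> <thread> <LEVEL> - <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<thread>\S+) (?P<level>[A-Z]+) - (?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, level, message), or None for lines without a timestamp."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # e.g. a bare "java.lang.NullPointerException: null" line
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
    return ts, m.group("level"), m.group("msg")


def summarize(lines, call_name):
    """Find when call_name was started and when polling actors were replaced."""
    started = None
    terminations = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, _level, msg = parsed
        if "Starting calls:" in msg and call_name in msg:
            if started is None:
                started = ts  # keep the first start only
        elif "JES polling actor" in msg and "unexpectedly terminated" in msg:
            terminations.append(ts)
    return started, terminations


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python scan_log.py cromwell.log
    with open(sys.argv[1]) as handle:
        start, terms = summarize(handle, "pon_gatk_workflow.PadTargets")
    print("call started:", start)
    for t in terms:
        delta = (t - start) if start else None
        print("polling actor replaced at", t, "| after start:", delta)
```

This only correlates timestamps. It does not show that the polling actor restart is what caused the call to be lost.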

For future reference by a FireCloud admin: the operations ID is ELSl1PihKxjdhp-6gvr3weYBILma7PWMHyoPcHJvZHVjdGlvblF1ZXVl, the workflow ID is fa7e25a2-f51f-4763-9f8a-5a2e5cd1c954, and the workflow was aborted at 3PM on Feb 8.

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

1 reaction
kcibul commented, Feb 10, 2017

Given the potential impact of this, I’d like to prioritize this @geoffjentry @katevoss. I think the new policy is to add the priority label, right? Bug-wise, if confirmed, I’d say this is a P2, as there is a painful workaround (but it is still a priority ticket).

0 reactions
ruchim commented, Mar 6, 2017

Spoke to @dvoet, and he mentioned that since FC hasn’t seen any more of the same symptoms (NPEs and jobs getting lost) for some time now, we can consider this issue resolved for now.
