FireCloud - cromwell losing jobs?
We are seeing infrequent situations where JES says a call is done but cromwell thinks it is still running.
I see the call starting in the logs:
2017-02-08 18:55:58,500 cromwell-system-akka.dispatchers.engine-dispatcher-963 INFO - WorkflowExecutionActor-fa7e25a2-f51f-4763-9f8a-5a2e5cd1c954 [UUID(fa7e25a2)]: Starting calls: pon_gatk_workflow.PadTargets:NA:1
The only suspicious things I see later in the logs are the messages below. They could be completely unrelated, but they do start to appear 2.5 minutes after the job completes on JES:
2017-02-08 19:17:07,588 cromwell-system-akka.dispatchers.backend-dispatcher-424 INFO - The JES polling actor Actor[akka://cromwell-system/user/cromwell-service/$b/$a/$Enn#762671444] unexpectedly terminated while conducting 100 polls. Making a new one...
java.lang.NullPointerException: null
2017-02-08 19:17:07,588 cromwell-system-akka.dispatchers.backend-dispatcher-246 ERROR - null
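To check how the polling-actor messages line up in time with a given call, a small script along these lines could be run over the Cromwell log. This is only a sketch: the helper names (`parse_ts`, `first_match`, `seconds_between`) are hypothetical and not part of Cromwell, and it assumes every log entry of interest starts with a timestamp in the format shown in the excerpts above.

```python
import re
from datetime import datetime

# Assumption: log entries begin with "YYYY-MM-DD HH:MM:SS,mmm ", as in the excerpts above.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if the line has none."""
    m = TS_RE.match(line)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")


def first_match(lines, needle):
    """Return the timestamp of the first timestamped line containing `needle`."""
    for line in lines:
        if needle in line:
            ts = parse_ts(line)
            if ts is not None:
                return ts
    return None


def seconds_between(lines, start_needle, end_needle):
    """Seconds from the first `start_needle` line to the first `end_needle` line."""
    start = first_match(lines, start_needle)
    end = first_match(lines, end_needle)
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


# The two log lines quoted in this issue, used here as sample input.
sample = [
    "2017-02-08 18:55:58,500 cromwell-system-akka.dispatchers.engine-dispatcher-963 INFO - "
    "WorkflowExecutionActor-fa7e25a2-f51f-4763-9f8a-5a2e5cd1c954 [UUID(fa7e25a2)]: "
    "Starting calls: pon_gatk_workflow.PadTargets:NA:1",
    "2017-02-08 19:17:07,588 cromwell-system-akka.dispatchers.backend-dispatcher-424 INFO - "
    "The JES polling actor Actor[akka://cromwell-system/user/cromwell-service/$b/$a/$Enn#762671444] "
    "unexpectedly terminated while conducting 100 polls. Making a new one...",
]

gap = seconds_between(
    sample,
    "Starting calls: pon_gatk_workflow.PadTargets",
    "unexpectedly terminated while conducting",
)
print(gap)
```

For a real log, `sample` would be replaced by the lines of the log file, for example `open(path).read().splitlines()`. Lines without a leading timestamp, such as the bare `java.lang.NullPointerException: null` line, are skipped.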
For future reference by a FireCloud admin: the operations id is ELSl1PihKxjdhp-6gvr3weYBILma7PWMHyoPcHJvZHVjdGlvblF1ZXVl, the workflow id is fa7e25a2-f51f-4763-9f8a-5a2e5cd1c954, and the workflow was aborted at 3 PM on Feb 8.
Top GitHub Comments
Given the potential impact of this, I’d like to prioritize this, @geoffjentry @katevoss. I think the new policy is to add the priority label, right? Bug-wise, if confirmed, I’d say this is a P2, as there is a painful workaround (but still a priority ticket).
Spoke to @dvoet and he mentioned that since FC hasn’t seen any more of the same symptoms (NPEs and jobs getting lost) for some time now, we can consider this issue resolved for now.