Silent delocalizing failure
Hello! I’m trying to use dsub with the --tasks option to run an analysis in 20 chunks. Curiously, the *.logs indicate that the script runs to completion for every task, but only a random subset of tasks actually delocalize their outputs. Furthermore, the tasks that don’t delocalize don’t record any kind of error in the *.logs. dstat -f, however, does identify the tasks that failed.
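For context, the submission looked roughly like the sketch below. The GCS paths and machine shape come from the dstat output further down; the image name, the tasks file name, and the use of --command rather than --script are assumptions on my part.

# Sketch of the submission (image name and tasks.tsv are placeholders).
# tasks.tsv has a header row of "--env CHUNK<TAB>--input INFILE<TAB>--output OUTFILE"
# and one row per chunk (20 rows total).
dsub \
  --provider google-v2 \
  --project sabeti-encode \
  --zones "us-*" \
  --machine-type n1-standard-8 \
  --preemptible \
  --disk-size 200 \
  --logging gs://haddath/sgosai/hff/logs/ \
  --image gcr.io/sabeti-encode/hcr-ff:latest \
  --tasks tasks.tsv \
  --command 'python /app/hcr-ff/call_peaks.py ${INFILE} ${OUTFILE} -ji ${CHUNK} -jr 20 -ws 100 -ss 100' \
  --wait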
Here’s an example of a success:
- create-time: '2019-07-25 02:16:39.297447'
  dsub-version: v0-3-2
  end-time: '2019-07-25 02:32:30.556849'
  envs:
    CHUNK: '3'
  events:
  - name: start
    start-time: 2019-07-25 06:16:42.171100+00:00
  - name: pulling-image
    start-time: 2019-07-25 06:17:32.995391+00:00
  - name: localizing-files
    start-time: 2019-07-25 06:18:34.308943+00:00
  - name: running-docker
    start-time: 2019-07-25 06:18:36.658863+00:00
  - name: delocalizing-files
    start-time: 2019-07-25 06:32:24.497567+00:00
  - name: ok
    start-time: 2019-07-25 06:32:30.556849+00:00
  input-recursives: {}
  inputs:
    INFILE: gs://haddath/sgosai/hff/data/FADS1_rep8detailed.txt
  internal-id: projects/sabeti-encode/operations/1351805964445161078
  job-id: python--sagergosai--190725-021637-18
  job-name: python
  labels: {}
  last-update: '2019-07-25 02:32:30.556849'
  logging: gs://haddath/sgosai/hff/logs/python--sagergosai--190725-021637-18.4.1.log
  mounts: {}
  output-recursives: {}
  outputs:
    OUTFILE: gs://haddath/sgosai/hff/data/FADS1_rep8__3_20.bed
  provider: google-v2
  provider-attributes:
    accelerators: []
    boot-disk-size: 250
    cpu_platform: ''
    disk-size: 200
    disk-type: pd-standard
    enable-stackdriver-monitoring: false
    instance-name: google-pipelines-worker-fae4230d454b3f6e1038535cbcb0da50
    machine-type: n1-standard-8
    network: ''
    preemptible: true
    regions: []
    service-account: default
    subnetwork: ''
    use_private_address: false
    zone: us-west2-c
    zones:
    - us-central1-a
    - us-central1-b
    - us-central1-c
    - us-central1-f
    - us-east1-b
    - us-east1-c
    - us-east1-d
    - us-east4-a
    - us-east4-b
    - us-east4-c
    - us-west1-a
    - us-west1-b
    - us-west1-c
    - us-west2-a
    - us-west2-b
    - us-west2-c
  script: |-
    #!/usr/bin/env bash
    python /app/hcr-ff/call_peaks.py ${INFILE} ${OUTFILE} -ji ${CHUNK} -jr 20 -ws 100 -ss 100
  script-name: python
  start-time: '2019-07-25 02:16:42.171100'
  status: SUCCESS
  status-detail: Success
  status-message: Success
  task-attempt: 1
  task-id: '4'
  user-id: sagergosai
And a failure:
- create-time: '2019-07-25 02:16:39.576571'
  dsub-version: v0-3-2
  end-time: '2019-07-25 02:52:45.047989'
  envs:
    CHUNK: '4'
  events:
  - name: start
    start-time: 2019-07-25 06:16:42.182994+00:00
  - name: pulling-image
    start-time: 2019-07-25 06:17:41.422799+00:00
  - name: localizing-files
    start-time: 2019-07-25 06:18:41.913631+00:00
  - name: running-docker
    start-time: 2019-07-25 06:18:44.379215+00:00
  - name: The assigned worker has failed to complete the operation
    start-time: 2019-07-25 06:52:43.907976+00:00
  input-recursives: {}
  inputs:
    INFILE: gs://haddath/sgosai/hff/data/FADS1_rep8detailed.txt
  internal-id: projects/sabeti-encode/operations/8834123416523977731
  job-id: python--sagergosai--190725-021637-18
  job-name: python
  labels: {}
  last-update: '2019-07-25 02:52:45.047989'
  logging: gs://haddath/sgosai/hff/logs/python--sagergosai--190725-021637-18.5.1.log
  mounts: {}
  output-recursives: {}
  outputs:
    OUTFILE: gs://haddath/sgosai/hff/data/FADS1_rep8__4_20.bed
  provider: google-v2
  provider-attributes:
    accelerators: []
    boot-disk-size: 250
    cpu_platform: ''
    disk-size: 200
    disk-type: pd-standard
    enable-stackdriver-monitoring: false
    instance-name: google-pipelines-worker-1d27f8b0a26375721946e521a550105a
    machine-type: n1-standard-8
    network: ''
    preemptible: true
    regions: []
    service-account: default
    subnetwork: ''
    use_private_address: false
    zone: us-east1-b
    zones:
    - us-central1-a
    - us-central1-b
    - us-central1-c
    - us-central1-f
    - us-east1-b
    - us-east1-c
    - us-east1-d
    - us-east4-a
    - us-east4-b
    - us-east4-c
    - us-west1-a
    - us-west1-b
    - us-west1-c
    - us-west2-a
    - us-west2-b
    - us-west2-c
  script: |-
    #!/usr/bin/env bash
    python /app/hcr-ff/call_peaks.py ${INFILE} ${OUTFILE} -ji ${CHUNK} -jr 20 -ws 100 -ss 100
  script-name: python
  start-time: '2019-07-25 02:16:42.182994'
  status: FAILURE
  status-detail: The assigned worker has failed to complete the operation
  status-message: The assigned worker has failed to complete the operation
  task-attempt: 1
  task-id: '5'
  user-id: sagergosai
dsub version: 0.3.2
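For reference, the failed tasks above can be pulled out directly with a status filter on dstat; a minimal sketch (flags as of dsub 0.3.x, job-id taken from the output above):

# List only the FAILURE tasks for this job.
dstat \
  --provider google-v2 \
  --project sabeti-encode \
  --jobs 'python--sagergosai--190725-021637-18' \
  --status 'FAILURE' \
  --full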
Top GitHub Comments
Hi @sjgosai!
For this last example, there is no indication that delocalization ever actually started. It looks like the “Worker” on the node failed to check in prior to delocalization. Even though your user command did finish, it is very possible that you were right on the edge of running out of memory and the OOM killer may have killed the Pipelines API Worker.
Our first recommendation here is to increase the amount of memory available on the VM. See if that makes a difference in the success-without-retry rate.
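For example (a sketch, not the exact command used here; the image and tasks file are the same placeholders as above), resubmitting the same tasks on a high-memory machine type would look like:

# Same job, but on n1-highmem-8 (52 GB RAM) instead of n1-standard-8 (30 GB).
# --retries, where your dsub version supports it, re-runs failed or preempted
# attempts automatically.
dsub \
  --provider google-v2 \
  --project sabeti-encode \
  --machine-type n1-highmem-8 \
  --preemptible \
  --retries 2 \
  --logging gs://haddath/sgosai/hff/logs/ \
  --image gcr.io/sabeti-encode/hcr-ff:latest \
  --tasks tasks.tsv \
  --command 'python /app/hcr-ff/call_peaks.py ${INFILE} ${OUTFILE} -ji ${CHUNK} -jr 20 -ws 100 -ss 100'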
Thanks @mbookman, it seems like that’s working.
I’m just a bit surprised because I had originally tested my code with my test data on a ~~n1-highcpu-8~~ VM and had assumed a n1-standard-8 would give me enough wiggle room. The n1-highmem-8 worked perfectly though, so it must’ve been memory related. I’m going to test a few more times for completeness, but I think this probably fixed it. Thanks for your help!
Edit: I lied, I didn’t test on n1-highcpu-8; it was a custom machine with 10 cores and 20 GB of RAM.
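As an aside, a custom shape like that can also be requested through dsub itself via --min-cores/--min-ram, which the google-v2 provider maps to a GCE custom machine type; a sketch under the same placeholder image and tasks file as above:

# Ask for at least 10 cores and 20 GB of RAM instead of a predefined machine type.
dsub \
  --provider google-v2 \
  --project sabeti-encode \
  --min-cores 10 \
  --min-ram 20 \
  --preemptible \
  --logging gs://haddath/sgosai/hff/logs/ \
  --image gcr.io/sabeti-encode/hcr-ff:latest \
  --tasks tasks.tsv \
  --command 'python /app/hcr-ff/call_peaks.py ${INFILE} ${OUTFILE} -ji ${CHUNK} -jr 20 -ws 100 -ss 100'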