Docker execution will hang waiting for the CID file id the container cannot be created
See original GitHub issueI ran into a problem with cwltool running as part of a Toil workflow. It tried and failed to launch a Docker container, and then sat there forever, with an unreaped, defunct child docker
process. Attaching gdb
showed the Python process sleeping. Here’s the relevant part of the Toil worker log:
[2021-02-17T10:08:41-0800] [MainThread] [D] [toil.cwl.cwltoil] Runtime Context environment: {}
[2021-02-17T10:08:41-0800] [MainThread] [D] [toil.cwl.cwltoil] Running CWL job: {'r': {'location': 'toilfs:2:0:files/for-job/kind-CWLJob/instance-wflk78uk/file-d57a54b0cb6f46c9803b5dea2821d231/out.txt', 'basename': 'out.txt', 'nameroot': 'out', 'nameext': '.txt', 'class': 'File', 'checksum': 'sha1$a3db5c13ff90a36963278c6a39e4ee3c22e2a436', 'size': 2, 'http://commonwl.org/cwltool#generation': 0}, 'script': ordereddict([('class', 'File'), ('location', 'file:///public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tests/updateval.py')])}
[2021-02-17T10:08:41-0800] [MainThread] [D] [cwltool] [job updateval_inplace.cwl] initializing from file:///public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tests/updateval_inplace.cwl
[2021-02-17T10:08:41-0800] [MainThread] [D] [cwltool] [job updateval_inplace.cwl] {
"r": {
"location": "toilfs:2:0:files/for-job/kind-CWLJob/instance-wflk78uk/file-d57a54b0cb6f46c9803b5dea2821d231/out.txt",
"basename": "out.txt",
"nameroot": "out",
"nameext": ".txt",
"class": "File",
"checksum": "sha1$a3db5c13ff90a36963278c6a39e4ee3c22e2a436",
"size": 2,
"http://commonwl.org/cwltool#generation": 0
},
"script": {
"class": "File",
"location": "file:///public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tests/updateval.py",
"basename": "updateval.py",
"nameroot": "updateval",
"nameext": ".py",
"size": 99
}
}
[2021-02-17T10:08:42-0800] [MainThread] [D] [toil.jobStores.abstractJobStore] Unable to import 'toil.jobStores.googleJobStore' as is expected if the corresponding extra was omitted at installation time.
[2021-02-17T10:08:42-0800] [MainThread] [D] [cwltool] [job updateval_inplace.cwl] path mappings is {
"toilfs:2:0:files/for-job/kind-CWLJob/instance-wflk78uk/file-d57a54b0cb6f46c9803b5dea2821d231/out.txt": [
"/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/tmpbyfabii6.tmp",
"/CnduBo/out.txt",
"WritableFile",
false
],
"file:///public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tests/updateval.py": [
"/data/tmp/tmpm20vzgqc/files/no-job/file-53f5505d4451415c9f0c56c0e2fb8b0b/updateval.py",
"/var/lib/cwl/stg1487b8f7-f0c4-44cb-9fb6-2b380ff1edf5/updateval.py",
"File",
true
]
}
[2021-02-17T10:08:42-0800] [MainThread] [D] [cwltool] [job updateval_inplace.cwl] command line bindings is [
{
"position": [
0,
0
],
"datum": "python"
},
{
"position": [
0,
1
],
"valueFrom": "$(inputs.script)"
},
{
"position": [
0,
2
],
"valueFrom": "$(inputs.r.basename)"
}
]
[2021-02-17T10:08:59-0800] [MainThread] [I] [cwltool] ['docker', 'pull', 'python:2.7.15-alpine3.7']
2.7.15-alpine3.7: Pulling from library/python
48ecbb6b270e: Pulling fs layer
81f9ab63a5a5: Pulling fs layer
d11afbf926bd: Pulling fs layer
502a70b94b66: Pulling fs layer
502a70b94b66: Waiting
81f9ab63a5a5: Verifying Checksum
81f9ab63a5a5: Download complete
48ecbb6b270e: Verifying Checksum
48ecbb6b270e: Download complete
502a70b94b66: Verifying Checksum
502a70b94b66: Download complete
d11afbf926bd: Verifying Checksum
d11afbf926bd: Download complete
48ecbb6b270e: Pull complete
81f9ab63a5a5: Pull complete
d11afbf926bd: Pull complete
502a70b94b66: Pull complete
Digest: sha256:95bdfc0e9fbf57ee252ede6fa3d81dc5d7739aab6b867558f22d06b1c1d9ad81
Status: Downloaded newer image for python:2.7.15-alpine3.7
[2021-02-17T10:09:10-0800] [MainThread] [D] [cwltool] [job updateval_inplace.cwl] initial work dir {
"toilfs:2:0:files/for-job/kind-CWLJob/instance-wflk78uk/file-d57a54b0cb6f46c9803b5dea2821d231/out.txt": [
"/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/tmpaf9vmqmz.tmp",
"/CnduBo/out.txt",
"WritableFile",
true
]
}
[2021-02-17T10:09:10-0800] [MainThread] [W] [cwltool] [job updateval_inplace.cwl] Skipping Docker software container '--memory' limit despite presence of ResourceRequirement with ramMin and/or ramMax setting. Consider running with --strict-memory-limit for increased portability assurance.
[2021-02-17T10:09:10-0800] [MainThread] [I] [cwltool] [job updateval_inplace.cwl] /public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/t1cuciawz/tmp-outecotb164$ docker \
run \
-i \
--mount=type=bind,source=/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/t1cuciawz/tmp-outecotb164,target=/CnduBo \
--mount=type=bind,source=/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/t5p82nbyd_3dptcqo,target=/tmp \
--mount=type=bind,source=/data/tmp/tmpm20vzgqc/files/no-job/file-53f5505d4451415c9f0c56c0e2fb8b0b/updateval.py,target=/var/lib/cwl/stg1487b8f7-f0c4-44cb-9fb6-2b380ff1edf5/updateval.py,readonly \
--mount=type=bind,source=/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/tmpaf9vmqmz.tmp,target=/CnduBo/out.txt \
--workdir=/CnduBo \
--read-only=true \
--net=none \
--user=1974:2000 \
--rm \
--env=TMPDIR=/tmp \
--env=HOME=/CnduBo \
--cidfile=/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/t5p82nbydpbcbr5pz/20210217100910-977042.cid \
python:2.7.15-alpine3.7 \
python \
/var/lib/cwl/stg1487b8f7-f0c4-44cb-9fb6-2b380ff1edf5/updateval.py \
out.txt
docker: Error response from daemon: invalid mount config for type "bind": bind source path does not exist.
See 'docker run --help'.
I think that, because the Docker daemon never managed to create the container, the monitoring process here is stuck waiting for a CID file that will never come:
And because _job_popen()
runs the monitoring function to completion before checking to see if the child process being monitored is alive, this turns into a deadlock:
As for why the Docker call is failing, even when cwltool is supposed to be creating the directories being bind mounted, my guess is that I’m running over a not particularly consistent shared file system. When I went in to look, all four of the sources for the bind mounts existed; perhaps it’s possible for the Docker daemon to not see the directory creation as having happened yet, and the filesystem is only really guaranteeing that it won’t be able to make a file with the same name.
cwltool
should detect when the Docker daemon fails before making the CID file, instead of hanging. If possible, it should detect when the Docker daemon is complaining that it can’t see directories we made, and retry for a little bit of real time to allow any shared filesystems to settle.
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (2 by maintainers)
Top GitHub Comments
I would confine the retry logic to the docker call to create the container, instead of backing out to the whole job. It needs to re-run with the same input paths to give their creation a chance to become visible.
This issue has been mentioned on Common Workflow Language Discourse. There might be relevant details there:
https://cwl.discourse.group/t/docker-error-invalid-mount-config-for-type-bind/598/1