Docker execution will hang waiting for the CID file if the container cannot be created

I ran into a problem with cwltool running as part of a Toil workflow. It tried and failed to launch a Docker container, and then sat there forever, with an unreaped, defunct child docker process. Attaching gdb showed the Python process sleeping. Here’s the relevant part of the Toil worker log:

[2021-02-17T10:08:41-0800] [MainThread] [D] [toil.cwl.cwltoil] Runtime Context environment: {}
[2021-02-17T10:08:41-0800] [MainThread] [D] [toil.cwl.cwltoil] Running CWL job: {'r': {'location': 'toilfs:2:0:files/for-job/kind-CWLJob/instance-wflk78uk/file-d57a54b0cb6f46c9803b5dea2821d231/out.txt', 'basename': 'out.txt', 'nameroot': 'out', 'nameext': '.txt', 'class': 'File', 'checksum': 'sha1$a3db5c13ff90a36963278c6a39e4ee3c22e2a436', 'size': 2, 'http://commonwl.org/cwltool#generation': 0}, 'script': ordereddict([('class', 'File'), ('location', 'file:///public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tests/updateval.py')])}
[2021-02-17T10:08:41-0800] [MainThread] [D] [cwltool] [job updateval_inplace.cwl] initializing from file:///public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tests/updateval_inplace.cwl
[2021-02-17T10:08:41-0800] [MainThread] [D] [cwltool] [job updateval_inplace.cwl] {
    "r": {
        "location": "toilfs:2:0:files/for-job/kind-CWLJob/instance-wflk78uk/file-d57a54b0cb6f46c9803b5dea2821d231/out.txt",
        "basename": "out.txt",
        "nameroot": "out",
        "nameext": ".txt",
        "class": "File",
        "checksum": "sha1$a3db5c13ff90a36963278c6a39e4ee3c22e2a436",
        "size": 2,
        "http://commonwl.org/cwltool#generation": 0
    },
    "script": {
        "class": "File",
        "location": "file:///public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tests/updateval.py",
        "basename": "updateval.py",
        "nameroot": "updateval",
        "nameext": ".py",
        "size": 99
    }
}
[2021-02-17T10:08:42-0800] [MainThread] [D] [toil.jobStores.abstractJobStore] Unable to import 'toil.jobStores.googleJobStore' as is expected if the corresponding extra was omitted at installation time.
[2021-02-17T10:08:42-0800] [MainThread] [D] [cwltool] [job updateval_inplace.cwl] path mappings is {
    "toilfs:2:0:files/for-job/kind-CWLJob/instance-wflk78uk/file-d57a54b0cb6f46c9803b5dea2821d231/out.txt": [
        "/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/tmpbyfabii6.tmp",
        "/CnduBo/out.txt",
        "WritableFile",
        false
    ],
    "file:///public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tests/updateval.py": [
        "/data/tmp/tmpm20vzgqc/files/no-job/file-53f5505d4451415c9f0c56c0e2fb8b0b/updateval.py",
        "/var/lib/cwl/stg1487b8f7-f0c4-44cb-9fb6-2b380ff1edf5/updateval.py",
        "File",
        true
    ]
}
[2021-02-17T10:08:42-0800] [MainThread] [D] [cwltool] [job updateval_inplace.cwl] command line bindings is [
    {
        "position": [
            0,
            0
        ],
        "datum": "python"
    },
    {
        "position": [
            0,
            1
        ],
        "valueFrom": "$(inputs.script)"
    },
    {
        "position": [
            0,
            2
        ],
        "valueFrom": "$(inputs.r.basename)"
    }
]
[2021-02-17T10:08:59-0800] [MainThread] [I] [cwltool] ['docker', 'pull', 'python:2.7.15-alpine3.7']
2.7.15-alpine3.7: Pulling from library/python
48ecbb6b270e: Pulling fs layer
81f9ab63a5a5: Pulling fs layer
d11afbf926bd: Pulling fs layer
502a70b94b66: Pulling fs layer
502a70b94b66: Waiting
81f9ab63a5a5: Verifying Checksum
81f9ab63a5a5: Download complete
48ecbb6b270e: Verifying Checksum
48ecbb6b270e: Download complete
502a70b94b66: Verifying Checksum
502a70b94b66: Download complete
d11afbf926bd: Verifying Checksum
d11afbf926bd: Download complete
48ecbb6b270e: Pull complete
81f9ab63a5a5: Pull complete
d11afbf926bd: Pull complete
502a70b94b66: Pull complete
Digest: sha256:95bdfc0e9fbf57ee252ede6fa3d81dc5d7739aab6b867558f22d06b1c1d9ad81
Status: Downloaded newer image for python:2.7.15-alpine3.7
[2021-02-17T10:09:10-0800] [MainThread] [D] [cwltool] [job updateval_inplace.cwl] initial work dir {
    "toilfs:2:0:files/for-job/kind-CWLJob/instance-wflk78uk/file-d57a54b0cb6f46c9803b5dea2821d231/out.txt": [
        "/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/tmpaf9vmqmz.tmp",
        "/CnduBo/out.txt",
        "WritableFile",
        true
    ]
}
[2021-02-17T10:09:10-0800] [MainThread] [W] [cwltool] [job updateval_inplace.cwl] Skipping Docker software container '--memory' limit despite presence of ResourceRequirement with ramMin and/or ramMax setting. Consider running with --strict-memory-limit for increased portability assurance.
[2021-02-17T10:09:10-0800] [MainThread] [I] [cwltool] [job updateval_inplace.cwl] /public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/t1cuciawz/tmp-outecotb164$ docker \
    run \
    -i \
    --mount=type=bind,source=/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/t1cuciawz/tmp-outecotb164,target=/CnduBo \
    --mount=type=bind,source=/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/t5p82nbyd_3dptcqo,target=/tmp \
    --mount=type=bind,source=/data/tmp/tmpm20vzgqc/files/no-job/file-53f5505d4451415c9f0c56c0e2fb8b0b/updateval.py,target=/var/lib/cwl/stg1487b8f7-f0c4-44cb-9fb6-2b380ff1edf5/updateval.py,readonly \
    --mount=type=bind,source=/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/tmpaf9vmqmz.tmp,target=/CnduBo/out.txt \
    --workdir=/CnduBo \
    --read-only=true \
    --net=none \
    --user=1974:2000 \
    --rm \
    --env=TMPDIR=/tmp \
    --env=HOME=/CnduBo \
    --cidfile=/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/t5p82nbydpbcbr5pz/20210217100910-977042.cid \
    python:2.7.15-alpine3.7 \
    python \
    /var/lib/cwl/stg1487b8f7-f0c4-44cb-9fb6-2b380ff1edf5/updateval.py \
    out.txt
docker: Error response from daemon: invalid mount config for type "bind": bind source path does not exist.
See 'docker run --help'.

I think that, because the Docker daemon never managed to create the container, the monitoring process here is stuck waiting for a CID file that will never come:

https://github.com/common-workflow-language/cwltool/blob/70dafe0d36ac40246cb2629c08993e68ef36165d/cwltool/job.py#L848-L864

And because _job_popen() runs the monitoring function to completion before checking to see if the child process being monitored is alive, this turns into a deadlock:

https://github.com/common-workflow-language/cwltool/blob/70dafe0d36ac40246cb2629c08993e68ef36165d/cwltool/job.py#L972-L974
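One way to avoid this kind of deadlock is to have the wait loop check the child process as well as the CID file. The following is only a minimal sketch of that idea, not cwltool's actual code: `wait_for_cid`, `_read_cid`, and the `poll_interval` and `timeout` parameters are hypothetical names chosen for illustration.

```python
import subprocess
import time
from typing import Optional


def _read_cid(cidfile: str) -> Optional[str]:
    """Return the container ID stored in cidfile, or None if it is missing or empty."""
    try:
        with open(cidfile) as handle:
            return handle.read().strip() or None
    except FileNotFoundError:
        return None


def wait_for_cid(
    process: subprocess.Popen,
    cidfile: str,
    poll_interval: float = 0.1,
    timeout: float = 30.0,
) -> Optional[str]:
    """Wait for the CID file, but give up once the docker client has exited.

    Hypothetical helper: returns the container ID, or None if the child
    exited (or the timeout passed) before a container ID was written.
    """
    deadline = time.monotonic() + timeout
    while True:
        cid = _read_cid(cidfile)
        if cid:
            return cid
        # poll() also reaps the child, so it does not linger as a defunct process.
        if process.poll() is not None:
            # The child is gone; look once more in case the file landed just before exit.
            return _read_cid(cidfile)
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
```

With a loop shaped like this, a `docker run` that fails before the container is created makes the wait return `None` instead of sleeping forever, and the caller can then inspect `process.returncode`.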

As for why the Docker call is failing, even though cwltool is supposed to create the directories being bind mounted, my guess is that I’m running on a shared file system that is not particularly consistent. When I went in to look, all four of the bind mount sources existed. Perhaps the Docker daemon can fail to see a directory creation that has already happened, with the filesystem only really guaranteeing that the daemon won’t be able to make a file with the same name.

cwltool should detect when the Docker daemon fails before making the CID file, instead of hanging. If possible, it should also detect when the Docker daemon is complaining that it can’t see directories we made, and retry for a short period of wall-clock time to allow any shared filesystems to settle.
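The retry half of this request could look roughly like the sketch below. It is an illustration under stated assumptions, not a patch for cwltool: `run_with_mount_retry`, `attempts`, `delay`, and the injectable `run` parameter are hypothetical, and the only thing taken from the log above is the error text it matches on.

```python
import subprocess
import time
from typing import Callable, Sequence

# Error text copied from the daemon response in the log above.
BIND_SOURCE_MISSING = "bind source path does not exist"


def run_with_mount_retry(
    command: Sequence[str],
    attempts: int = 5,
    delay: float = 1.0,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> subprocess.CompletedProcess:
    """Re-run the same command while Docker reports a missing bind source.

    Hypothetical helper: only the missing-bind-source error is retried;
    success or any other failure is returned to the caller immediately.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        # Capture stderr as text so the daemon's message can be inspected.
        result = run(list(command), stderr=subprocess.PIPE, text=True)
        missing = BIND_SOURCE_MISSING in (result.stderr or "")
        if result.returncode == 0 or not missing:
            return result
        if attempt < attempts:
            # Give the shared filesystem some real time to settle.
            time.sleep(delay)
    return result
```

The command is re-run unchanged, with the same input paths, so that directories which already exist get a chance to become visible to the daemon. The sketch captures stderr rather than streaming it and ignores stdin and CID file handling, all of which a real integration would need to consider.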

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

adamnovak commented, Feb 18, 2021 (1 reaction)

I would confine the retry logic to the docker call to create the container, instead of backing out to the whole job. It needs to re-run with the same input paths to give their creation a chance to become visible.

cwl-bot commented, May 10, 2022 (0 reactions)

This issue has been mentioned on Common Workflow Language Discourse. There might be relevant details there:

https://cwl.discourse.group/t/docker-error-invalid-mount-config-for-type-bind/598/1
