Intermittent failure on enable docker task
I've had several instances where the setup playbook fails on the first run, and then running it again immediately either completes fine or at least makes more progress. The step that most often fails on the first run is this one:
TASK: [docker | enable docker] ************************************************
failed: [test-control-03] => {"failed": true}
msg: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
failed: [test-worker-001] => {"failed": true}
msg: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
failed: [test-control-02] => {"failed": true}
msg: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
failed: [test-edge-01] => {"failed": true}
msg: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
failed: [test-edge-02] => {"failed": true}
msg: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
failed: [test-control-01] => {"failed": true}
msg: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.

FATAL: all hosts have already failed -- aborting
PLAY RECAP ********************************************************************
docker | enable docker ------------------------------------------------- 86.38s
docker | install docker packages --------------------------------------- 46.92s
common | install system utilities -------------------------------------- 21.94s
common | update setuptools and pip ------------------------------------- 19.65s
common | install distributive ------------------------------------------ 15.45s
consul-template | install consul-template ------------------------------- 9.30s
collectd | install collectd packages ------------------------------------ 7.43s
docker | install latest device-mapper-libs ------------------------------ 4.33s
common | enable EPEL repo ----------------------------------------------- 3.77s
common | install pip ---------------------------------------------------- 3.76s
After it errors out, if I check the nodes, Docker is enabled and running. If I then re-run the playbook to install Mantl, it progresses further, most often to completion on the second run.
Relatedly, I've noticed that the virtual resources of the nodes affect how well the setup runs. In my case, a 1 CPU / 4 GB RAM setup often errors out multiple times, and in some cases won't work at all. I'm currently testing with 4 CPU / 8 GB boxes and having no trouble.
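Since Docker turns out to be running shortly after the task reports a failure, one possible workaround on slow nodes is to retry the start and poll the unit state instead of failing on the first error. The following is only a minimal sketch of that idea, not something taken from the Mantl playbooks: the function name `wait_for_service`, the retry count, and the delay are all hypothetical, and it assumes a systemd host where `systemctl start` and `systemctl is-active --quiet` are available.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code (0 means success)."""
    return subprocess.run(cmd, check=False).returncode


def wait_for_service(
    name: str = "docker",
    attempts: int = 5,          # hypothetical default, tune for your nodes
    delay: float = 3.0,         # seconds between attempts, also hypothetical
    runner: Callable[[Sequence[str]], int] = run_command,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to start a systemd unit, retrying until it reports active.

    Returns True as soon as the unit is active, or False once all
    attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        # The start job may report an error on a slow node even though
        # the unit comes up a moment later, so its exit code is ignored.
        runner(["systemctl", "start", name])
        # `is-active --quiet` exits 0 only when the unit is active.
        if runner(["systemctl", "is-active", "--quiet", name]) == 0:
            return True
        if attempt < attempts:
            sleep(delay)
    return False
```

Calling `wait_for_service("docker")` as root on a node would block for at most roughly `attempts * delay` seconds. The `runner` and `sleep` parameters exist only so the retry logic can be exercised without touching a real systemd. This hides the symptom and does not explain why the first start fails on small nodes.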
Issue Analytics
- State:
- Created 8 years ago
- Comments:22 (21 by maintainers)
This continues to affect the master branch of Mantl.
@thomasvincent yes, deploying with the hostNetwork=true fix on a8cdf47 works for me with the latest on master.