Intermittent failure on enable docker task

I’ve had several instances where the setup playbook errors out on the first run, and then completes fine, or at least makes more progress, when I immediately run it again. The step that most often fails on the first run is this one:

TASK: [docker | enable docker] ************************************************
failed: [test-control-03] => {"failed": true}
msg: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.

failed: [test-worker-001] => {"failed": true}
msg: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.

failed: [test-control-02] => {"failed": true}
msg: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.

failed: [test-edge-01] => {"failed": true}
msg: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.

failed: [test-edge-02] => {"failed": true}
msg: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.

failed: [test-control-01] => {"failed": true}
msg: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.

FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
docker | enable docker ------------------------------------------------- 86.38s
docker | install docker packages --------------------------------------- 46.92s
common | install system utilities -------------------------------------- 21.94s
common | update setuptools and pip ------------------------------------- 19.65s
common | install distributive ------------------------------------------ 15.45s
consul-template | install consul-template ------------------------------- 9.30s
collectd | install collectd packages ------------------------------------ 7.43s
docker | install latest device-mapper-libs ------------------------------ 4.33s
common | enable EPEL repo ----------------------------------------------- 3.77s
common | install pip ---------------------------------------------------- 3.76s

After it errors out, if I check the nodes, docker is enabled and running on them. If I then re-run the playbook to install mantl, it moves past this step and most often runs to completion on the second try.
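Since docker does end up running shortly after the failure, the problem looks transient, and a wait-and-retry check might be enough to get past it. Below is a rough sketch of that idea in Python, not something taken from the playbook: the `retry` and `docker_is_active` helpers and the attempt and delay values are hypothetical, and it assumes the nodes use systemd so that `systemctl is-active --quiet docker` reports the service state through its exit code.

```python
import subprocess
import time


def retry(action, attempts=3, delay=5.0, sleep=time.sleep):
    """Call action() until it returns True or the attempts run out.

    Returns the number of tries it took, or None if every try failed.
    """
    for attempt in range(1, attempts + 1):
        if action():
            return attempt
        if attempt < attempts:
            sleep(delay)  # give the service time to settle before retrying
    return None


def docker_is_active():
    """Hypothetical check: True when systemd reports docker as active."""
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", "docker"],
        check=False,  # a non-zero exit just means "not active yet"
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Assumed values; tune them to how slow the nodes are.
    tries = retry(docker_is_active, attempts=5, delay=10.0)
    if tries is None:
        print("docker never became active")
    else:
        print("docker active after", tries, "check(s)")
```

This only papers over the symptom. It does not explain why the first start of docker.service exits with an error, which the "journalctl -xe" output on a failing node should show.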

Relatedly, I’ve noticed that how well the setup runs depends on the virtual resources the nodes have. In my case, a 1 CPU / 4 GB RAM setup often errors out multiple times, and in some cases won’t work at all. I’m currently testing with 4 CPU / 8 GB boxes and having no trouble.