ChannelClosedException due to spurious termination of running containers
Setup
- docker-plugin version: 1.1.4
- jenkins version: 2.107.3 (LTS), but also confirmed with recent weeklies as of May 2018
- docker engine version: 18.02.0~ce-0~debian
- Setup consists of a Jenkins master with one additional host running Docker. Agents are started via the ssh-agent connect method.
Symptom
A low percentage of our Jenkins jobs end up hanging. Affected executors are shown as docker-5d5fedb1d75d0 (offline) (suspended) and the job itself contains the output:
Cannot contact docker-5d5fedb1d75d0: java.io.IOException: remote file operation failed: /home/jenkins/workspace/x/y/z/systemtests@2/test_definitions at hudson.remoting.Channel@1ad4d27c:docker-5d5fedb1d75d0: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on docker-5d5fedb1d75d0 failed. The channel is closing down or has closed down
Jenkins master logs indicate that something decided to terminate running containers. Sometimes this happens after a few minutes, but sometimes only after 1-2 hours for long-running jobs.
May 22 20:57:05 s5 java[29653]: INFO: Trying to run container for node 5d5fedb1d75d0 from image: images-build.blue-yonder.org:5000/by/debian_8_jenkins:stable
May 22 20:57:06 s5 java[29653]: INFO: Started container ID 99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d for node 5d5fedb1d75d0 from image: images-build.blue-yonder.org:5000/by/debian_8_jenkins:stable
May 22 20:57:11 s5 java[29653]: [05/22/18 20:57:11] SSH Launch of docker-5d5fedb1d75d0 on 10.1.32.51 completed in 2,523 ms
May 22 21:08:19 s5 java[29653]: INFO: Disconnected computer for slave 'docker-5d5fedb1d75d0'.
May 22 21:08:19 s5 java[29653]: INFO: Removed Node for slave 'docker-5d5fedb1d75d0'.
May 22 21:08:21 s5 java[29653]: INFO: Stopped container '99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d' for slave 'docker-5d5fedb1d75d0'.
May 22 21:08:23 s5 java[29653]: INFO: Removed container '99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d' for slave 'docker-5d5fedb1d75d0'.
May 22 21:09:16 s5 java[29653]: INFO: Attempting to reconnect docker-5d5fedb1d75d0
May 22 21:09:16 s5 java[29653]: com.github.dockerjava.api.exception.NotFoundException: {"message":"No such container: 99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d"}
May 22 21:11:16 s5 java[29653]: INFO: Attempting to reconnect docker-5d5fedb1d75d0
May 22 21:11:16 s5 java[29653]: com.github.dockerjava.api.exception.NotFoundException: {"message":"No such container: 99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d"}
May 22 21:13:16 s5 java[29653]: INFO: Attempting to reconnect docker-5d5fedb1d75d0
May 22 21:13:16 s5 java[29653]: com.github.dockerjava.api.exception.NotFoundException: {"message":"No such container: 99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d"}
May 22 21:15:16 s5 java[29653]: INFO: Attempting to reconnect docker-5d5fedb1d75d0
May 22 21:15:16 s5 java[29653]: com.github.dockerjava.api.exception.NotFoundException: {"message":"No such container: 99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d"}
May 22 21:15:59 s5 java[29653]: WARNING: Caught exception evaluating: it.containerId in /computer/docker-5d5fedb1d75d0/. Reason: java.lang.reflect.InvocationTargetException
May 22 21:15:59 s5 java[29653]: WARNING: Caught exception evaluating: it.containerId in /computer/docker-5d5fedb1d75d0/. Reason: java.lang.reflect.InvocationTargetException
May 22 21:15:59 s5 java[29653]: WARNING: Caught exception evaluating: it.containerId in /computer/docker-5d5fedb1d75d0/. Reason: java.lang.reflect.InvocationTargetException
May 22 21:15:59 s5 java[29653]: WARNING: Caught exception evaluating: it.containerId in /computer/docker-5d5fedb1d75d0/. Reason: java.lang.reflect.InvocationTargetException
May 22 21:15:59 s5 java[29653]: WARNING: Caught exception evaluating: it.containerId in /computer/docker-5d5fedb1d75d0/. Reason: java.lang.reflect.InvocationTargetException
May 22 21:17:16 s5 java[29653]: INFO: Attempting to reconnect docker-5d5fedb1d75d0
May 22 21:17:16 s5 java[29653]: com.github.dockerjava.api.exception.NotFoundException: {"message":"No such container: 99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d"}
The interesting bit is the Disconnected computer for slave 'docker-5d5fedb1d75d0' message. Its context in the log looks like this:
May 22 21:08:19 s5 java[29653]: May 22, 2018 9:08:19 PM hudson.model.Run execute
May 22 21:08:19 s5 java[29653]: INFO: admin/test_labels #657 main build action completed: SUCCESS
May 22 21:08:19 s5 java[29653]: May 22, 2018 9:08:19 PM io.jenkins.docker.DockerTransientNode$1 println
May 22 21:08:19 s5 java[29653]: INFO: Disconnected computer for slave 'docker-5d5fedb1d75d0'.
May 22 21:08:19 s5 java[29653]: May 22, 2018 9:08:19 PM io.jenkins.docker.DockerTransientNode$1 println
May 22 21:08:19 s5 java[29653]: INFO: Removed Node for slave 'docker-5d5fedb1d75d0'.
May 22 21:08:19 s5 java[29653]: May 22, 2018 9:08:19 PM io.jenkins.docker.DockerTransientNode$1 println
May 22 21:08:19 s5 java[29653]: INFO: Removed container 'f6f330ff09db37fcc9e5dfb813634c0d5276044baec73fc4affadbe9e2d2d898' for slave 'docker-5d69904a90f09'.
May 22 21:08:19 s5 java[29653]: May 22, 2018 9:08:19 PM io.jenkins.docker.DockerTransientNode$1 println
May 22 21:08:19 s5 java[29653]: INFO: Stopped container '827e4ffa9e75253adfe4f5eb83d428f2e4b2602564861cb17a9b299c7b22a9cd' for slave 'docker-5d698fd47871d'.
May 22 21:08:21 s5 java[29653]: May 22, 2018 9:08:21 PM io.jenkins.docker.DockerTransientNode$1 println
May 22 21:08:21 s5 java[29653]: INFO: Stopped container '99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d' for slave 'docker-5d5fedb1d75d0'.
So what seems to happen is that an unrelated job finishes, but then our container gets terminated/disconnected. This could be coincidental due to the ordering of the logs, but the behaviour shows up in all examples I have looked at: a job finishes, and then the container/agent of another job seemingly gets terminated.
The source of the log messages seems to indicate that the docker plugin is involved here. Is there a chance of a race condition that could lead to the described behaviour?
Thanks for your input!

My last message was pretty rubbish - and fairly inaccurate. After some investigation I found an error code 137 after calling docker inspect on a terminated container. The Docker docs suggest this is similar to a kill -9, which hinted towards an external actor killing the container.

My colleague spotted that a flag in Jenkins may be the culprit:
Jenkins -> Manage Jenkins -> Manage Nodes -> Configure -> Response Time

This monitors the round trip network response time from the master to the agent, and if it goes above a threshold repeatedly, it marks the agent offline. This is useful for detecting unresponsive agents, or other network problems that clog the communication channel. More specifically, the master sends a no-op command to the agent, and checks the time it takes to get back the result of this no-op command.
Once I unchecked this flag, we got consistent build passes.
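For reference, the exit code and OOM flag can be read directly via docker inspect's Go-template formatting. This is only a sketch; <container-id> is a placeholder for whichever terminated container you are investigating:

    docker inspect --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}' <container-id>

An exit code of 137 is 128 + 9, i.e. the container's main process was killed with SIGKILL.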
To supplement @willis7's post, what you should look for is the following snippet (the State section) in the docker inspect output for the container(s):
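(The block below is an illustrative, hand-written sketch of what that State section typically looks like for a killed container, not the actual output from this issue; timestamps, Pid and the OOMKilled value will differ in your output.)

    "State": {
        "Status": "exited",
        "Running": false,
        "Paused": false,
        "Restarting": false,
        "OOMKilled": false,
        "Dead": false,
        "ExitCode": 137,
        "Error": "",
        ...
    }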
Note that the ExitCode of 137 is clearly documented as being equivalent to a kill -9, suggesting that the container is being killed. Additionally, note that an ExitCode of 137 may also be triggered by an out-of-memory error, so watch out for the value of OOMKilled in the State of the killed container.

To rule out a coincidence, the "Response Time" monitor was enabled again and the builds began failing intermittently with the ChannelClosedException.
TL;DR: Try disabling “Response Time”
(We did not have to restart Jenkins to see the effect of this setting on the builds)