ChannelClosedException due to spurious termination of running containers

Setup

  • docker-plugin version: 1.1.4
  • Jenkins version: 2.107.3 (LTS), but also confirmed with recent weeklies as of May 2018
  • docker engine version: 18.02.0~ce-0~debian
  • Setup consists of a Jenkins master with one additional host running Docker. Agents are started via the ssh-agent connect method.

Symptom

A low percentage of our Jenkins jobs end up hanging. Affected executors are shown as docker-5d5fedb1d75d0 (offline) (suspended) and the job itself contains the output:

Cannot contact docker-5d5fedb1d75d0:
    java.io.IOException:
        remote file operation failed: /home/jenkins/workspace/x/y/z/systemtests@2/test_definitions at hudson.remoting.Channel@1ad4d27c:docker-5d5fedb1d75d0:
        hudson.remoting.ChannelClosedException: 
            Channel "unknown": Remote call on docker-5d5fedb1d75d0 failed. The channel is closing down or has closed down

Jenkins master logs indicate that something decided to terminate running containers. Sometimes this happens after a few minutes, but for long-running jobs it sometimes happens only after 1-2 hours.

May 22 20:57:05 s5 java[29653]: INFO: Trying to run container for node 5d5fedb1d75d0 from image: images-build.blue-yonder.org:5000/by/debian_8_jenkins:stable
May 22 20:57:06 s5 java[29653]: INFO: Started container ID 99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d for node 5d5fedb1d75d0 from image: images-build.blue-yonder.org:5000/by/debian_8_jenkins:stable
May 22 20:57:11 s5 java[29653]: [05/22/18 20:57:11] SSH Launch of docker-5d5fedb1d75d0 on 10.1.32.51 completed in 2,523 ms
May 22 21:08:19 s5 java[29653]: INFO: Disconnected computer for slave 'docker-5d5fedb1d75d0'.
May 22 21:08:19 s5 java[29653]: INFO: Removed Node for slave 'docker-5d5fedb1d75d0'.
May 22 21:08:21 s5 java[29653]: INFO: Stopped container '99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d' for slave 'docker-5d5fedb1d75d0'.
May 22 21:08:23 s5 java[29653]: INFO: Removed container '99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d' for slave 'docker-5d5fedb1d75d0'.
May 22 21:09:16 s5 java[29653]: INFO: Attempting to reconnect docker-5d5fedb1d75d0
May 22 21:09:16 s5 java[29653]: com.github.dockerjava.api.exception.NotFoundException: {"message":"No such container: 99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d"}
May 22 21:11:16 s5 java[29653]: INFO: Attempting to reconnect docker-5d5fedb1d75d0
May 22 21:11:16 s5 java[29653]: com.github.dockerjava.api.exception.NotFoundException: {"message":"No such container: 99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d"}
May 22 21:13:16 s5 java[29653]: INFO: Attempting to reconnect docker-5d5fedb1d75d0
May 22 21:13:16 s5 java[29653]: com.github.dockerjava.api.exception.NotFoundException: {"message":"No such container: 99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d"}
May 22 21:15:16 s5 java[29653]: INFO: Attempting to reconnect docker-5d5fedb1d75d0
May 22 21:15:16 s5 java[29653]: com.github.dockerjava.api.exception.NotFoundException: {"message":"No such container: 99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d"}
May 22 21:15:59 s5 java[29653]: WARNING: Caught exception evaluating: it.containerId in /computer/docker-5d5fedb1d75d0/. Reason: java.lang.reflect.InvocationTargetException
May 22 21:15:59 s5 java[29653]: WARNING: Caught exception evaluating: it.containerId in /computer/docker-5d5fedb1d75d0/. Reason: java.lang.reflect.InvocationTargetException
May 22 21:15:59 s5 java[29653]: WARNING: Caught exception evaluating: it.containerId in /computer/docker-5d5fedb1d75d0/. Reason: java.lang.reflect.InvocationTargetException
May 22 21:15:59 s5 java[29653]: WARNING: Caught exception evaluating: it.containerId in /computer/docker-5d5fedb1d75d0/. Reason: java.lang.reflect.InvocationTargetException
May 22 21:15:59 s5 java[29653]: WARNING: Caught exception evaluating: it.containerId in /computer/docker-5d5fedb1d75d0/. Reason: java.lang.reflect.InvocationTargetException
May 22 21:17:16 s5 java[29653]: INFO: Attempting to reconnect docker-5d5fedb1d75d0
May 22 21:17:16 s5 java[29653]: com.github.dockerjava.api.exception.NotFoundException: {"message":"No such container: 99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d"}

The interesting bit is the message Disconnected computer for slave 'docker-5d5fedb1d75d0'. The context of this log message looks like this:

May 22 21:08:19 s5 java[29653]: May 22, 2018 9:08:19 PM hudson.model.Run execute
May 22 21:08:19 s5 java[29653]: INFO: admin/test_labels #657 main build action completed: SUCCESS
May 22 21:08:19 s5 java[29653]: May 22, 2018 9:08:19 PM io.jenkins.docker.DockerTransientNode$1 println
May 22 21:08:19 s5 java[29653]: INFO: Disconnected computer for slave 'docker-5d5fedb1d75d0'.
May 22 21:08:19 s5 java[29653]: May 22, 2018 9:08:19 PM io.jenkins.docker.DockerTransientNode$1 println
May 22 21:08:19 s5 java[29653]: INFO: Removed Node for slave 'docker-5d5fedb1d75d0'.
May 22 21:08:19 s5 java[29653]: May 22, 2018 9:08:19 PM io.jenkins.docker.DockerTransientNode$1 println
May 22 21:08:19 s5 java[29653]: INFO: Removed container 'f6f330ff09db37fcc9e5dfb813634c0d5276044baec73fc4affadbe9e2d2d898' for slave 'docker-5d69904a90f09'.
May 22 21:08:19 s5 java[29653]: May 22, 2018 9:08:19 PM io.jenkins.docker.DockerTransientNode$1 println
May 22 21:08:19 s5 java[29653]: INFO: Stopped container '827e4ffa9e75253adfe4f5eb83d428f2e4b2602564861cb17a9b299c7b22a9cd' for slave 'docker-5d698fd47871d'.
May 22 21:08:21 s5 java[29653]: May 22, 2018 9:08:21 PM io.jenkins.docker.DockerTransientNode$1 println
May 22 21:08:21 s5 java[29653]: INFO: Stopped container '99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d' for slave 'docker-5d5fedb1d75d0'.

So what seems to happen is that an unrelated job finishes, and then our container gets terminated/disconnected. This could be a coincidence caused by the ordering of the logs, but the behaviour shows up in all the examples I have looked at: a job finishes, and then the container/agent of another job seemingly gets terminated.
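
Because messages for several agents are interleaved in the master log, it can help to group the plugin's lifecycle messages by agent before drawing conclusions about ordering. The following is a minimal sketch, assuming the journal line format shown in the excerpts above; the function name `events_by_agent` and the regular expression are illustrative and not part of the docker plugin.

```python
import re
from collections import defaultdict

# Matches the plugin messages seen in the excerpts, for example:
#   INFO: Stopped container '<id>' for slave '<agent>'.
#   INFO: Disconnected computer for slave '<agent>'.
EVENT = re.compile(
    r"INFO: (?P<action>Disconnected computer|Removed Node|"
    r"Stopped container|Removed container)"
    r"(?: '(?P<container>[0-9a-f]+)')? for slave '(?P<agent>[^']+)'"
)


def events_by_agent(lines):
    """Group (timestamp, action) pairs by agent name, keeping log order."""
    timeline = defaultdict(list)
    for line in lines:
        match = EVENT.search(line)
        if match:
            timestamp = line[:15]  # syslog prefix, e.g. "May 22 21:08:19"
            timeline[match["agent"]].append((timestamp, match["action"]))
    return dict(timeline)


# Three lines copied from the log excerpt above.
SAMPLE = [
    "May 22 21:08:19 s5 java[29653]: INFO: Disconnected computer for slave 'docker-5d5fedb1d75d0'.",
    "May 22 21:08:19 s5 java[29653]: INFO: Removed container 'f6f330ff09db37fcc9e5dfb813634c0d5276044baec73fc4affadbe9e2d2d898' for slave 'docker-5d69904a90f09'.",
    "May 22 21:08:21 s5 java[29653]: INFO: Stopped container '99356f7db1185054da6d62f18f543eb9213e92c950881a1d1e1794115c57dc6d' for slave 'docker-5d5fedb1d75d0'.",
]

for agent, events in events_by_agent(SAMPLE).items():
    print(agent, events)
```

Grouping this way shows at a glance which agents were being cleaned up in the same second, which is the pattern described above.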

The source of the log messages seems to indicate that the docker plugin is involved here. Is there a chance of a race condition that could lead to the described behaviour?

Thanks for your input!

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 12 (5 by maintainers)

Top GitHub Comments

willis7 commented, Sep 27, 2018 (7 reactions)

My last message was pretty rubbish - and fairly inaccurate. After some investigation, I found exit code 137 after calling docker inspect on a terminated container. The Docker docs suggest this is similar to a kill -9, which hinted towards an external actor killing the container.

My colleague spotted that a flag in Jenkins may be the culprit.

Jenkins -> Manage Jenkins -> Manage Nodes -> Configure -> Response Time

This monitors the round trip network response time from the master to the agent, and if it goes above a threshold repeatedly, it marks the agent offline. This is useful for detecting unresponsive agents, or other network problems that clog the communication channel. More specifically, the master sends a no-op command to the agent, and checks the time it takes to get back the result of this no-op command.

Once I unchecked this flag, we got consistent build passes.
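
To make the quoted description more concrete, here is a toy model of such a round-trip check. It is only a sketch and not the Jenkins implementation: the class name, the threshold, the number of consecutive slow replies, and the reset-on-fast-reply behaviour are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ResponseTimeCheck:
    """Toy round-trip monitor; all values here are hypothetical."""

    threshold_ms: float = 5000.0   # assumed threshold, not taken from Jenkins
    max_consecutive_slow: int = 5  # assumed meaning of "repeatedly"
    consecutive_slow: int = 0
    offline: bool = False

    def record(self, round_trip_ms: float) -> bool:
        """Record one no-op round trip and return the offline flag."""
        if round_trip_ms > self.threshold_ms:
            self.consecutive_slow += 1
        else:
            # Assumption: a fast reply resets the streak.
            self.consecutive_slow = 0
        if self.consecutive_slow >= self.max_consecutive_slow:
            self.offline = True
        return self.offline


check = ResponseTimeCheck(threshold_ms=100, max_consecutive_slow=3)
for round_trip in [20, 150, 180, 30, 200, 250, 300]:
    print(round_trip, check.record(round_trip))
```

In this model an agent that is merely slow to answer, for example because it is busy with a heavy build, is flagged the same way as an unreachable one, which would be consistent with the behaviour described in this comment.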

ajorpheus commented, Sep 27, 2018 (4 reactions)

To supplement @willis7’s post, what you should look for is the following snippet in the docker inspect log for the container(s):

"State": {
    "Status": "exited",
    .....
    "OOMKilled": false,
    .....
    "ExitCode": 137,
    "Error": "",
}

Note that an ExitCode of 137 is documented as being equivalent to a kill -9 (128 + 9, where 9 is the signal number of SIGKILL), suggesting that the container is being killed. An ExitCode of 137 may also be triggered by an out-of-memory error, so check the value of OOMKilled in the State of the killed container.
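
The check described above can be sketched in a few lines. This is a minimal illustration, assuming the State object has already been extracted from the docker inspect output; the function name `classify_exit` and the wording of the returned messages are made up for this example.

```python
import json

SIGKILL = 9  # signal number sent by `kill -9`


def classify_exit(state: dict) -> str:
    """Classify the State block of a stopped container."""
    exit_code = state.get("ExitCode")
    if state.get("OOMKilled"):
        return "killed because it ran out of memory"
    # By convention, exit codes above 128 mean "terminated by signal (code - 128)".
    if isinstance(exit_code, int) and exit_code > 128:
        signal_number = exit_code - 128
        if signal_number == SIGKILL:
            return "killed with SIGKILL (kill -9), likely by an external actor"
        return f"terminated by signal {signal_number}"
    return f"exited with code {exit_code}"


# The fields shown in the snippet above, without the elided parts.
sample = json.loads(
    '{"Status": "exited", "OOMKilled": false, "ExitCode": 137, "Error": ""}'
)
print(classify_exit(sample))
```

For the State shown above this reports a SIGKILL that was not caused by the out-of-memory killer, matching the conclusion that something outside the container stopped it.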

To rule out a coincidence, we re-enabled the “Response Time” monitor, and the builds began failing intermittently with the ChannelClosedException again.

TL;DR: Try disabling “Response Time”.

(We did not have to restart Jenkins to see the effect of this setting on the builds)
