Connection state metrics for dockerized agent and host networking


Output of the info page

====================
Collector (v 5.22.3)
====================

  Status date: 2018-05-15 05:16:49 (1s ago)
  Pid: 6095
  Platform: Linux-4.14.32-coreos-x86_64-with
  Python Version: 2.7.14, 64bit
  Logs: <stderr>, /opt/datadog-agent/logs/collector.log

  Clocks
  ======

    NTP offset: 0.0011 s
    System UTC time: 2018-05-15 05:16:51.565257

  Paths
  =====

    conf.d: /opt/datadog-agent/agent/conf.d
    checks.d: /opt/datadog-agent/agent/checks.d

  Hostnames
  =========

    ec2-hostname: ip-10-0-2-200.ec2.internal
    local-ipv4: 10.0.2.200
    local-hostname: ip-10-0-2-200.ec2.internal
    socket-hostname: ip-10-0-2-200.ec2.internal
    public-hostname: ec2-34-229-87-92.compute-1.amazonaws.com
    hostname: i-0de44ea41cc34c069
    instance-id: i-0de44ea41cc34c069
    public-ipv4: 34.229.87.92
    socket-fqdn: 10.0.2.200

  Checks
  ======

    linux_proc_extras (1.0.0)
    -------------------------
      - instance #0 [ERROR]: 'get_subprocess_output expected output but had none.'
      - Collected 6 metrics, 0 events & 0 service checks

    network (1.4.0)
    ---------------
      - instance #0 [WARNING]
          Warning: Cannot collect connection state: currently with a custom /proc path: /host/proc/1
      - Collected 20 metrics, 0 events & 0 service checks

    ntp (1.0.0)
    -----------
      - Collected 0 metrics, 0 events & 0 service checks

    cassandra_nodetool (0.1.1)
    --------------------------
      - instance #0 [OK]
      - Collected 16 metrics, 0 events & 3 service checks

    consul (1.3.0)
    --------------
      - instance #0 [OK]
      - Collected 1 metric, 0 events & 0 service checks

    disk (1.1.0)
    ------------
      - instance #0 [OK]
      - Collected 34 metrics, 0 events & 0 service checks

    docker_daemon (1.8.0)
    ---------------------
      - instance #0 [OK]
      - Collected 29 metrics, 0 events & 1 service check

    cassandra (5.22.3)
    ------------------
      - instance #cassandra-localhost-7199 [WARNING] collected 350 metrics
          Warning: Number of returned metrics is too high for instance: cassandra-localhost-7199. Please read http://docs.datadoghq.com/integrations/java/ or get in touch with Datadog Support for more details. Truncating to 350 metrics.
      - Collected 350 metrics, 0 events & 0 service checks


  Emitters
  ========

    - http_emitter [OK]

====================
Dogstatsd (v 5.22.3)
====================

  Status date: 2018-05-15 05:16:41 (9s ago)
  Pid: 6093
  Platform: Linux-4.14.32-coreos-x86_64-with
  Python Version: 2.7.14, 64bit
  Logs: <stderr>, /opt/datadog-agent/logs/dogstatsd.log

  Flush count: 147
  Packet Count: 64202
  Packets per second: 56.6
  Metric count: 420
  Event count: 0
  Service check count: 0

====================
Forwarder (v 5.22.3)
====================

  Status date: 2018-05-15 05:16:51 (0s ago)
  Pid: 6094
  Platform: Linux-4.14.32-coreos-x86_64-with
  Python Version: 2.7.14, 64bit
  Logs: <stderr>, /opt/datadog-agent/logs/forwarder.log

  Queue Size: 7704 bytes
  Queue Length: 3
  Flush Count: 498
  Transactions received: 404
  Transactions flushed: 401
  Transactions rejected: 0
  API Key Status: API Key is valid

Additional environment details (Operating System, Cloud provider, etc):

CoreOS 1688.5.3 running on AWS

Steps to reproduce the issue:

  1. Build a Docker image based on the latest Alpine image with the following network.yaml check configuration built in:
init_config:

instances:
  - collect_connection_state: true
    excluded_interfaces:
      - lo
      - lo0
      - docker0
    # Ignore Docker's virtual interfaces:
    excluded_interface_re: veth*
  2. Run the Datadog Agent container with the following mounts (a full example invocation is sketched after this list): -v /var/run/docker.sock:/var/run/docker.sock -v /proc/:/host/proc/:ro -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro -v /etc/passwd:/etc/passwd:ro
  3. Run docker exec -it datadog /opt/datadog-agent/bin/agent info
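
For reference, the mounts from step 2 assemble into a command along these lines; the image tag, the host networking flag, and the API key variable name are illustrative placeholders, not values from the original report:

docker run -d --name datadog \
  --network host \
  -e API_KEY=<your-api-key> \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  -v /etc/passwd:/etc/passwd:ro \
  <your-agent5-based-image>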

Describe the results you received:

The network check fails to collect the host's connection state metrics; the instance reports the warning shown in the info output above ("Cannot collect connection state: currently with a custom /proc path").

Describe the results you expected:

The network check should work as it did in previous versions of the agent.

Additional information you deem important (e.g. issue happens only occasionally):

This is the same issue as #1131. I don’t believe the solution provided in that issue was correct. The problem occurs due to a combination of issues.

First, the solution in #1131 suggested setting procfs_path in process.yaml. However, that setting is not only deprecated, it won’t actually work: the check ignores any procfs_path value that differs from the one in the agent config.

The suggested solution also mentions overriding procfs_path for the network check. However, the network check does not read a procfs_path from its init_config; it only honors the procfs_path from the agent config.

Also, since the procfs_path is now an agent-wide setting, it seems problematic to override it to a value that only fixes one check. Instead, the warning should be ignorable when host networking is used in a container.
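
To make the distinction concrete, here is a minimal sketch; the DD_PROCFS_PATH variable is taken from the comment below, and the per-check snippet only illustrates what, per the analysis above, is ignored:

# conf.d/network.yaml -- a procfs_path set here is ignored
# whenever it differs from the agent-wide setting:
init_config:
  procfs_path: /proc

# Only the agent-wide setting is honored, e.g. set via the
# DD_PROCFS_PATH environment variable when the agent runs in Docker:
DD_PROCFS_PATH=/host/proc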

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

2 reactions
gjwnc commented, Feb 7, 2019

Using the following docker-compose.yaml and Dockerfile I got, as far as I can tell, correct data for the system.net.tcp4.[opening|listening|established|…] metrics.

The key points are that the environment variable DD_PROCFS_PATH is set to /proc (it defaults to /host/proc in a dockerized environment), that network_mode is set to host, and that the relevant tools, ss and netstat, are installed. The last point is why I needed a custom Dockerfile: the iproute2 package installs ss, and net-tools installs netstat. Without the Dockerfile (i.e. without ss and netstat), I got the error Error collecting connection stats. (see https://github.com/DataDog/integrations-core/blob/master/network/datadog_checks/network/network.py#L363).
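
A quick way to confirm that both tools are present in a running agent container is something along these lines (the container name dd-agent comes from the compose file below):

docker exec dd-agent sh -c 'which ss; which netstat'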

Apart from the required Dockerfile, the config is similar to the ones mentioned in #1131.

Dockerfile:

FROM datadog/agent:6.9.0

RUN apt-get update && apt-get install -y iproute2 net-tools

docker-compose.yaml:

version: '2'
services:
  datadog:
    build: .
    container_name: dd-agent
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
      - /mnt/asset/:/mnt/asset:ro
      - /:/mnt/root_partition:ro
      - ./conf.d:/conf.d:ro
    environment:
      - DD_LOG_LEVEL=warning
      - DD_API_KEY=12345
      - SD_BACKEND=docker
      - NON_LOCAL_TRAFFIC=false
      - DD_APM_ENABLED=true
      - DD_PROCFS_PATH=/proc
    network_mode: host
    restart: always
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "3"

I’m not quite sure whether network.d/conf.yaml and/or process.d/conf.yaml are relevant, but here is my config:

conf.d/network.d/conf.yaml:

init_config:
  procfs_path: /proc

instances:
    # Network check only supports one configured instance
  - collect_connection_state: true # set to true to collect TCP connection state metrics, e.g. SYN_SENT, ESTABLISHED
    excluded_interfaces: # the check will collect metrics on all other interfaces
      - lo
      - lo0
# ignore any network interface matching the given regex:
#   excluded_interface_re: eth1.*

conf.d/process.d/conf.yaml:

init_config:
  procfs_path: /proc
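
Once the stack is built and started, the check output can be inspected from inside the container; for Agent 6 the rough equivalent of the info command used earlier is agent status (a sketch, assuming the Agent 6 CLI is on the container's PATH):

docker-compose up -d --build
docker exec -it dd-agent agent status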

0 reactions
stale[bot] commented, Mar 9, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. Note that the issue will not be automatically closed, but this notification will remind us to investigate why there’s been inactivity. Thank you for participating in the Datadog open source community.
