[cilium] Cilium Operator metrics are not collected by default

Configuring the integration manually to collect them throws errors.

Output of the info page

root@datadog-c7rmp:/# agent status
Getting the status from the agent.

2021-09-16 14:33:33 UTC | CORE | WARN | (pkg/util/log/log.go:611 in func1) | Deactivating Autoconfig will disable most components. It's recommended to use autoconfig_exclude_features and autoconfig_include_features to activate/deactivate features selectively
===============
Agent (v7.31.0)
===============

  Status date: 2021-09-16 14:33:33.296 UTC (1631802813296)
  Agent start: 2021-09-16 14:32:48.846 UTC (1631802768846)
  Pid: 1
  Go Version: go1.15.13
  Python Version: 3.8.11
  Build arch: amd64
  Agent flavor: agent
  Check Runners: 4
  Log Level: INFO

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: 26µs
    System time: 2021-09-16 14:33:33.296 UTC (1631802813296)

  Host Info
  =========
    bootTime: 2021-09-16 11:03:22 UTC (1631790202000)
    kernelArch: x86_64
    kernelVersion: 5.4.117-58.216.amzn2.x86_64
    os: linux
    platform: ubuntu
    platformFamily: debian
    platformVersion: 21.04
    procs: 146
    uptime: 3h29m37s

  Hostnames
  =========
    ec2-hostname: ip-10-0-67-197.eu-west-2.compute.internal
    host_aliases: [ip-10-0-67-197.eu-west-2.compute.internal-sandbox-infra-2036]
    hostname: i-07f24cfb6cb2e6b70
    instance-id: i-07f24cfb6cb2e6b70
    socket-fqdn: datadog-c7rmp
    socket-hostname: datadog-c7rmp
    host tags:
      cluster_name:sandbox-infra-2036
      env:sandbox-infra-2036
      kube_cluster_name:sandbox-infra-2036
      project:sandbox-infra-2036
      sla_agreement:false
      stack_name:sandbox-infra-2036
      stack_type:sandbox
    hostname provider: aws
    unused hostname providers:
      azure: azure_hostname_style is set to 'os'
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

  Metadata
  ========
    cloud_provider: AWS
    hostname_source: aws

=========
Collector
=========

  Running Checks
  ==============
    
    cilium (1.7.2)
    --------------
      Instance ID: cilium:8f735fdfa8bcd6cb [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cilium.d/auto_conf.yaml
      Total Runs: 3
      Metric Samples: Last Run: 969, Total: 2,907
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 3
      Average Execution Time : 261ms
      Last Execution Date : 2021-09-16 14:33:29 UTC (1631802809000)
      Last Successful Execution Date : 2021-09-16 14:33:29 UTC (1631802809000)
      metadata:
        version.major: 1
        version.minor: 10
        version.patch: 4
        version.raw: 1.10.4
        version.scheme: semver
      
    
    cpu
    ---
      Instance ID: cpu [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
      Total Runs: 3
      Metric Samples: Last Run: 9, Total: 20
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2021-09-16 14:33:26 UTC (1631802806000)
      Last Successful Execution Date : 2021-09-16 14:33:26 UTC (1631802806000)
      
    
    disk (4.4.0)
    ------------
      Instance ID: disk:e5dffb8bef24336f [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/disk.d/conf.yaml.default
      Total Runs: 2
      Metric Samples: Last Run: 204, Total: 408
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 23ms
      Last Execution Date : 2021-09-16 14:33:18 UTC (1631802798000)
      Last Successful Execution Date : 2021-09-16 14:33:18 UTC (1631802798000)
      
    
    docker
    ------
      Instance ID: docker [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/docker.d/conf.yaml.default
      Total Runs: 2
      Metric Samples: Last Run: 248, Total: 496
      Events: Last Run: 2, Total: 2
      Service Checks: Last Run: 1, Total: 2
      Average Execution Time : 66ms
      Last Execution Date : 2021-09-16 14:33:25 UTC (1631802805000)
      Last Successful Execution Date : 2021-09-16 14:33:25 UTC (1631802805000)
      
    
    file_handle
    -----------
      Instance ID: file_handle [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default
      Total Runs: 3
      Metric Samples: Last Run: 5, Total: 15
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2021-09-16 14:33:32 UTC (1631802812000)
      Last Successful Execution Date : 2021-09-16 14:33:32 UTC (1631802812000)
      
    
    io
    --
      Instance ID: io [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/io.d/conf.yaml.default
      Total Runs: 2
      Metric Samples: Last Run: 39, Total: 51
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2021-09-16 14:33:24 UTC (1631802804000)
      Last Successful Execution Date : 2021-09-16 14:33:24 UTC (1631802804000)
      
    
    kubelet (7.0.0)
    ---------------
      Instance ID: kubelet:5bbc63f3938c02f4 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubelet.d/conf.yaml.default
      Total Runs: 2
      Metric Samples: Last Run: 958, Total: 1,901
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 4, Total: 8
      Average Execution Time : 463ms
      Last Execution Date : 2021-09-16 14:33:15 UTC (1631802795000)
      Last Successful Execution Date : 2021-09-16 14:33:15 UTC (1631802795000)
      
    
    load
    ----
      Instance ID: load [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/load.d/conf.yaml.default
      Total Runs: 3
      Metric Samples: Last Run: 6, Total: 18
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2021-09-16 14:33:31 UTC (1631802811000)
      Last Successful Execution Date : 2021-09-16 14:33:31 UTC (1631802811000)
      
    
    memory
    ------
      Instance ID: memory [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/memory.d/conf.yaml.default
      Total Runs: 2
      Metric Samples: Last Run: 18, Total: 36
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2021-09-16 14:33:23 UTC (1631802803000)
      Last Successful Execution Date : 2021-09-16 14:33:23 UTC (1631802803000)
      
    
    network (2.3.0)
    ---------------
      Instance ID: network:d884b5186b651429 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/network.d/conf.yaml.default
      Total Runs: 3
      Metric Samples: Last Run: 73, Total: 219
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 10ms
      Last Execution Date : 2021-09-16 14:33:30 UTC (1631802810000)
      Last Successful Execution Date : 2021-09-16 14:33:30 UTC (1631802810000)
      
    
    ntp
    ---
      Instance ID: ntp:d884b5186b651429 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/ntp.d/conf.yaml.default
      Total Runs: 1
      Metric Samples: Last Run: 1, Total: 1
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 1
      Average Execution Time : 0s
      Last Execution Date : 2021-09-16 14:32:56 UTC (1631802776000)
      Last Successful Execution Date : 2021-09-16 14:32:56 UTC (1631802776000)
      
    
    uptime
    ------
      Instance ID: uptime [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default
      Total Runs: 2
      Metric Samples: Last Run: 1, Total: 2
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2021-09-16 14:33:22 UTC (1631802802000)
      Last Successful Execution Date : 2021-09-16 14:33:22 UTC (1631802802000)
      
========
JMXFetch
========

  Information
  ==================
  Initialized checks
  ==================
    no checks
    
  Failed checks
  =============
    no checks
    
=========
Forwarder
=========

  Transactions
  ============
    Cluster: 0
    CronJob: 0
    DaemonSet: 0
    Deployment: 0
    Dropped: 0
    DroppedOnInput: 0
    Job: 0
    Node: 0
    PersistentVolume: 0
    PersistentVolumeClaim: 0
    Pod: 0
    ReplicaSet: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Service: 0
    StatefulSet: 0

  Transaction Successes
  =====================
    Total number: 7
    Successes By Endpoint:
      check_run_v1: 2
      intake: 3
      series_v1: 2

  API Keys status
  ===============
    API key ending with 330c5: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - 330c5

==========
Logs Agent
==========


  Logs Agent is not running

=========
APM Agent
=========
  Status: Running
  Pid: 1
  Uptime: 44 seconds
  Mem alloc: 10,581,840 bytes
  Hostname: i-07f24cfb6cb2e6b70
  Receiver: 0.0.0.0:8126
  Endpoints:
    https://trace.agent.datadoghq.com

  Receiver (previous minute)
  ==========================
    No traces received in the previous minute.
    Default priority sampling rate: 100.0%

  Writer (previous minute)
  ========================
    Traces: 0 payloads, 0 traces, 0 events, 0 bytes
    Stats: 0 payloads, 0 stats buckets, 0 bytes

=========
Aggregator
=========
  Checks Metric Sample: 6,130
  Dogstatsd Metric Sample: 276
  Event: 3
  Events Flushed: 3
  Number Of Flushes: 2
  Series Flushed: 4,125
  Service Check: 42
  Service Checks Flushed: 33
=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 275
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Bytes: 21,072
  Udp Packet Reading Errors: 0
  Udp Packets: 215
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 1
  Unterminated Metric Errors: 0

=====================
Datadog Cluster Agent
=====================

  - Datadog Cluster Agent endpoint detected: https://172.20.5.161:5005
  Successfully connected to the Datadog Cluster Agent.
  - Running: 1.15.0+commit.6781e85

Additional environment details (Operating System, Cloud provider, etc): Datadog Helm chart 2.22.2, EKS 1.21, Cilium 1.10.4

Steps to reproduce the issue:

  1. Create an EKS cluster
  2. Deploy Datadog with the network policy flavour set to cilium and a custom agent.confd: entry to collect Cilium operator metrics:
  confd:
    cilium.yaml: |-
      instances:
        - agent_endpoint: http://localhost:9090/metrics
          tags:
            - cilium-pod:%%host%%
        - operator_endpoint: http://localhost:6942/metrics
          tags:
            - cilium-pod:%%host%%
  3. Replace AWS CNI with Cilium

Describe the results you received:

$ agent check cilium
...
=== Service Checks ===
[
  {
    "check": "cilium.prometheus.health",
    "host_name": "i-07f24cfb6cb2e6b70",
    "timestamp": 1631803840,
    "status": 0,
    "message": "",
    "tags": [
      "cilium-pod:10.0.67.197",
      "docker_image:quay.io/cilium/cilium",
      "endpoint:http://10.0.67.197:9090/metrics",
      "image_name:quay.io/cilium/cilium",
      "image_tag:v1.10.4",
      "kube_container_name:cilium-agent",
      "kube_daemon_set:cilium",
      "kube_namespace:kube-system",
      "kube_ownerref_kind:daemonset",
      "pod_phase:running",
      "short_image:cilium"
    ]
  }
]
2021-09-16 14:50:44 UTC | CORE | INFO | (pkg/util/kubernetes/clustername/clustername.go:98 in getClusterName) | Using cluster name sandbox-infra-2036 auto discovered from the ec2 API
2021-09-16 14:50:44 UTC | CORE | INFO | (pkg/metadata/host/host_tags.go:88 in GetHostTags) | Adding both tags cluster_name and kube_cluster_name. You can use 'disable_cluster_name_tag_key' in the Agent config to keep the kube_cluster_name tag only
2021-09-16 14:50:44 UTC | CORE | WARN | (pkg/util/gce/gce_tags.go:49 in getCachedTags) | unable to get tags from gce and cache is empty: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/?recursive=true
2021-09-16 14:50:44 UTC | CORE | INFO | (pkg/metadata/host/host.go:170 in getPublicIPv4) | No public IPv4 address found
=========
Collector
=========

  Running Checks
  ==============
    
    cilium (1.7.2)
    --------------
      Instance ID: cilium:50ebf894b0a553d7 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cilium.d/auto_conf.yaml
      Total Runs: 1
      Metric Samples: Last Run: 971, Total: 971
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 1
      Average Execution Time : 295ms
      Last Execution Date : 2021-09-16 14:50:41 UTC (1631803841000)
      Last Successful Execution Date : 2021-09-16 14:50:41 UTC (1631803841000)
      metadata:
        version.major: 1
        version.minor: 10
        version.patch: 4
        version.raw: 1.10.4
        version.scheme: semver
      

=== Service Checks ===
[
  {
    "check": "cilium.prometheus.health",
    "host_name": "i-07f24cfb6cb2e6b70",
    "timestamp": 1631803844,
    "status": 2,
    "message": "",
    "tags": [
      "cilium-pod:%%host%%",
      "endpoint:http://localhost:9090/metrics"
    ]
  }
]
=========
Collector
=========

  Running Checks
  ==============
    
    cilium (1.7.2)
    --------------
      Instance ID: cilium:8eb256ce5f51f2ec [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/cilium.yaml
      Total Runs: 1
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 1
      Average Execution Time : 8ms
      Last Execution Date : 2021-09-16 14:50:44 UTC (1631803844000)
      Last Successful Execution Date : Never
      Error: HTTPConnectionPool(host='localhost', port=9090): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f785920e910>: Failed to establish a new connection: [Errno 111] Connection refused'))
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 169, in _new_conn
          conn = connection.create_connection(
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 96, in create_connection
          raise err
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 86, in create_connection
          sock.connect(sa)
      ConnectionRefusedError: [Errno 111] Connection refused
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
          httplib_response = self._make_request(
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 394, in _make_request
          conn.request(method, url, **httplib_request_kw)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 234, in request
          super(HTTPConnection, self).request(method, url, body=body, headers=headers)
        File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1256, in request
          self._send_request(method, url, body, headers, encode_chunked)
        File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1302, in _send_request
          self.endheaders(body, encode_chunked=encode_chunked)
        File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1251, in endheaders
          self._send_output(message_body, encode_chunked=encode_chunked)
        File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1011, in _send_output
          self.send(msg)
        File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 951, in send
          self.connect()
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 200, in connect
          conn = self._new_conn()
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 181, in _new_conn
          raise NewConnectionError(
      urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f785920e910>: Failed to establish a new connection: [Errno 111] Connection refused
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
          resp = conn.urlopen(
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 755, in urlopen
          retries = retries.increment(
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/retry.py", line 574, in increment
          raise MaxRetryError(_pool, url, error or ResponseError(cause))
      urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=9090): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f785920e910>: Failed to establish a new connection: [Errno 111] Connection refused'))
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 1006, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/base_check.py", line 135, in check
          self.process(scraper_config)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 533, in process
          for metric in self.scrape_metrics(scraper_config):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 470, in scrape_metrics
          response = self.poll(scraper_config)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 780, in poll
          response = self.send_request(endpoint, scraper_config, headers)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 806, in send_request
          return http_handler.get(endpoint, stream=True, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 304, in get
          return self._request('get', url, options)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 368, in _request
          response = self.make_request_aia_chasing(request_method, method, url, new_options, persist)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 373, in make_request_aia_chasing
          response = request_method(url, **new_options)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 76, in get
          return request('get', url, params=params, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 61, in request
          return session.request(method=method, url=url, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 542, in request
          resp = self.send(prep, **send_kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 655, in send
          r = adapter.send(request, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 516, in send
          raise ConnectionError(e, request=request)
      requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=9090): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f785920e910>: Failed to establish a new connection: [Errno 111] Connection refused'))

=== Service Checks ===
[
  {
    "check": "cilium.prometheus.health",
    "host_name": "i-07f24cfb6cb2e6b70",
    "timestamp": 1631803844,
    "status": 2,
    "message": "",
    "tags": [
      "cilium-pod:%%host%%",
      "endpoint:http://localhost:6942/metrics"
    ]
  }
]
=========
Collector
=========

  Running Checks
  ==============
    
    cilium (1.7.2)
    --------------
      Instance ID: cilium:10d6f36b45683928 [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/cilium.yaml
      Total Runs: 1
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 1
      Average Execution Time : 3ms
      Last Execution Date : 2021-09-16 14:50:44 UTC (1631803844000)
      Last Successful Execution Date : Never
      Error: HTTPConnectionPool(host='localhost', port=6942): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f785899a6d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 169, in _new_conn
          conn = connection.create_connection(
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 96, in create_connection
          raise err
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 86, in create_connection
          sock.connect(sa)
      ConnectionRefusedError: [Errno 111] Connection refused
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
          httplib_response = self._make_request(
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 394, in _make_request
          conn.request(method, url, **httplib_request_kw)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 234, in request
          super(HTTPConnection, self).request(method, url, body=body, headers=headers)
        File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1256, in request
          self._send_request(method, url, body, headers, encode_chunked)
        File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1302, in _send_request
          self.endheaders(body, encode_chunked=encode_chunked)
        File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1251, in endheaders
          self._send_output(message_body, encode_chunked=encode_chunked)
        File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1011, in _send_output
          self.send(msg)
        File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 951, in send
          self.connect()
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 200, in connect
          conn = self._new_conn()
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 181, in _new_conn
          raise NewConnectionError(
      urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f785899a6d0>: Failed to establish a new connection: [Errno 111] Connection refused
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
          resp = conn.urlopen(
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 755, in urlopen
          retries = retries.increment(
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/retry.py", line 574, in increment
          raise MaxRetryError(_pool, url, error or ResponseError(cause))
      urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=6942): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f785899a6d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 1006, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/base_check.py", line 135, in check
          self.process(scraper_config)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 533, in process
          for metric in self.scrape_metrics(scraper_config):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 470, in scrape_metrics
          response = self.poll(scraper_config)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 780, in poll
          response = self.send_request(endpoint, scraper_config, headers)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 806, in send_request
          return http_handler.get(endpoint, stream=True, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 304, in get
          return self._request('get', url, options)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 368, in _request
          response = self.make_request_aia_chasing(request_method, method, url, new_options, persist)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 373, in make_request_aia_chasing
          response = request_method(url, **new_options)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 76, in get
          return request('get', url, params=params, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 61, in request
          return session.request(method=method, url=url, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 542, in request
          resp = self.send(prep, **send_kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 655, in send
          r = adapter.send(request, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 516, in send
          raise ConnectionError(e, request=request)
      requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=6942): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f785899a6d0>: Failed to establish a new connection: [Errno 111] Connection refused'))

Check has run only once, if some metrics are missing you can try again with --check-rate to see any other metric if available.
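
The two `Connection refused` errors above mean that nothing accepted a TCP connection on `localhost:9090` or `localhost:6942` from the place where the check ran. A minimal Python sketch for checking this from inside an Agent pod follows. It assumes a Python 3 interpreter is available there (the tracebacks show the Agent embeds Python 3.8); the helper name `endpoint_reachable` is hypothetical and not part of any Datadog API.

```python
import socket
from urllib.parse import urlparse


def endpoint_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    # Fall back to the scheme's default port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError (Errno 111 in the traceback), timeouts and
        # name-resolution failures are all subclasses of OSError.
        return False


if __name__ == "__main__":
    # The two endpoints from the custom cilium.yaml in this issue.
    for url in ("http://localhost:9090/metrics", "http://localhost:6942/metrics"):
        state = "reachable" if endpoint_reachable(url) else "not reachable"
        print(f"{url}: {state}")
```

This only tests that the port accepts connections; it does not verify that the response is valid Prometheus output.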

Describe the results you expected:

The Datadog Agent should recognise that the operator pods do not run on every node, and on the nodes where they do run it should collect the metrics from localhost, because the Cilium operator container runs with hostNetwork: true.

Additional information you deem important (e.g. issue happens only occasionally):

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 10 (1 by maintainers)

Top GitHub Comments

1 reaction
carlosjgp commented, Sep 17, 2021

@ChristineTChen you are absolutely right!

All the changes distracted me from changing the address. Now that we are using the Kubernetes Endpoints provided by the headless Kubernetes service, the autodiscovery configuration should use %%host%% instead of localhost.

I’ll test the changes next week and confirm this.

Thanks!

1 reaction
ChristineTChen commented, Sep 17, 2021

Hey @carlosjgp ,

I noticed that you port-forwarded the port in your earlier comment. The curl to localhost only works here because you forwarded the port of the cilium-operator.

$ kubectl -n kube-system port-forward cilium-operator-6fc456995d-xt4wm 6942 &
$ curl localhost:6942/metrics

I don’t think localhost:6942/metrics is going to yield the same output in the Datadog Agent container. You can test this out by running the curl from an Agent pod.

In your annotations, can you try replacing localhost with %%host%%?

metadata:
  annotations:
    ad.datadoghq.com/endpoints.check_names: |2
        ["cilium"]
    ad.datadoghq.com/endpoints.init_configs: '[{}]'
    ad.datadoghq.com/endpoints.instances: |
        [{
          "operator_endpoint": "http://%%host%%:6942/metrics"
        }]
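
As a rough illustration of what the `%%host%%` template variable in the annotation above is meant to achieve, the Python sketch below substitutes it once per endpoint IP and parses the resulting instance list. This is a simplified model of the idea, not the Agent's actual Autodiscovery implementation; the function name `render_instances` is hypothetical, and the IP is the node address that appears in the output earlier in this issue. In the failing runs above, the tag was reported literally as `cilium-pod:%%host%%`, which suggests the variable was never substituted in the file-based configuration.

```python
import json
from typing import Dict, List


def render_instances(template: str, endpoint_ips: List[str]) -> List[Dict]:
    """Substitute %%host%% in a JSON instance template, once per endpoint IP."""
    rendered: List[Dict] = []
    for ip in endpoint_ips:
        # Plain string replacement stands in for the Agent's template resolution.
        rendered.extend(json.loads(template.replace("%%host%%", ip)))
    return rendered


# Same shape as the ad.datadoghq.com/endpoints.instances annotation above.
TEMPLATE = '[{"operator_endpoint": "http://%%host%%:6942/metrics"}]'

if __name__ == "__main__":
    for instance in render_instances(TEMPLATE, ["10.0.67.197"]):
        print(instance["operator_endpoint"])
```

With one endpoint IP per operator pod, each rendered instance points at an address the Agent can reach, instead of at the Agent container's own localhost.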