
[openmetrics] Offer option to send histograms as bucket counters AND distribution metrics

See original GitHub issue

Output of the info page

===============
Agent (v7.16.1)
===============
  Status date: 2020-02-25 22:22:05.129654 UTC
  Agent start: 2020-02-25 19:45:25.923801 UTC
  Pid: 338
  Go Version: go1.12.9
  Python Version: 3.7.4
  Build arch: amd64
  Check Runners: 4
  Log Level: WARN
  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d
  Clocks
  ======
    NTP offset: -6.989ms
    System UTC time: 2020-02-25 22:22:05.129654 UTC
  Host Info
  =========
    bootTime: 2020-02-10 21:50:08.000000 UTC
    kernelVersion: 3.10.0-1062.1.1.el7.x86_64
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: 10.2
    procs: 67
    uptime: 357h55m27s
    virtualizationRole: guest
    virtualizationSystem: docker
  Hostnames
  =========
    host_aliases: [ocp-app-01q.lab1.bwnet.us]
    hostname: ocp-app-01q.lab1.bwnet.us
    socket-fqdn: datadog-agent-jnpb5
    socket-hostname: datadog-agent-jnpb5
    host tags:
      [cluster:cluster.lab1]
    hostname provider: container
    unused hostname providers:
      aws: not retrieving hostname from AWS: the host is not an ECS instance, and other providers already retrieve non-default hostnames
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: Get http://169.254.169.254/computeMetadata/v1/instance/hostname: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Metadata
  ========
    hostname_source: container
=========
Collector
=========
  Running Checks
  ==============
    cpu
    ---
      Instance ID: cpu [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
      Total Runs: 627
      Metric Samples: Last Run: 6, Total: 3,756
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
    disk (2.5.3)
    ------------
      Instance ID: disk:e5dffb8bef24336f [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/disk.d/conf.yaml.default
      Total Runs: 626
      Metric Samples: Last Run: 452, Total: 282,952
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 1.378s
    docker
    ------
      Instance ID: docker [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/docker.d/conf.yaml.default
      Total Runs: 619
      Metric Samples: Last Run: 8,820, Total: 1 M
      Events: Last Run: 0, Total: 716
      Service Checks: Last Run: 1, Total: 619
      Average Execution Time : 3.658s
    file_handle
    -----------
      Instance ID: file_handle [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default
      Total Runs: 626
      Metric Samples: Last Run: 5, Total: 3,130
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
    io
    --
      Instance ID: io [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/io.d/conf.yaml.default
      Total Runs: 626
      Metric Samples: Last Run: 455, Total: 284,515
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 3ms
    kubelet (3.4.0)
    ---------------
      Instance ID: kubelet:d884b5186b651429 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubelet.d/conf.yaml.default
      Total Runs: 626
      Metric Samples: Last Run: 3,814, Total: 1 M
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 4, Total: 2,504
      Average Execution Time : 6.386s
    load
    ----
      Instance ID: load [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/load.d/conf.yaml.default
      Total Runs: 626
      Metric Samples: Last Run: 6, Total: 3,756
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
    memory
    ------
      Instance ID: memory [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/memory.d/conf.yaml.default
      Total Runs: 626
      Metric Samples: Last Run: 17, Total: 10,642
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
    network (1.12.2)
    ----------------
      Instance ID: network:e0204ad63d43c949 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/network.d/conf.yaml.default
      Total Runs: 626
      Metric Samples: Last Run: 31, Total: 19,406
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 105ms
    ntp
    ---
      Instance ID: ntp:d884b5186b651429 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/ntp.d/conf.yaml.default
      Total Runs: 11
      Metric Samples: Last Run: 1, Total: 11
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 11
      Average Execution Time : 981ms
    openmetrics (1.3.0)
    -------------------
      Instance ID: openmetrics:argo:24ca5ec29bda41f3 [OK]
      Configuration Source: kubelet:docker://bcc5c94c61083c1fe11de4936f12f733ececa08414e66c44bdd5ac800d673246
      Total Runs: 86
      Metric Samples: Last Run: 89, Total: 7,627
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 86
      Average Execution Time : 29ms
      Instance ID: openmetrics:argo:e822becabdb98fb [OK]
      Configuration Source: kubelet:docker://d5414e1f44598239e76bef98f8ac623d20596ca8e99fc960c80d827faeb8a208
      Total Runs: 158
      Metric Samples: Last Run: 84, Total: 13,023
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 158
      Average Execution Time : 50ms
      Instance ID: openmetrics:jenkins:1cb698262b8693f [OK]
      Configuration Source: kubelet:docker://31b523ea583d7a7698dddde960d108432d1db81be4df5e4d7b40f2a77c628a31
      Total Runs: 626
      Metric Samples: Last Run: 1,223, Total: 765,570
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 626
      Average Execution Time : 493ms
      Instance ID: openmetrics:jenkins:4098d8357e61abc9 [OK]
      Configuration Source: kubelet:docker://5c0eb29d1c1f0a4e9cb19983a5b2d02ccf466ea0b3b92f4d00f6c18f588cd765
      Total Runs: 626
      Metric Samples: Last Run: 52, Total: 32,552
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 626
      Average Execution Time : 161ms
      Instance ID: openmetrics:jenkins:6ff158361907b312 [OK]
      Configuration Source: kubelet:docker://c4d08888c74a76777b9e2970f0d77b30c5007e85a71e144ba6a5ae2c78f40a47
      Total Runs: 626
      Metric Samples: Last Run: 1,202, Total: 752,452
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 626
      Average Execution Time : 544ms
      Instance ID: openmetrics:jenkins:804a23f6b7eb0cea [OK]
      Configuration Source: kubelet:docker://fa5867c5238ecd5d3b12e679ca9c5c72108b4c010568137667489ae5a5efb12d
      Total Runs: 626
      Metric Samples: Last Run: 52, Total: 32,552
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 626
      Average Execution Time : 12ms
      Instance ID: openmetrics:one-id:6f2950c5f7b7a6ab [OK]
      Configuration Source: kubelet:docker://5ed03916f3709d75f249ea1c02daf943bdc646ae0e550f7311582f61bd3c443c
      Total Runs: 626
      Metric Samples: Last Run: 13, Total: 8,138
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 626
      Average Execution Time : 56ms
      Instance ID: openmetrics:one-id:a7f9db6bf5c8fae5 [OK]
      Configuration Source: kubelet:docker://c651eea322b9e239ae715cd4b0428c41f2fa678eb642fe36cf763b382f5e2947
      Total Runs: 627
      Metric Samples: Last Run: 239, Total: 149,853
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 627
      Average Execution Time : 188ms
      Instance ID: openmetrics:one-id:aa0c2448f34cb741 [OK]
      Configuration Source: kubelet:docker://a6bd60e32e66bc703aeeda6e3825bd50ab9b189e01f728728c9f1d91aa78cf50
      Total Runs: 626
      Metric Samples: Last Run: 13, Total: 8,138
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 626
      Average Execution Time : 117ms
      Instance ID: openmetrics:one-id:e0441bd923a5b179 [OK]
      Configuration Source: kubelet:docker://99ab07bcc79b4436656415cff0b3ae77faf1ca29e3ce977fb73aceaf28e1e662
      Total Runs: 626
      Metric Samples: Last Run: 224, Total: 140,224
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 626
      Average Execution Time : 234ms
    uptime
    ------
      Instance ID: uptime [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default
      Total Runs: 626
      Metric Samples: Last Run: 1, Total: 626
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
========
JMXFetch
========
  Initialized checks
  ==================
    no checks
  Failed checks
  =============
    no checks
=========
Forwarder
=========
  Transactions
  ============
    CheckRunsV1: 626
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 376
    Metadata: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 2,871
    TimeseriesV1: 626
  API Keys status
  ===============
    API key ending with 6f1b6: API Key valid
==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - 6f1b6
==========
Logs Agent
==========
  Logs Agent is not running
=========
Aggregator
=========
  Checks Metric Sample: 10.4 M
  Dogstatsd Metric Sample: 45,710
  Event: 717
  Events Flushed: 717
  Number Of Flushes: 626
  Series Flushed: 10.1 M
  Service Check: 20,725
  Service Checks Flushed: 21,335
=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 45,709
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Bytes: 2.9 M
  Udp Packet Reading Errors: 0
  Udp Packets: 45,710
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0
=====================
Datadog Cluster Agent
=====================
  - Datadog Cluster Agent endpoint detected: https://172.24.134.181:5005
  Successfully connected to the Datadog Cluster Agent.
  - Running: 1.4.0+commit.f102bd8

Additional environment details (Operating System, Cloud provider, etc):

  • AdoptOpenJDK 12.0.2
  • Spring Boot 2.2.4.RELEASE
  • OpenShift v3.11.98

Steps to reproduce the issue:

  1. We are using Spring Boot’s Micrometer instrumentation library to expose OpenMetrics from our web servers. By default, Micrometer ships with an OpenMetrics summary http_server_requests_seconds. We have enabled Micrometer’s SLA feature to turn it into a histogram with the following Spring properties:
management:
  metrics:
    distribution:
      sla:
        http.server.requests: 50ms,100ms,250ms,500ms

which produces OpenMetrics such as the following (a short parsing sketch follows these steps):

http_server_requests_seconds_bucket{exception="None",method="POST",outcome="SUCCESS",status="200",uri="/oauth2/token",le="0.05",} 9804.0
http_server_requests_seconds_bucket{exception="None",method="POST",outcome="SUCCESS",status="200",uri="/oauth2/token",le="0.1",} 9804.0
http_server_requests_seconds_bucket{exception="None",method="POST",outcome="SUCCESS",status="200",uri="/oauth2/token",le="0.25",} 9804.0
http_server_requests_seconds_bucket{exception="None",method="POST",outcome="SUCCESS",status="200",uri="/oauth2/token",le="0.5",} 9804.0
http_server_requests_seconds_bucket{exception="None",method="POST",outcome="SUCCESS",status="200",uri="/oauth2/token",le="+Inf",} 9804.0
http_server_requests_seconds_count{exception="None",method="POST",outcome="SUCCESS",status="200",uri="/oauth2/token",} 9804.0
http_server_requests_seconds_sum{exception="None",method="POST",outcome="SUCCESS",status="200",uri="/oauth2/token",} 52.225603882
  2. I then wanted to send my metrics as distribution metrics so that I can visualize my latency for various percentiles and combinations of tags. As a result, I’ve attached the following annotations to my pod:
    ad.datadoghq.com/signum.check_names: '["openmetrics"]'
    ad.datadoghq.com/signum.init_configs: '[{}]'
    ad.datadoghq.com/signum.instances: |-
      [
        {
          "prometheus_url": "http://%%host%%:8888/actuator/prometheus",
          "namespace": "one-id",
          "metrics": ["http*"],
          "type_overrides": {},
          "send_histogram_buckets": true,
          "send_monotonic_counter": true,
          "send_distribution_buckets": true,
          "send_distribution_counts_as_monotonic": true
        }
      ]
    ad.datadoghq.com/signum.tags: |-
      {
        "team": "keystone",
        "env": "dev",
        "apiVersion": "v1"
      }
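
An aside on the exposition in step 1: each le bucket is a cumulative counter, so le="0.5" includes everything that also landed in le="0.05", which is why all the buckets above read 9804. A minimal parsing sketch that makes this explicit, assuming the prometheus_client package (not part of the original setup):

# Sketch: parse an abridged copy of the step-1 exposition and print its
# cumulative buckets. Assumes prometheus_client is installed.
from prometheus_client.parser import text_string_to_metric_families

EXPOSITION = """\
# TYPE http_server_requests_seconds histogram
http_server_requests_seconds_bucket{le="0.05"} 9804.0
http_server_requests_seconds_bucket{le="0.5"} 9804.0
http_server_requests_seconds_bucket{le="+Inf"} 9804.0
http_server_requests_seconds_count 9804.0
http_server_requests_seconds_sum 52.225603882
"""

for family in text_string_to_metric_families(EXPOSITION):
    for sample in family.samples:
        # Buckets are cumulative, so the le="+Inf" bucket always matches
        # the _count series.
        print(sample.name, sample.labels, sample.value)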

Describe the results you received:

This works great! I am able to visualize all the percentiles I want. However, when I enable distribution metrics via send_distribution_buckets, I lose my bucket counter metrics, i.e. one_id.http_server_requests_seconds.count, which had tags like upper_bound and status.

Describe the results you expected:

I wanted send_distribution_buckets to still send the raw counters as well because, although it means more metrics (and money), I can use them to calculate SLIs (i.e. were 99% of requests successful/fast?) as well as Apdex scores. For example, we have been using this query in our SLO monitor and dashboards:

(per_second(sum:one_id.http_server_requests_seconds.count{upper_bound:0.5,$env,$cluster}.as_count()) - per_second(sum:one_id.http_server_requests_seconds.count{upper_bound:0.5,status:500,$env,$cluster}.as_count()))
/
per_second(sum:one_id.http_server_requests_seconds.count{upper_bound:none,$env,$cluster}.as_count())
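
In plain terms, that query computes (fast requests - fast-but-failed requests) / all requests. A toy sketch of the same arithmetic over raw bucket counters, with all values hypothetical:

# Sketch: the SLI computed by the query above, expressed over cumulative
# bucket counters. The values are made-up per-second rates; in production
# they come from one_id.http_server_requests_seconds.count.
def sli(fast: float, fast_errors: float, total: float) -> float:
    """Fraction of requests that were both fast (<= 0.5s) and not 5xx."""
    return (fast - fast_errors) / total

fast = 98.0     # upper_bound:0.5, all statuses
fast_500 = 1.0  # upper_bound:0.5, status:500
total = 100.0   # upper_bound:none, i.e. every request

print(f"SLI: {sli(fast, fast_500, total):.2%}")  # SLI: 97.00%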

Percentile estimates are great for visualizing performance trends and triaging, but they are not great performance indicators for alerting. And so, I’m wondering two things:

  1. Is my approach to getting SLIs and percentiles via OpenMetrics sane? Should I be considering other alternatives?

  2. Is it reasonable to propose (and likely contribute) an additional boolean config var for the openmetrics check that would allow users to opt in to still sending the raw counters too, something along the lines of send_bucket_counters_with_distributions?

Thanks for the time and help!

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 2
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

3 reactions
ChristineTChen commented, Nov 5, 2021

Hey @hfuss @adammw @tonglil

Earlier this year, we introduced a new and improved version of the OpenMetrics base check, called OpenMetricsBaseCheckV2.

You can enable V2 by using the openmetrics_endpoint instead of prometheus_url. You need to be on Agent 7.26.x+.

As part of OpenMetrics V2, we now support collecting histogram counters when distribution metrics are enabled: https://github.com/DataDog/integrations-core/blob/763da121dccfc3e7a3e5d6e9e6c57f8e56e8a1b7/openmetrics/datadog_checks/openmetrics/data/conf.yaml.example#L220-L225

For the complete list of options, see the conf.yaml.example: https://github.com/DataDog/integrations-core/blob/master/openmetrics/datadog_checks/openmetrics/data/conf.yaml.example

There were some config option changes, so please check whether your openmetrics instances use any of these and replace the options (types should be compatible, but you can read more about them in the conf.yaml.example; a small rename helper sketch follows this list):

  1. type_overrides is now incorporated into the metrics option
  2. ignore_metrics is now exclude_metrics
  3. prometheus_metrics_prefix is now raw_metric_prefix
  4. health_service_check is now enable_health_service_check
  5. labels_mapper is now rename_labels
  6. label_joins is now share_labels
  7. send_histograms_buckets is now collect_histogram_buckets
  8. send_distribution_buckets is now histogram_buckets_as_distributions
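
The renames above are mechanical, so they lend themselves to a small script. A hypothetical helper sketch (the mapping mirrors the list and the openmetrics_endpoint rename mentioned earlier; merged options such as type_overrides are deliberately left out):

# Sketch: apply the v1 -> v2 option renames listed above to an openmetrics
# instance dict. Merged options (e.g. type_overrides, now folded into
# `metrics`) need manual attention and are not handled here.
V1_TO_V2 = {
    "prometheus_url": "openmetrics_endpoint",
    "ignore_metrics": "exclude_metrics",
    "prometheus_metrics_prefix": "raw_metric_prefix",
    "health_service_check": "enable_health_service_check",
    "labels_mapper": "rename_labels",
    "label_joins": "share_labels",
    "send_histograms_buckets": "collect_histogram_buckets",
    "send_distribution_buckets": "histogram_buckets_as_distributions",
}

def migrate_instance(v1: dict) -> dict:
    """Return a copy of a v1 openmetrics instance using v2 option names."""
    return {V1_TO_V2.get(key, key): value for key, value in v1.items()}

print(migrate_instance({
    "prometheus_url": "http://%%host%%:8888/actuator/prometheus",
    "send_distribution_buckets": True,
}))
# {'openmetrics_endpoint': 'http://%%host%%:8888/actuator/prometheus',
#  'histogram_buckets_as_distributions': True}
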
2 reactions
hfuss commented, Feb 27, 2020

After some consideration: although the workaround ^ is acceptable for the time being, I think it would be preferable to avoid the second, unnecessary HTTP request by adding another config var.

I’ll reopen and work on a PR for this to get folks’ feedback; this should be simple, but there are quite a few boolean vars for openmetrics now.


