[openmetrics] Offer option to send histograms as bucket counters AND distribution metrics
See original GitHub issue
Output of the info page
===============
Agent (v7.16.1)
===============
Status date: 2020-02-25 22:22:05.129654 UTC
Agent start: 2020-02-25 19:45:25.923801 UTC
Pid: 338
Go Version: go1.12.9
Python Version: 3.7.4
Build arch: amd64
Check Runners: 4
Log Level: WARN
Paths
=====
Config File: /etc/datadog-agent/datadog.yaml
conf.d: /etc/datadog-agent/conf.d
checks.d: /etc/datadog-agent/checks.d
Clocks
======
NTP offset: -6.989ms
System UTC time: 2020-02-25 22:22:05.129654 UTC
Host Info
=========
bootTime: 2020-02-10 21:50:08.000000 UTC
kernelVersion: 3.10.0-1062.1.1.el7.x86_64
os: linux
platform: debian
platformFamily: debian
platformVersion: 10.2
procs: 67
uptime: 357h55m27s
virtualizationRole: guest
virtualizationSystem: docker
Hostnames
=========
host_aliases: [ocp-app-01q.lab1.bwnet.us]
hostname: ocp-app-01q.lab1.bwnet.us
socket-fqdn: datadog-agent-jnpb5
socket-hostname: datadog-agent-jnpb5
host tags:
[cluster:cluster.lab1]
hostname provider: container
unused hostname providers:
aws: not retrieving hostname from AWS: the host is not an ECS instance, and other providers already retrieve non-default hostnames
configuration/environment: hostname is empty
gce: unable to retrieve hostname from GCE: Get http://169.254.169.254/computeMetadata/v1/instance/hostname: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Metadata
========
hostname_source: container
=========
Collector
=========
Running Checks
==============
cpu
---
Instance ID: cpu [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
Total Runs: 627
Metric Samples: Last Run: 6, Total: 3,756
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 0s
disk (2.5.3)
------------
Instance ID: disk:e5dffb8bef24336f [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/disk.d/conf.yaml.default
Total Runs: 626
Metric Samples: Last Run: 452, Total: 282,952
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 1.378s
docker
------
Instance ID: docker [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/docker.d/conf.yaml.default
Total Runs: 619
Metric Samples: Last Run: 8,820, Total: 1 M
Events: Last Run: 0, Total: 716
Service Checks: Last Run: 1, Total: 619
Average Execution Time : 3.658s
file_handle
-----------
Instance ID: file_handle [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default
Total Runs: 626
Metric Samples: Last Run: 5, Total: 3,130
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 0s
io
--
Instance ID: io [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/io.d/conf.yaml.default
Total Runs: 626
Metric Samples: Last Run: 455, Total: 284,515
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 3ms
kubelet (3.4.0)
---------------
Instance ID: kubelet:d884b5186b651429 [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/kubelet.d/conf.yaml.default
Total Runs: 626
Metric Samples: Last Run: 3,814, Total: 1 M
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 4, Total: 2,504
Average Execution Time : 6.386s
load
----
Instance ID: load [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/load.d/conf.yaml.default
Total Runs: 626
Metric Samples: Last Run: 6, Total: 3,756
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 0s
memory
------
Instance ID: memory [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/memory.d/conf.yaml.default
Total Runs: 626
Metric Samples: Last Run: 17, Total: 10,642
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 0s
network (1.12.2)
----------------
Instance ID: network:e0204ad63d43c949 [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/network.d/conf.yaml.default
Total Runs: 626
Metric Samples: Last Run: 31, Total: 19,406
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 105ms
ntp
---
Instance ID: ntp:d884b5186b651429 [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/ntp.d/conf.yaml.default
Total Runs: 11
Metric Samples: Last Run: 1, Total: 11
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 11
Average Execution Time : 981ms
openmetrics (1.3.0)
-------------------
Instance ID: openmetrics:argo:24ca5ec29bda41f3 [OK]
Configuration Source: kubelet:docker://bcc5c94c61083c1fe11de4936f12f733ececa08414e66c44bdd5ac800d673246
Total Runs: 86
Metric Samples: Last Run: 89, Total: 7,627
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 86
Average Execution Time : 29ms
Instance ID: openmetrics:argo:e822becabdb98fb [OK]
Configuration Source: kubelet:docker://d5414e1f44598239e76bef98f8ac623d20596ca8e99fc960c80d827faeb8a208
Total Runs: 158
Metric Samples: Last Run: 84, Total: 13,023
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 158
Average Execution Time : 50ms
Instance ID: openmetrics:jenkins:1cb698262b8693f [OK]
Configuration Source: kubelet:docker://31b523ea583d7a7698dddde960d108432d1db81be4df5e4d7b40f2a77c628a31
Total Runs: 626
Metric Samples: Last Run: 1,223, Total: 765,570
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 626
Average Execution Time : 493ms
Instance ID: openmetrics:jenkins:4098d8357e61abc9 [OK]
Configuration Source: kubelet:docker://5c0eb29d1c1f0a4e9cb19983a5b2d02ccf466ea0b3b92f4d00f6c18f588cd765
Total Runs: 626
Metric Samples: Last Run: 52, Total: 32,552
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 626
Average Execution Time : 161ms
Instance ID: openmetrics:jenkins:6ff158361907b312 [OK]
Configuration Source: kubelet:docker://c4d08888c74a76777b9e2970f0d77b30c5007e85a71e144ba6a5ae2c78f40a47
Total Runs: 626
Metric Samples: Last Run: 1,202, Total: 752,452
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 626
Average Execution Time : 544ms
Instance ID: openmetrics:jenkins:804a23f6b7eb0cea [OK]
Configuration Source: kubelet:docker://fa5867c5238ecd5d3b12e679ca9c5c72108b4c010568137667489ae5a5efb12d
Total Runs: 626
Metric Samples: Last Run: 52, Total: 32,552
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 626
Average Execution Time : 12ms
Instance ID: openmetrics:one-id:6f2950c5f7b7a6ab [OK]
Configuration Source: kubelet:docker://5ed03916f3709d75f249ea1c02daf943bdc646ae0e550f7311582f61bd3c443c
Total Runs: 626
Metric Samples: Last Run: 13, Total: 8,138
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 626
Average Execution Time : 56ms
Instance ID: openmetrics:one-id:a7f9db6bf5c8fae5 [OK]
Configuration Source: kubelet:docker://c651eea322b9e239ae715cd4b0428c41f2fa678eb642fe36cf763b382f5e2947
Total Runs: 627
Metric Samples: Last Run: 239, Total: 149,853
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 627
Average Execution Time : 188ms
Instance ID: openmetrics:one-id:aa0c2448f34cb741 [OK]
Configuration Source: kubelet:docker://a6bd60e32e66bc703aeeda6e3825bd50ab9b189e01f728728c9f1d91aa78cf50
Total Runs: 626
Metric Samples: Last Run: 13, Total: 8,138
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 626
Average Execution Time : 117ms
Instance ID: openmetrics:one-id:e0441bd923a5b179 [OK]
Configuration Source: kubelet:docker://99ab07bcc79b4436656415cff0b3ae77faf1ca29e3ce977fb73aceaf28e1e662
Total Runs: 626
Metric Samples: Last Run: 224, Total: 140,224
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 626
Average Execution Time : 234ms
uptime
------
Instance ID: uptime [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default
Total Runs: 626
Metric Samples: Last Run: 1, Total: 626
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 0s
========
JMXFetch
========
Initialized checks
==================
no checks
Failed checks
=============
no checks
=========
Forwarder
=========
Transactions
============
CheckRunsV1: 626
Dropped: 0
DroppedOnInput: 0
Events: 0
HostMetadata: 0
IntakeV1: 376
Metadata: 0
Requeued: 0
Retried: 0
RetryQueueSize: 0
Series: 0
ServiceChecks: 0
SketchSeries: 0
Success: 2,871
TimeseriesV1: 626
API Keys status
===============
API key ending with 6f1b6: API Key valid
==========
Endpoints
==========
https://app.datadoghq.com - API Key ending with:
- 6f1b6
==========
Logs Agent
==========
Logs Agent is not running
=========
Aggregator
=========
Checks Metric Sample: 10.4 M
Dogstatsd Metric Sample: 45,710
Event: 717
Events Flushed: 717
Number Of Flushes: 626
Series Flushed: 10.1 M
Service Check: 20,725
Service Checks Flushed: 21,335
=========
DogStatsD
=========
Event Packets: 0
Event Parse Errors: 0
Metric Packets: 45,709
Metric Parse Errors: 0
Service Check Packets: 0
Service Check Parse Errors: 0
Udp Bytes: 2.9 M
Udp Packet Reading Errors: 0
Udp Packets: 45,710
Uds Bytes: 0
Uds Origin Detection Errors: 0
Uds Packet Reading Errors: 0
Uds Packets: 0
=====================
Datadog Cluster Agent
=====================
- Datadog Cluster Agent endpoint detected: https://172.24.134.181:5005
Successfully connected to the Datadog Cluster Agent.
- Running: 1.4.0+commit.f102bd8
Additional environment details (Operating System, Cloud provider, etc):
- AdoptOpenJDK 12.0.2
- Spring Boot 2.2.4.RELEASE
- OpenShift v3.11.98
Steps to reproduce the issue:
- We are using Spring Boot's Micrometer instrumentation library to collect OpenMetrics from our webservers. By default, Micrometer ships with an OpenMetrics summary http_server_requests_seconds. We have enabled Micrometer's SLA feature to make it a histogram with the following Spring properties:
management:
  metrics:
    distribution:
      sla:
        http.server.requests: 50ms,100ms,250ms,500ms
which produces OpenMetrics such as:
http_server_requests_seconds_bucket{exception="None",method="POST",outcome="SUCCESS",status="200",uri="/oauth2/token",le="0.05",} 9804.0
http_server_requests_seconds_bucket{exception="None",method="POST",outcome="SUCCESS",status="200",uri="/oauth2/token",le="0.1",} 9804.0
http_server_requests_seconds_bucket{exception="None",method="POST",outcome="SUCCESS",status="200",uri="/oauth2/token",le="0.25",} 9804.0
http_server_requests_seconds_bucket{exception="None",method="POST",outcome="SUCCESS",status="200",uri="/oauth2/token",le="0.5",} 9804.0
http_server_requests_seconds_bucket{exception="None",method="POST",outcome="SUCCESS",status="200",uri="/oauth2/token",le="+Inf",} 9804.0
http_server_requests_seconds_count{exception="None",method="POST",outcome="SUCCESS",status="200",uri="/oauth2/token",} 9804.0
http_server_requests_seconds_sum{exception="None",method="POST",outcome="SUCCESS",status="200",uri="/oauth2/token",} 52.225603882
- I then wanted to send my metrics as distribution metrics so that I could visualize my latency for various percentiles and combinations of tags. To do that, I've attached the following annotations to my pod:
ad.datadoghq.com/signum.check_names: '["openmetrics"]'
ad.datadoghq.com/signum.init_configs: '[{}]'
ad.datadoghq.com/signum.instances: |-
  [
    {
      "prometheus_url": "http://%%host%%:8888/actuator/prometheus",
      "namespace": "one-id",
      "metrics": ["http*"],
      "type_overrides": {},
      "send_histogram_buckets": true,
      "send_monotonic_counter": true,
      "send_distribution_buckets": true,
      "send_distribution_counts_as_monotonic": true
    }
  ]
ad.datadoghq.com/signum.tags: |-
  {
    "team": "keystone",
    "env": "dev",
    "apiVersion": "v1"
  }
Describe the results you received:
This works great! I am able to visualize all the percentiles I want. However, when I enable distribution metrics via send_distribution_buckets, I lose my bucket counter metrics, i.e. one_id.http_server_requests_seconds.count, which had tags like upper_bound and status.
Describe the results you expected:
I wanted send_distribution_buckets to still send the raw counters as well, because although it's more metrics (and money), I can use them to calculate SLIs (i.e. were 99% of requests successful/fast?) as well as Apdex scores. For example, we have been using this query in our SLO monitor and dashboards:
(per_second(sum:one_id.http_server_requests_seconds.count{upper_bound:0.5,$env,$cluster}.as_count()) - per_second(sum:one_id.http_server_requests_seconds.count{upper_bound:0.5,status:500,$env,$cluster}.as_count()))
/
per_second(sum:one_id.http_server_requests_seconds.count{upper_bound:none,$env,$cluster}.as_count())
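Similarly, an Apdex score could be computed from the same cumulative bucket counters. A rough sketch, assuming a satisfied threshold of 0.1s and a tolerating threshold of 0.5s (illustrative thresholds picked from the SLA buckets above, not actual targets):
(sum:one_id.http_server_requests_seconds.count{upper_bound:0.1,$env,$cluster}.as_count()
+ sum:one_id.http_server_requests_seconds.count{upper_bound:0.5,$env,$cluster}.as_count())
/
(2 * sum:one_id.http_server_requests_seconds.count{upper_bound:none,$env,$cluster}.as_count())
Because the buckets are cumulative, the 0.1 bucket holds the satisfied count and the 0.5 bucket holds satisfied plus tolerating requests, so dividing their sum by twice the total reproduces the standard Apdex formula (satisfied + tolerating/2) / total.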
Percentile estimates are great for visualizing performance trends and for triage, but they are not great performance indicators for alerting. So I'm wondering two things:
- Is my approach to getting SLIs and percentiles via OpenMetrics sane? Should I be considering other alternatives?
- Is it reasonable to propose (and likely contribute) an additional boolean config var for the openmetrics check that would allow users to opt in to still sending the raw counters too, something along the lines of send_bucket_counters_with_distributions (a rough sketch follows below)?
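To make that proposal concrete, here is a minimal sketch of how the opt-in might look in the instance config shown earlier. The send_bucket_counters_with_distributions flag is the hypothetical name proposed above and does not exist in the check today:
ad.datadoghq.com/signum.instances: |-
  [
    {
      "prometheus_url": "http://%%host%%:8888/actuator/prometheus",
      "namespace": "one-id",
      "metrics": ["http*"],
      "send_histogram_buckets": true,
      "send_distribution_buckets": true,
      "send_bucket_counters_with_distributions": true
    }
  ]
With such a flag enabled, the check would keep emitting the per-bucket one_id.http_server_requests_seconds.count counters (tagged with upper_bound) alongside the distribution metrics.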
Thanks for the time and help!
Top GitHub Comments
Hey @hfuss @adammw @tonglil
Earlier this year, we introduced a new and improved version of the OpenMetrics base check, called OpenMetricsBaseCheckV2. You can enable V2 by using openmetrics_endpoint instead of prometheus_url. You need to be on Agent 7.26.x+.
As part of OpenMetrics V2, we now support collecting histogram counters when distribution metrics are enabled: https://github.com/DataDog/integrations-core/blob/763da121dccfc3e7a3e5d6e9e6c57f8e56e8a1b7/openmetrics/datadog_checks/openmetrics/data/conf.yaml.example#L220-L225
For the complete list of options, see the conf.yaml.example: https://github.com/DataDog/integrations-core/blob/master/openmetrics/datadog_checks/openmetrics/data/conf.yaml.example
There were some config option changes, so please check if your openmetrics instances use any of these and replace the options (types should be compatible, but you can read more about them in the conf.yaml.example); a sketch of a migrated instance follows the list:
- type_overrides is now incorporated in the metrics option
- ignore_metrics is now exclude_metrics
- prometheus_metrics_prefix is now raw_metric_prefix
- health_service_check is now enable_health_service_check
- labels_mapper is now rename_labels
- label_joins is now share_labels
- send_histograms_buckets is now collect_histogram_buckets
- send_distribution_buckets is now histogram_buckets_as_distributions
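As a rough illustration of the migration, the instance from the pod annotation above could look something like this with the V2 check. The collect_counters_with_distributions name and the regex-style http.* filter are assumptions based on the linked conf.yaml.example (V2 matches metric filters as regular expressions rather than globs); confirm both against the current example file:
ad.datadoghq.com/signum.instances: |-
  [
    {
      "openmetrics_endpoint": "http://%%host%%:8888/actuator/prometheus",
      "namespace": "one-id",
      "metrics": ["http.*"],
      "collect_histogram_buckets": true,
      "histogram_buckets_as_distributions": true,
      "collect_counters_with_distributions": true
    }
  ]
Per the comment above, this lets a single instance emit the histogram counters alongside the distribution metrics.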
After some consideration, although the workaround above is acceptable for the time being, I think it would be preferable to avoid the second, unnecessary HTTP request by adding another config var.
I'll reopen and work on a PR for this to get folks' feedback; it should be simple, but there are quite a few boolean vars for openmetrics now.