Performance degradation caused by high CPU usage when promitor-agent-scraper has to scrape a large set of Azure targets
Report
For a year we have been using Promitor on AKS (managed with Helm) successfully for a subset of our Azure PostgreSQL databases. That setup consists of two deployed Pods, promitor-agent-discovery and promitor-agent-scraper, covering all PostgreSQL databases. At the beginning of this month we wanted to extend it with an additional Pod for other Azure targets such as our SQL databases and virtual machines. The main reason is that we use Prometheus Alertmanager for correlation and alerting, integrated with our notification platform. We store the metrics in Prometheus (Thanos) for a longer period.
While scaling out our Promitor implementation we observed a performance degradation that appears to be related to the number of metrics/targets being scraped. When 500+ metrics are scraped, each Pod almost constantly needs a CPU limit of more than 1 core. On top of this, everything becomes very unstable and the Prometheus metrics are constantly broken, so we sometimes miss metric data points.
The degradation shows up as the following effects:
- CPU consumption gets very high, especially during the Azure scrape (collect) runs. We cap promitor-agent-scraper at 1 core, and it is always 95-99% consumed.
- Because of the high CPU usage, the readiness probe doesn’t get a timely response from the health endpoint (API /v1/health), which causes the Pod to restart the container (CrashLoopBackOff). We tried to mitigate this with a TCP socket probe, which keeps the Pod from being restarted constantly.
- In addition, Prometheus gets timeouts on the metrics endpoint (API /metrics) and sometimes cannot complete the target scrape run. This causes gaps in our metrics, i.e. missing metric data points (values).
Overall, this issue makes Promitor unusable for collecting Azure metrics, since we use them for alerting. We have to be able to rely on the quality/integrity of the metric data to deliver reliable alerting and notifications.
Expected Behavior
We can use Promitor as our preferred integration for bringing Azure metrics into Prometheus at (enterprise) scale.
- We expect overall CPU usage to become more efficient, perhaps by running more work in parallel.
- One idea would be to decouple the agent into multiple containers (failure domain isolation), so that the Azure scraper, the health API and the metrics API don’t impact each other.
Actual Behavior
- CPU consumption gets very high, especially during the Azure scrape (collect) runs. We cap promitor-agent-scraper at 1 core, and it is always 95-99% consumed.
- Because of the high CPU usage, the readiness probe doesn’t get a timely response from the health endpoint (API /v1/health), which causes the Pod to restart the container (CrashLoopBackOff). We tried to mitigate this with a TCP socket probe, which keeps the Pod from being restarted constantly.
- In addition, Prometheus gets timeouts on the metrics endpoint (API /metrics) and sometimes cannot complete the target scrape run. This causes gaps in our metrics, i.e. missing metric data points (values).
Steps to Reproduce the Problem
Deploy the (latest) version of Promitor and increase the total number of metrics. In our case that is 500+ metrics across multiple Azure resource groups.
Component
Scraper
Version
2.5
Configuration
Note that this is an ArgoCD manifest with our Helm values embedded in it. It is the deployment for the vms Pod, but we see the same behavior with other Azure resources, for example after we extended our PostgreSQL targets.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: promitor-agent-scraper-vms
  namespace: prometheus
spec:
  destination:
    namespace: prometheus
    server: https://kubernetes.default.svc
  project: default
  source:
    path: stable/promitor-agent-scraper
    repoURL: https://github.com/example/charts.git
    targetRevision: HEAD
    helm:
      values: |
        nameOverride: promitor-agent-scraper-outsystems
        azureMetadata:
          tenantId: abcd123-1234-1234-abcd-123412341234abcd
          subscriptionId: abcd123-1234-1234-abcd-123412341234abcd
          resourceGroupName: Example-Rsg-testvms-01
        azureAuthentication:
          identity:
            id: abcd123-1234-1234-abcd-123412341234abcd
        resources:
          limits:
            cpu: 1
            memory: 1Gi
          requests:
            cpu: 200m
            memory: 128Mi
        resourceDiscovery:
          enabled: true
          host: promitor-agent-resource-discovery
          port: 8889
        metricDefaults:
          aggregation:
            interval: 00:05:00
          scraping:
            # Every 5 minutes
            schedule: "*/5 * * * *"
        secrets:
          createSecret: false
          secretName: "promitor-agent-scraper"
          appIdSecret: azure-app-id
          appKeySecret: azure-app-key
        telemetry:
          defaultLogLevel: trace
          containerLogs:
            isEnabled: true
            verbosity: trace
        metrics:
          - name: azure_sql_database_allocated_data_storage
            description: "Data space allocated"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: allocated_data_storage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_blocked_by_firewall
            description: "Blocked by Firewall"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: blocked_by_firewall
              aggregation:
                type: Total
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_connection_failed
            description: "Failed Connections"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: connection_failed
              aggregation:
                type: Total
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_connection_successful
            description: "Successful Connections"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: connection_successful
              aggregation:
                type: Total
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_cpu_percent
            description: "CPU percentage"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: cpu_percent
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_deadlock
            description: "Deadlocks"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: deadlock
              aggregation:
                type: Total
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_dtu_consumption_percent
            description: "DTU percentage"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: dtu_consumption_percent
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_dtu_limit
            description: "DTU Limit"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: dtu_limit
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_dtu_used
            description: "DTU used"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: dtu_used
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_log_write_percent
            description: "Log IO percentage"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: log_write_percent
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_physical_data_read_percent
            description: "Data IO percentage"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: physical_data_read_percent
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_sessions_percent
            description: "Sessions percentage"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: sessions_percent
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_sqlserver_process_memory_percent
            description: "SQL Server process memory percent"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: sqlserver_process_memory_percent
              aggregation:
                type: Maximum
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_storage
            description: "Data space used"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: storage
              aggregation:
                type: Maximum
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_storage_percent
            description: "Data space used percent"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: storage_percent
              aggregation:
                type: Maximum
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_tempdb_data_size
            description: "Tempdb Data File Size Kilobytes"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: tempdb_data_size
              aggregation:
                type: Maximum
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_tempdb_log_size
            description: "Tempdb Log File Size Kilobytes"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: tempdb_log_size
              aggregation:
                type: Maximum
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_tempdb_log_used_percent
            description: "Tempdb Percent Log Used"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: tempdb_log_used_percent
              aggregation:
                type: Maximum
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_workers_percent
            description: "Workers percentage"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: workers_percent
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_sql_database_xtp_storage_percent
            description: "In-Memory OLTP storage percent"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: xtp_storage_percent
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: mssqldb
          - name: azure_vm_available_memory_bytes
            description: "Available Memory Bytes (Preview)"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Available Memory Bytes
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_cpu_credits_consumed
            description: "CPU Credits Consumed"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: CPU Credits Consumed
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_cpu_credits_remaining
            description: "CPU Credits Remaining"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: CPU Credits Remaining
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_data_disk_bandwidth_consumed_percentage
            description: "Data Disk Bandwidth Consumed Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Bandwidth Consumed Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_data_disk_iops_consumed_percentage
            description: "Data Disk IOPS Consumed Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk IOPS Consumed Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_data_disk_max_burst_bandwidth
            description: "Data Disk Max Burst Bandwidth"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Max Burst Bandwidth
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_data_disk_max_burst_iops
            description: "Data Disk Max Burst IOPS"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Max Burst IOPS
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_data_disk_queue_depth
            description: "Data Disk Queue Depth"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Queue Depth
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_data_disk_read_bytes_sec
            description: "Data Disk Read Bytes/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Read Bytes/sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_data_disk_read_operations_sec
            description: "Data Disk Read Operations/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Read Operations/Sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_data_disk_target_bandwidth
            description: "Data Disk Target Bandwidth"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Target Bandwidth
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_data_disk_target_iops
            description: "Data Disk Target IOPS"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Target IOPS
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_data_disk_used_burst_bps_credits_percentage
            description: "Data Disk Used Burst BPS Credits Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Used Burst BPS Credits Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_data_disk_used_burst_io_credits_percentage
            description: "Data Disk Used Burst IO Credits Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Used Burst IO Credits Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_data_disk_write_bytes_sec
            description: "Data Disk Write Bytes/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Write Bytes/sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_data_disk_write_operations_sec
            description: "Data Disk Write Operations/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Write Operations/Sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_disk_read_bytes
            description: "Disk Read Bytes"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Disk Read Bytes
              aggregation:
                type: Total
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_disk_read_operations_sec
            description: "Disk Read Operations/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Disk Read Operations/Sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_disk_write_bytes
            description: "Disk Write Bytes"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Disk Write Bytes
              aggregation:
                type: Total
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_disk_write_operations_sec
            description: "Disk Write Operations/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Disk Write Operations/Sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_inbound_flows
            description: "Inbound Flows"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Inbound Flows
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_inbound_flows_maximum_creation_rate
            description: "Inbound Flows Maximum Creation Rate"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Inbound Flows Maximum Creation Rate
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_network_in
            description: "Network In Billable (Deprecated)"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Network In
              aggregation:
                type: Total
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_network_in_total
            description: "Network In Total"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Network In Total
              aggregation:
                type: Total
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_network_out
            description: "Network Out Billable (Deprecated)"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Network Out
              aggregation:
                type: Total
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_network_out_total
            description: "Network Out Total"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Network Out Total
              aggregation:
                type: Total
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_os_disk_bandwidth_consumed_percentage
            description: "OS Disk Bandwidth Consumed Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Bandwidth Consumed Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_os_disk_iops_consumed_percentage
            description: "OS Disk IOPS Consumed Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk IOPS Consumed Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_os_disk_max_burst_bandwidth
            description: "OS Disk Max Burst Bandwidth"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Max Burst Bandwidth
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_os_disk_max_burst_iops
            description: "OS Disk Max Burst IOPS"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Max Burst IOPS
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_os_disk_queue_depth
            description: "OS Disk Queue Depth"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Queue Depth
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_os_disk_read_bytes_sec
            description: "OS Disk Read Bytes/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Read Bytes/sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_os_disk_read_operations_sec
            description: "OS Disk Read Operations/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Read Operations/Sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_os_disk_target_bandwidth
            description: "OS Disk Target Bandwidth"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Target Bandwidth
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_os_disk_target_iops
            description: "OS Disk Target IOPS"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Target IOPS
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_os_disk_used_burst_bps_credits_percentage
            description: "OS Disk Used Burst BPS Credits Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Used Burst BPS Credits Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_os_disk_used_burst_io_credits_percentage
            description: "OS Disk Used Burst IO Credits Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Used Burst IO Credits Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_os_disk_write_bytes_sec
            description: "OS Disk Write Bytes/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Write Bytes/sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_os_disk_write_operations_sec
            description: "OS Disk Write Operations/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Write Operations/Sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_outbound_flows
            description: "Outbound Flows"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Outbound Flows
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_outbound_flows_maximum_creation_rate
            description: "Outbound Flows Maximum Creation Rate"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Outbound Flows Maximum Creation Rate
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_percentage_cpu
            description: "Percentage CPU"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Percentage CPU
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_premium_data_disk_cache_read_hit
            description: "Premium Data Disk Cache Read Hit"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Premium Data Disk Cache Read Hit
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_premium_data_disk_cache_read_miss
            description: "Premium Data Disk Cache Read Miss"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Premium Data Disk Cache Read Miss
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_premium_os_disk_cache_read_hit
            description: "Premium OS Disk Cache Read Hit"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Premium OS Disk Cache Read Hit
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_premium_os_disk_cache_read_miss
            description: "Premium OS Disk Cache Read Miss"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Premium OS Disk Cache Read Miss
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_cached_bandwidth_consumed_percentage
            description: "VM Cached Bandwidth Consumed Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: VM Cached Bandwidth Consumed Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_cached_iops_consumed_percentage
            description: "VM Cached IOPS Consumed Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: VM Cached IOPS Consumed Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_uncached_bandwidth_consumed_percentage
            description: "VM Uncached Bandwidth Consumed Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: VM Uncached Bandwidth Consumed Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
          - name: azure_vm_uncached_iops_consumed_percentage
            description: "VM Uncached IOPS Consumed Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: VM Uncached IOPS Consumed Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
              - name: vms
  syncPolicy:
    automated:
      prune: true
      selfHeal: false
### Logs
The logs were sent separately. No FTL log entries were noticed; functionally everything works.
### Platform
Microsoft Azure
### Contact Details
a.vanwijnbergen@fullstaq.com

Please note that the issue is caused by a lack of control over parallelism in the Promitor scraping routines. The process management code in Promitor creates a task for each metric of each resource and then starts them all at once on every iteration of the polling interval. This design probably needs to be revised to maintain a queue of unprocessed work that a defined number of threads pull from. As it stands, the design exceeds the CPU and memory limits of modest virtual machines once hundreds of metrics are in scope for scraping. That seems like an arbitrarily low limit, given that enterprise-scale hosted solutions regularly require monitoring thousands of metrics, and that simply calling the underlying APIs shouldn’t require that much processing power.
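To illustrate the kind of revision described above, here is a minimal sketch in Python of a work queue drained by a fixed number of workers. It is not Promitor's code (Promitor is a .NET application); `scrape`, `scrape_all`, and `max_workers` are hypothetical names chosen for this example, and the Azure call is replaced by a stand-in coroutine.

```python
import asyncio


async def scrape(metric: str, resource: str) -> tuple[str, str]:
    """Hypothetical stand-in for one Azure Monitor call for one metric of one resource."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return (metric, resource)


async def scrape_all(metrics, resources, max_workers=4, scrape_fn=scrape):
    """Run one scrape per (metric, resource) pair with at most max_workers in flight."""
    # Queue all work up front instead of starting every task at once.
    queue: asyncio.Queue = asyncio.Queue()
    for metric in metrics:
        for resource in resources:
            queue.put_nowait((metric, resource))

    results = []

    async def worker():
        # Each worker pulls the next unprocessed item until the queue is empty.
        while True:
            try:
                metric, resource = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            results.append(await scrape_fn(metric, resource))

    # Only max_workers coroutines exist, so concurrency is bounded by design.
    await asyncio.gather(*(worker() for _ in range(max_workers)))
    return results
```

With this shape, adding more metrics lengthens the queue rather than raising the number of simultaneous tasks. For example, `asyncio.run(scrape_all(["cpu_percent", "storage"], ["db-1", "db-2", "db-3"], max_workers=2))` processes six pairs with never more than two scrapes running at the same time.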
For reference, I’ve replaced the previous PR with this one, which now seems to be passing and should address all the previously requested changes: https://github.com/tomkerkhove/promitor/pull/2050