Performance degradation caused by high CPU usage when Promitor-agent-scraper has to scrape a large set of Azure targets

Report

For a year now we have been using Promitor on AKS (Helm-managed) successfully for a subset of our Azure PostgreSQL databases. That setup consists of two deployed Pods, promitor-agent-discovery and promitor-agent-scraper, covering all PostgreSQL databases. At the beginning of this month we wanted to extend it with an additional Pod for other Azure targets, such as our SQLDatabases and VirtualMachines. The main reason is that we use Prometheus Alertmanager to correlate and manage alerts, and it integrates with our notification platform. We store the metrics in Prometheus (Thanos) for a longer period.

While scaling out our Promitor implementation we observed a performance degradation that appears to be related to the number of metrics/targets being scraped. When 500+ metrics are scraped, each Pod almost constantly needs a CPU limit of more than 1 core. On top of this, everything becomes very unstable and the Prometheus metrics are constantly broken, so we sometimes miss metric data points.

The degradation shows up as the following issues:

CPU consumption gets very high, especially during the Azure scrape (collect) runs. We cap promitor-agent-scraper at 1 core, but it is always 95-99% consumed.
Because of the high CPU usage, the readiness probe does not get a timely response from the health endpoint (API /v1/health), which causes the Pod to restart the container (CrashLoopBackOff). We tried to mitigate this with a TCP socket probe, which keeps the Pod from being restarted constantly.
In addition, Prometheus gets timeouts on the metrics endpoint (API /metrics) and sometimes cannot complete the target scrape run. This causes gaps in our metrics, i.e. missing metric data points (values).

Overall, this issue makes Promitor unusable for collecting Azure metrics, since we use them for alerting. We have to be able to trust the quality and integrity of the metric data to deliver reliable alerting and notifications.

Expected Behavior

We can use Promitor as our preferred integration for getting Azure metrics into Prometheus at (enterprise) scale.

We expect overall CPU usage to become more efficient, perhaps by running more work in parallel.
One potential idea is to decouple the components into multiple containers (failure domain isolation), so that the Azure scraper, the health API and the metrics API do not impact each other.

Actual Behavior

CPU consumption gets very high, especially during the Azure scrape (collect) runs. We cap promitor-agent-scraper at 1 core, but it is always 95-99% consumed.

Because of the high CPU usage, the readiness probe does not get a timely response from the health endpoint (API /v1/health), which causes the Pod to restart the container (CrashLoopBackOff). We tried to mitigate this with a TCP socket probe, which keeps the Pod from being restarted constantly.
In addition, Prometheus gets timeouts on the metrics endpoint (API /metrics) and sometimes cannot complete the target scrape run. This causes gaps in our metrics, i.e. missing metric data points (values).

Steps to Reproduce the Problem

Deploy the (latest) version of Promitor and increase the total number of metrics. In our case, that is 500+ metrics across multiple Azure resource groups.

Component

Scraper

Version

2.5

Configuration

Note that this is an ArgoCD manifest with our Helm values embedded. This is the deployment for the vms Pod, but we see the same behavior with other Azure resources, for example after we extended our PostgreSQL targets.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: promitor-agent-scraper-vms
  namespace: prometheus
spec:
  destination:
    namespace: prometheus
    server: https://kubernetes.default.svc
  project: default
  source:
    path: stable/promitor-agent-scraper
    repoURL: https://github.com/example/charts.git
    targetRevision: HEAD
    helm:
      values: |
        nameOverride: promitor-agent-scraper-outsystems
        azureMetadata:
          tenantId: abcd123-1234-1234-abcd-123412341234abcd 
          subscriptionId: abcd123-1234-1234-abcd-123412341234abcd 
          resourceGroupName: Example-Rsg-testvms-01
        azureAuthentication:
          identity:
            id: abcd123-1234-1234-abcd-123412341234abcd 
        resources:
          limits:
            cpu: 1
            memory: 1Gi
          requests:
            cpu: 200m
            memory: 128Mi
        resourceDiscovery:
          enabled: true 
          host: promitor-agent-resource-discovery
          port: 8889 
        
        metricDefaults:
          aggregation:
            interval: 00:05:00
          scraping:
            # Every 5 minutes
            schedule: "*/5 * * * *"
        
        secrets:
          createSecret: false
          secretName: "promitor-agent-scraper"
          appIdSecret: azure-app-id
          appKeySecret: azure-app-key
        telemetry:
          defaultLogLevel: trace
          containerLogs:
            isEnabled: true
            verbosity: trace
        
        metrics:  
          - name: azure_sql_database_allocated_data_storage
            description: "Data space allocated"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: allocated_data_storage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_blocked_by_firewall
            description: "Blocked by Firewall"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: blocked_by_firewall
              aggregation:
                type: Total
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_connection_failed
            description: "Failed Connections"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: connection_failed
              aggregation:
                type: Total
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_connection_successful
            description: "Successful Connections"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: connection_successful
              aggregation:
                type: Total
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_cpu_percent
            description: "CPU percentage"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: cpu_percent
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_deadlock
            description: "Deadlocks"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: deadlock
              aggregation:
                type: Total
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_dtu_consumption_percent
            description: "DTU percentage"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: dtu_consumption_percent
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_dtu_limit
            description: "DTU Limit"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: dtu_limit
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_dtu_used
            description: "DTU used"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: dtu_used
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_log_write_percent
            description: "Log IO percentage"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: log_write_percent
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_physical_data_read_percent
            description: "Data IO percentage"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: physical_data_read_percent
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_sessions_percent
            description: "Sessions percentage"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: sessions_percent
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_sqlserver_process_memory_percent
            description: "SQL Server process memory percent"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: sqlserver_process_memory_percent
              aggregation:
                type: Maximum
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_storage
            description: "Data space used"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: storage
              aggregation:
                type: Maximum
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_storage_percent
            description: "Data space used percent"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: storage_percent
              aggregation:
                type: Maximum
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_tempdb_data_size
            description: "Tempdb Data File Size Kilobytes"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: tempdb_data_size
              aggregation:
                type: Maximum
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_tempdb_log_size
            description: "Tempdb Log File Size Kilobytes"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: tempdb_log_size
              aggregation:
                type: Maximum
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_tempdb_log_used_percent
            description: "Tempdb Percent Log Used"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: tempdb_log_used_percent
              aggregation:
                type: Maximum
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_workers_percent
            description: "Workers percentage"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: workers_percent
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: mssqldb
  
          - name: azure_sql_database_xtp_storage_percent
            description: "In-Memory OLTP storage percent"
            resourceType: SqlDatabase
            azureMetricConfiguration:
              metricName: xtp_storage_percent
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: mssqldb

          - name: azure_vm_available_memory_bytes
            description: "Available Memory Bytes (Preview)"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Available Memory Bytes
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_cpu_credits_consumed
            description: "CPU Credits Consumed"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: CPU Credits Consumed
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_cpu_credits_remaining
            description: "CPU Credits Remaining"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: CPU Credits Remaining
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_data_disk_bandwidth_consumed_percentage
            description: "Data Disk Bandwidth Consumed Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Bandwidth Consumed Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_data_disk_iops_consumed_percentage
            description: "Data Disk IOPS Consumed Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk IOPS Consumed Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_data_disk_max_burst_bandwidth
            description: "Data Disk Max Burst Bandwidth"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Max Burst Bandwidth
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_data_disk_max_burst_iops
            description: "Data Disk Max Burst IOPS"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Max Burst IOPS
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_data_disk_queue_depth
            description: "Data Disk Queue Depth"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Queue Depth
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_data_disk_read_bytes_sec
            description: "Data Disk Read Bytes/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Read Bytes/sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_data_disk_read_operations_sec
            description: "Data Disk Read Operations/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Read Operations/Sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_data_disk_target_bandwidth
            description: "Data Disk Target Bandwidth"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Target Bandwidth
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_data_disk_target_iops
            description: "Data Disk Target IOPS"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Target IOPS
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_data_disk_used_burst_bps_credits_percentage
            description: "Data Disk Used Burst BPS Credits Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Used Burst BPS Credits Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_data_disk_used_burst_io_credits_percentage
            description: "Data Disk Used Burst IO Credits Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Used Burst IO Credits Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_data_disk_write_bytes_sec
            description: "Data Disk Write Bytes/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Write Bytes/sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_data_disk_write_operations_sec
            description: "Data Disk Write Operations/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Data Disk Write Operations/Sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_disk_read_bytes
            description: "Disk Read Bytes"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Disk Read Bytes
              aggregation:
                type: Total
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_disk_read_operations_sec
            description: "Disk Read Operations/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Disk Read Operations/Sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_disk_write_bytes
            description: "Disk Write Bytes"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Disk Write Bytes
              aggregation:
                type: Total
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_disk_write_operations_sec
            description: "Disk Write Operations/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Disk Write Operations/Sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_inbound_flows
            description: "Inbound Flows"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Inbound Flows
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_inbound_flows_maximum_creation_rate
            description: "Inbound Flows Maximum Creation Rate"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Inbound Flows Maximum Creation Rate
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_network_in
            description: "Network In Billable (Deprecated)"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Network In
              aggregation:
                type: Total
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_network_in_total
            description: "Network In Total"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Network In Total
              aggregation:
                type: Total
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_network_out
            description: "Network Out Billable (Deprecated)"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Network Out
              aggregation:
                type: Total
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_network_out_total
            description: "Network Out Total"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Network Out Total
              aggregation:
                type: Total
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_os_disk_bandwidth_consumed_percentage
            description: "OS Disk Bandwidth Consumed Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Bandwidth Consumed Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_os_disk_iops_consumed_percentage
            description: "OS Disk IOPS Consumed Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk IOPS Consumed Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_os_disk_max_burst_bandwidth
            description: "OS Disk Max Burst Bandwidth"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Max Burst Bandwidth
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_os_disk_max_burst_iops
            description: "OS Disk Max Burst IOPS"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Max Burst IOPS
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_os_disk_queue_depth
            description: "OS Disk Queue Depth"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Queue Depth
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_os_disk_read_bytes_sec
            description: "OS Disk Read Bytes/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Read Bytes/sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_os_disk_read_operations_sec
            description: "OS Disk Read Operations/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Read Operations/Sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_os_disk_target_bandwidth
            description: "OS Disk Target Bandwidth"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Target Bandwidth
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_os_disk_target_iops
            description: "OS Disk Target IOPS"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Target IOPS
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_os_disk_used_burst_bps_credits_percentage
            description: "OS Disk Used Burst BPS Credits Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Used Burst BPS Credits Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_os_disk_used_burst_io_credits_percentage
            description: "OS Disk Used Burst IO Credits Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Used Burst IO Credits Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_os_disk_write_bytes_sec
            description: "OS Disk Write Bytes/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Write Bytes/sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_os_disk_write_operations_sec
            description: "OS Disk Write Operations/Sec"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: OS Disk Write Operations/Sec
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_outbound_flows
            description: "Outbound Flows"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Outbound Flows
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_outbound_flows_maximum_creation_rate
            description: "Outbound Flows Maximum Creation Rate"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Outbound Flows Maximum Creation Rate
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_percentage_cpu
            description: "Percentage CPU"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Percentage CPU
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_premium_data_disk_cache_read_hit
            description: "Premium Data Disk Cache Read Hit"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Premium Data Disk Cache Read Hit
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_premium_data_disk_cache_read_miss
            description: "Premium Data Disk Cache Read Miss"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Premium Data Disk Cache Read Miss
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_premium_os_disk_cache_read_hit
            description: "Premium OS Disk Cache Read Hit"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Premium OS Disk Cache Read Hit
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_premium_os_disk_cache_read_miss
            description: "Premium OS Disk Cache Read Miss"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: Premium OS Disk Cache Read Miss
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_cached_bandwidth_consumed_percentage
            description: "VM Cached Bandwidth Consumed Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: VM Cached Bandwidth Consumed Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_cached_iops_consumed_percentage
            description: "VM Cached IOPS Consumed Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: VM Cached IOPS Consumed Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_uncached_bandwidth_consumed_percentage
            description: "VM Uncached Bandwidth Consumed Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: VM Uncached Bandwidth Consumed Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  
          - name: azure_vm_uncached_iops_consumed_percentage
            description: "VM Uncached IOPS Consumed Percentage"
            resourceType: VirtualMachine
            azureMetricConfiguration:
              metricName: VM Uncached IOPS Consumed Percentage
              aggregation:
                type: Average
            resourceDiscoveryGroups:
            - name: vms
  syncPolicy:
    automated:
      prune: true
      selfHeal: false

Logs

These were sent separately. No FTL (fatal) entries were noticed; functionally, everything works.


Platform

Microsoft Azure

Contact Details

a.vanwijnbergen@fullstaq.com

Issue Analytics

State:
Created 2 years ago
Reactions: 4
Comments: 11 (10 by maintainers)

Top GitHub Comments

4 reactions

jasonmowry commented, Apr 22, 2022

Please note that the issue is caused by a lack of control over parallelism in the Promitor scraping routines. Ultimately, the process management code in Promitor creates a task for each metric for each resource and then starts them all at once on each iteration of the polling interval. This design probably needs to be revised to maintain a queue of unprocessed work for some defined number of threads to pull from. As it stands, the design exceeds the CPU and memory limits of modest virtual machines once hundreds of metrics are in scope for scraping. This seems like an arbitrarily low limit, given that clusters can regularly require monitoring thousands of metrics on enterprise-scale hosted solutions, and the actual processing required to simply interface with the underlying APIs in question shouldn’t require that much processing power.
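The queue-and-workers design suggested above can be illustrated with a minimal sketch. This is not Promitor's code (Promitor is a .NET application); it is a Python illustration of the pattern only. The names `scrape_one`, `scrape_all` and `max_workers` are hypothetical, and the Azure Monitor call is replaced by a short sleep.

```python
import asyncio


async def scrape_one(resource: str, metric: str):
    """Hypothetical stand-in for one Azure Monitor query (network I/O in a real scraper)."""
    await asyncio.sleep(0.01)
    return resource, metric, 0.0


async def scrape_all(resources, metrics, max_workers=4):
    """Scrape every (resource, metric) pair using a fixed number of workers.

    Instead of starting one task per pair all at once, the pairs are put on a
    queue and at most `max_workers` of them are in flight at any time.
    Returns the results and the peak number of concurrent scrapes observed.
    """
    queue = asyncio.Queue()
    for resource in resources:
        for metric in metrics:
            queue.put_nowait((resource, metric))

    results = []
    in_flight = 0
    peak = 0

    async def worker():
        nonlocal in_flight, peak
        while True:
            try:
                resource, metric = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # no work left, this worker is done
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                results.append(await scrape_one(resource, metric))
            finally:
                in_flight -= 1

    await asyncio.gather(*(worker() for _ in range(max_workers)))
    return results, peak


if __name__ == "__main__":
    vms = [f"vm-{i}" for i in range(10)]
    names = ["Percentage CPU", "Network In Total", "Network Out Total"]
    scraped, peak_concurrency = asyncio.run(scrape_all(vms, names, max_workers=4))
    print(len(scraped), "scrapes completed, peak concurrency", peak_concurrency)
```

With this shape, the number of simultaneous scrapes is bounded by `max_workers` regardless of how many metrics and resources are configured, so the cost per polling iteration grows in duration rather than in peak CPU and memory.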

1 reaction

jasonmowry commented, Jun 17, 2022

For reference, I’ve replaced the previous PR with this one, which now seems to be passing and should satisfy all the previously requested changes: https://github.com/tomkerkhove/promitor/pull/2050