Need to reduce Kusto usage in engsrvprod
As part of investigating https://github.com/dotnet/arcade/issues/10553, I found lots of Kusto connection problems, spiking at the same times as the slow API responses. I followed up with the Kusto team, and they were fairly shocked by our usage patterns.
We need to discuss this, so I am marking it with the “Needs Triage” label for discussion this week. Also tagging @ChadNedzlek in case he wants to pile on any useful data.
Relevant IcM ticket: https://portal.microsofticm.com/imp/v3/incidents/details/329798643/home
Below is how much query time our top 11 applications consumed over the last 10 days (in Days.Hours:Minutes:Seconds.Fractional format). I am presently taking steps to significantly reduce what AutoScaleActorService does (it should be possible to get it under 2 days of query time per 10 days), but we are now regularly hitting API timeouts and being either throttled or outright rejected by Kusto for new connections.
```
.show queries
| where StartedOn > ago(10d)
| summarize sum(Duration) by Application
| order by sum_Duration desc
```
| Application | sum(Duration) |
|---|---|
| HelixAPI | 74.03:56:30.1595755 |
| AzureDevOpsTestAggregation | 40.21:06:08.2427345 |
| AutoScaleActorService | 11.01:12:21.4480691 |
| Grafana-ADX | 2.07:25:49.2973983 |
| BuildResultAnalysisProcessor | 1.17:16:38.2120386 |
| MetricsObserver | 17:04:20.0776551 |
| DependencyUpdater | 05:53:59.1044860 |
| TestDataAggregationService | 02:40:14.3751389 |
| KnownIssuesMonitor | 02:36:19.1245124 |
| Maestro.Web | 02:08:57.9728915 |
| SQLCleaner | 01:40:15.5044311 |
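As a follow-up on the throttling and rejected connections mentioned above, a breakdown by query outcome can help show how many queries are actually failing versus completing. This is only a sketch: it assumes the `State` column of `.show queries` distinguishes completed from failed queries, and connection-level rejections that never become queries will not show up in this view at all.

```
// Sketch (assumes the State column of .show queries distinguishes
// completed from failed queries; rejected connections never reach this view):
// count queries and total duration per application and outcome.
.show queries
| where StartedOn > ago(10d)
| summarize QueryCount = count(), TotalDuration = sum(Duration) by Application, State
| order by TotalDuration desc
```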
Checking out a 30-day period:
```
.show queries
| where StartedOn > ago(30d)
| summarize sum(Duration) by Application, bin(StartedOn, 1d)
| render timechart
```
… we see that HelixAPI and AzureDevOpsTestAggregation have both been trending upward over the past couple of weeks.
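To tell whether that growth comes from more queries or from slower individual queries, a variation on the query above can chart daily query counts alongside total duration. This is a sketch that reuses only the columns already shown above; the two application names in the filter are the ones called out from the table.

```
// Sketch: daily query count and total duration for the two applications
// trending upward, to separate query volume from per-query cost.
.show queries
| where StartedOn > ago(30d)
| where Application in ("HelixAPI", "AzureDevOpsTestAggregation")
| summarize QueryCount = count(), TotalDuration = sum(Duration) by Application, bin(StartedOn, 1d)
| render timechart
```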
Top GitHub Comments
This usage all fell off a cliff starting in September, so I am closing the issue. Great work, everyone who contributed an improvement!
FYI AutoScale brought the prod cluster down to its minimum of two instances on September 1 (it was 5 instances at peak).
Good job everyone on getting those Seals under control!