Need to reduce Kusto usage in engsrvprod
As part of investigating https://github.com/dotnet/arcade/issues/10553, I found lots of Kusto connection problems, spiking at the same times as the slow API responses. I followed up with the Kusto team, and they were fairly shocked by our usage patterns.
We need to discuss this, so I am marking it with the “Needs Triage” label for discussion this week. Also tagging @ChadNedzlek in case he wants to pile on any useful data.
Relevant IcM ticket: https://portal.microsofticm.com/imp/v3/incidents/details/329798643/home
Below is how much query time our top 11 applications consumed over the last 10 days (in Days.Hours:Minutes:Seconds.Fractional format). I am presently taking steps to significantly reduce what AutoScaleActorService does (it should be possible to get it under 2 days of query time per 10 days), but we are now regularly hitting API timeouts and being either throttled or outright rejected by Kusto for new connections.
```
.show queries
| where StartedOn > ago(10d)
| summarize sum(Duration) by Application
| order by sum_Duration desc
```
| Application | sum(Duration) |
|---|---|
| HelixAPI | 74.03:56:30.1595755 |
| AzureDevOpsTestAggregation | 40.21:06:08.2427345 |
| AutoScaleActorService | 11.01:12:21.4480691 |
| Grafana-ADX | 2.07:25:49.2973983 |
| BuildResultAnalysisProcessor | 1.17:16:38.2120386 |
| MetricsObserver | 17:04:20.0776551 |
| DependencyUpdater | 05:53:59.1044860 |
| TestDataAggregationService | 02:40:14.3751389 |
| KnownIssuesMonitor | 02:36:19.1245124 |
| Maestro.Web | 02:08:57.9728915 |
| SQLCleaner | 01:40:15.5044311 |
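As a follow-up on the throttling and rejected connections mentioned above, a breakdown by query outcome can help show how many queries are actually failing versus completing. This is only a sketch: it assumes the `State` column of `.show queries` distinguishes completed from failed queries, and connection-level rejections that never become queries will not show up in this view at all.

```
// Sketch (assumes the State column of .show queries distinguishes
// completed from failed queries; rejected connections never reach this view):
// count queries and total duration per application and outcome.
.show queries
| where StartedOn > ago(10d)
| summarize QueryCount = count(), TotalDuration = sum(Duration) by Application, State
| order by TotalDuration desc
```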
Checking out a 30-day period:
```
.show queries
| where StartedOn > ago(30d)
| summarize sum(Duration) by Application, bin(StartedOn, 1d)
| render timechart
```
… we see that HelixAPI and AzureDevOpsTestAggregation have both been trending upward over the past couple of weeks.
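To tell whether that growth comes from more queries or from slower individual queries, a variation on the query above can chart daily query counts alongside total duration. This is a sketch that reuses only the columns already shown above; the two application names in the filter are the ones called out from the table.

```
// Sketch: daily query count and total duration for the two applications
// trending upward, to separate query volume from per-query cost.
.show queries
| where StartedOn > ago(30d)
| where Application in ("HelixAPI", "AzureDevOpsTestAggregation")
| summarize QueryCount = count(), TotalDuration = sum(Duration) by Application, bin(StartedOn, 1d)
| render timechart
```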
Top GitHub Comments
This usage all fell off a cliff starting in September, so I am closing the issue. Great work, everyone who contributed an improvement!
FYI AutoScale brought the prod cluster down to its minimum of two instances on September 1 (it was 5 instances at peak).
Good job everyone on getting those Seals under control!