Regression in CPU utilisation with ASP.NET Core 2.2.0 web application
Describe the bug
We have an ASP.NET Core web application that uses MVC to render Razor views. It was recently updated from 2.1.6 to 2.2.0 and is now showing semi-regular spikes in CPU utilisation, where usage can increase by up to 4x for a short period before returning to baseline.
These spikes appear to occur semi-regularly, but inconsistently in time and across instances in a load-balanced fleet.
A series of screenshots that illustrate these CPU behaviours is included below.
The code and infrastructure changes between the two releases were to update all relevant NuGet package versions to 2.2.0 and to use CompatibilityVersion.Version_2_2, as well as installing the 2.2.0 runtime and Windows hosting pack.
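For readers unfamiliar with the compatibility switch mentioned above, the following is a minimal sketch of how CompatibilityVersion.Version_2_2 is typically enabled in an ASP.NET Core 2.2 MVC application. It assumes a conventional Startup class with default routing; the application's real Startup is not shown in this report and will differ.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical Startup class for illustration only.
public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Opt in to the MVC behaviours introduced up to ASP.NET Core 2.2.
        services.AddMvc()
            .SetCompatibilityVersion(CompatibilityVersion.Version_2_2);
    }

    public void Configure(IApplicationBuilder app)
    {
        // Assumed default routing; the real application may configure more middleware.
        app.UseMvcWithDefaultRoute();
    }
}
```

Reverting only this setting to CompatibilityVersion.Version_2_1 while staying on the 2.2.0 packages could be one way to narrow down whether the opt-in behaviours contribute to the spikes, though that has not been tested here.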
The only other code change made was to use IHttpMessageHandlerFactory in a code path that is not in use for the environment configuration where we are observing this issue.
The application uses IIS out-of-process hosting (rather than the new default in-process mode) out of an abundance of caution, pending a released fix for #4398. That issue caused compatibility problems between the API this web application consumes and a separate application running ASP.NET 4.6.1 (see #4437).
In tandem with the above changes, the API that the web application consumes was also updated to ASP.NET Core 2.2.0 in the same manner. That application is not showing the same CPU utilisation changes, even though its load mirrors that of the web application, which depends on it.
This naively leads me to the conclusion that there is a regression in ASP.NET Core somewhere in code paths related to Razor views (compared to say, APIs, controllers, model-binding, routing, runtime etc.), hence posting this issue here rather than in coreclr or corefx.
To Reproduce
Render Razor views with ASP.NET Core 2.2.0 on Windows using IIS out-of-process hosting for an extended period of time (more than ~1 hour to get a good chance of observing the CPU spike).
Expected behavior
Steady-state CPU utilisation of the application using 2.2.x should be comparable with 2.1.x.
Screenshots
Below are various Grafana chart screenshots that illustrate the issue in our production environment for the web application with commentary.
CPU spiking for a single EC2 instance
This graph shows the CPU usage of a specific EC2 instance of the web application during a peak period of traffic.
Zoom on a specific spike of the single EC2 instance
This graph zooms in on a specific spike from the chart above to show it at a higher fidelity.
CPU spiking across EC2 fleet during the same time window
This graph shows the CPU usage of all the EC2 instances of the web application during the same peak period of traffic as the chart above.
Average CPU across the EC2 fleet during the same window (green line) with a -7 day comparison (yellow line)
This graph shows the average CPU usage of all the EC2 instances of the web application during the same peak period of traffic as the charts above.
Average CPU across the fleet for the last 7 days
This graph shows the average CPU usage of all the EC2 instances of the web application during the last seven days, which is the green line. The yellow line shows the same metric for 7 days previously.
The vertical red lines indicate code deployments. The second red line from the left indicates where the 2.2.0 version of the application was deployed. Later lines are business-as-usual code deployments made after the upgrade.
Note: The largest spikes come from CPU utilisation when new EC2 instances come into service from CPU-based auto-scaling and the application is installed onto the fresh instances. These spikes are expected and are separate from the issue being described here.
CPU utilisation per EC2 instance for the last 7 days
This graph is the same as the above, except for all EC2 instances, rather than the overall average of the fleet.
Average CPU across the fleet for the last 7 days for the underlying API
This graph shows the average CPU usage of all the EC2 instances of the API dependency of the web application during the last seven days, which is the green line. The yellow line shows the same metric for 7 days previously.
The vertical red lines indicate code deployments. The second red line from the left indicates where the 2.2.0 version of the application was deployed. Later lines are business-as-usual code deployments made after the upgrade.
This graph serves as a comparison baseline: it shows a different application that is just an HTTP API, also running 2.2.0, which is not affected by the same CPU spiking.
CPU utilisation per EC2 instance for the last 7 days for the underlying API
This graph is the same as the above, except for all EC2 instances of the API, rather than the overall average of the fleet.
Additional context
The application runs on AWS EC2 c5.large instances using Windows Server 2012 R2.
More overall context for the application can be found in this blog post from back when we updated it from 2.0.x to 2.1.0.
As this is an internal line-of-business application and the issue only appears to reproduce under load, I cannot provide a repro. However, we're happy to privately provide any telemetry, traces, dumps etc. that may help you find the root cause.
Yes.
Has the underlying fix for this been ported to 2.2 for servicing yet?