Possible confusing logging on ProxyOperationManager.cs
Usually, when we have an issue with the coverlet msbuild or .NET tool driver, we enable logging to check whether the vstest platform killed the process after the famous 100ms. However, the logging seems to be wrong: the timeout log is emitted even in the case of a clean shutdown.
I think the log emitted at line 228 should be emitted only if the testHostExited event was not signaled.
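The proposed behavior can be sketched roughly as below. This is a minimal illustration in Python, not the actual C# from ProxyOperationManager.cs; the function name `wait_for_host_exit`, the `test_host_exited` parameter, and the message text are hypothetical and only mirror the idea described above.

```python
import threading

def wait_for_host_exit(test_host_exited: threading.Event, timeout_s: float, log: list) -> bool:
    """Wait for the exit signal and log a timeout warning only if it never arrived."""
    # Event.wait returns True if the event was set before the timeout elapsed.
    exited = test_host_exited.wait(timeout_s)
    if not exited:
        # Only this branch represents a real timeout; a clean shutdown logs nothing.
        log.append(
            f"Test host did not exit within {timeout_s * 1000:.0f} ms; it will be terminated."
        )
    return exited
```

With this shape, a clean shutdown (event already signaled) produces no timeout message, so the message in the diagnostic log would reliably indicate that the host was killed.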
Issue Analytics
- State:
- Created 3 years ago
- Reactions:1
- Comments:14 (14 by maintainers)
I just wasted four days and found all sorts of issues just because of this 100ms timeout.
@nohwnd I want to dig a bit deeper into your reservations around increasing the timeout to 30 seconds. In the success case, when the process finishes quickly enough, no problem, we’re not waiting long.
But suppose the process is doing some expensive teardown, or there’s just a fluke (like hard drive/antivirus delay, or random noise from other processes on the machine). Suppose the short (100ms, or 1s) timeout expires. Let’s look carefully at what happens. We kill the process. The message that we killed the process is logged into the --diag file, which is off by default. The user-facing log only displays “Test host process crashed” with no further information. Suppose I even go and manage to collect a dump (which seems like an absolutely impossible endeavour at the moment). The dump would just show that the process was randomly interrupted, and not have any exception or any clue that would help. So users would be left clueless, wasting their time, living with flaky CI.
I’m actually investigating this for a partner team who have been living with flaky CI: about half their CI builds fail randomly, and the poor people are just used to requeuing and hoping it works next time. Nobody knows how to investigate this. Honestly, in light of days or weeks of lost productivity of people worldwide and flaky CI builds, saving on this 100ms timeout looks really strange from a business perspective.
I see absolutely no harm in increasing the timeout to 30 seconds. It will make the worst case successful, and won’t impact the regular success case. Remember that books are written about timeouts in computers, and how you can never rely on 100ms actually being 100ms. Your entire process may be paused by the OS for any amount of time randomly. Heck, a random GC can easily create a pause longer than that.
Additionally, this bug should track better user-facing logging in case that timeout does expire. See related: https://github.com/microsoft/vstest/issues/2952
It should be immediately obvious to the user that what happened isn’t that a particular test failed, but we timed out waiting for the host process to terminate.
Actually the best thing would be to use procdump in that case after the timeout, forcibly grab a dump of the process and make sure it’s published as an AzDO artifact, so users can see what the process was doing that caused a 30 seconds timeout.
At the risk of repeating myself, I can’t underscore enough how important it is that we make this set of improvements to the dotnet test experience on CI. It will result in more stable pipelines, remove the flakiness and noise and make failures easier to diagnose. It’s hard to measure the amount of pain the current unfortunate constellation of defects is causing to engineers using dotnet test in CI.
@nohwnd if you feel like any of this work doesn’t fit in your current schedule, please ping me and I will ensure this gets escalated to whatever level is necessary to prioritize this as an absolute top priority.
@nohwnd @AbhitejJohn I also have another proposal: can we increase the timeout to 30 seconds? If the timeout is there to prevent somebody from freezing the host process (which is legit), why not use a more “standard” timeout and give the host process a realistic amount of time to clean up? The host process usually runs user test classes; think, for instance, of a class subscribed to process exit that cleans something up and needs more than 100ms. 100ms is very little time, while 30 seconds is a standard value: if the process is blocked for more than 30 seconds, there is likely something wrong with it, and the timeout protection will still work.
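The argument that a longer timeout costs nothing in the success case rests on the timeout being only an upper bound: a wait on an exit signal returns as soon as the signal arrives. The small Python sketch below illustrates that general point; it is not vstest code, and `timed_wait` plus the simulated 50 ms host exit are assumptions made purely for the demonstration.

```python
import threading
import time

def timed_wait(exit_signal: threading.Event, timeout_s: float):
    """Wait up to timeout_s for the signal; return (signaled, seconds actually waited)."""
    start = time.monotonic()
    signaled = exit_signal.wait(timeout_s)
    return signaled, time.monotonic() - start

# Simulate a host process that finishes its cleanup after roughly 50 ms.
exit_signal = threading.Event()
threading.Timer(0.05, exit_signal.set).start()

# Even with a 30 second limit, the wait ends shortly after the signal is set.
signaled, elapsed = timed_wait(exit_signal, 30.0)
print(signaled, f"{elapsed:.2f}s")
```

The full 30 seconds would only be spent when the signal never arrives, which is exactly the case the timeout protection exists for.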