Possibly confusing logging in ProxyOperationManager.cs

Usually, when we have an issue with the coverlet msbuild or .NET tool driver, we enable logging to check whether the vstest platform killed the process after the famous 100 ms timeout. However, the logging appears to be wrong: the timeout message is emitted even on a clean shutdown.

https://github.com/microsoft/vstest/blob/d10bcbb28cc3999bcc12758a41a04b998eb9595b/src/Microsoft.TestPlatform.CrossPlatEngine/Client/ProxyOperationManager.cs#L202-L236

I think the log emitted at line 228 should be emitted only if the testHostExited event was not signaled.
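The proposed behavior could be sketched roughly as follows. This is a minimal Python illustration of the idea, not the vstest C# code: the function name `close_test_host`, the `test_host_exited` event object, and the log messages are all hypothetical. The only assumption is that the wait call reports whether the exit event was signaled before the timeout elapsed.

```python
import logging
import threading

logger = logging.getLogger("proxy_operation_manager_sketch")  # hypothetical logger name


def close_test_host(test_host_exited: threading.Event, timeout_s: float = 0.1) -> bool:
    """Wait for the test host to exit and warn only if the wait timed out.

    Returns True when the host exited on its own, False when the timeout expired.
    """
    # Event.wait returns True if the event was set before the timeout elapsed.
    exited_cleanly = test_host_exited.wait(timeout=timeout_s)
    if exited_cleanly:
        logger.info("Test host exited cleanly.")
        return True
    # Only this branch emits the timeout message (hypothetical wording).
    logger.warning(
        "Test host did not exit within %.0f ms; terminating it.", timeout_s * 1000
    )
    return False
```

With this shape, a clean shutdown never produces the timeout message, so seeing that message in the diagnostic log reliably means the host was killed.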

Issue Analytics

Created: 3 years ago
Reactions: 1
Comments: 14 (14 by maintainers)

Top GitHub Comments

3 reactions

KirillOsenkov commented, Jun 30, 2021

I just wasted four days and found all sorts of issues just because of this 100ms timeout.

@nohwnd I want to dig a bit deeper into your reservations around increasing the timeout to 30 seconds. In the success case, when the process finishes quickly enough, no problem, we’re not waiting long.

But suppose the process is doing some expensive teardown, or there’s just a fluke (like a hard drive or antivirus delay, or random noise from other processes on the machine). Suppose the short (100 ms, or 1 s) timeout expires. Let’s look carefully at what happens. We kill the process. The message that we killed the process is logged to the --diag file, which is off by default. The user-facing log only displays “Test host process crashed” with no further information. Suppose I even manage to collect a dump (which seems like an absolutely impossible endeavour at the moment). The dump would just show that the process was randomly interrupted, with no exception or any other clue that would help. So users are left clueless, wasting their time and living with flaky CI. I’m actually investigating this for a partner team who have been living with flaky CI: about half their CI builds fail randomly, and the poor people are just used to requeuing and hoping it works next time. Nobody knows how to investigate this. Honestly, in light of the days or weeks of lost productivity for people worldwide and the flaky CI builds, saving on this 100 ms timeout looks really strange from a business perspective.

I see absolutely no harm in increasing the timeout to 30 seconds. It will make the worst case succeed, and won’t impact the regular success case. Remember that books are written about timeouts in computers, and how you can never rely on 100 ms actually being 100 ms. Your entire process may be paused by the OS for any amount of time at random. Heck, a random GC can easily cause a pause longer than that.
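The claim that a longer timeout costs nothing in the success case can be illustrated with a small simulation. This is a Python sketch under stated assumptions, not vstest code: `wait_for_exit` and `simulate_host` are hypothetical helpers, and a timer stands in for a test host whose teardown takes a given amount of time.

```python
import threading
import time
from typing import Tuple


def wait_for_exit(exited: threading.Event, timeout_s: float) -> Tuple[bool, float]:
    """Return (exited_in_time, seconds_actually_waited)."""
    start = time.monotonic()
    # wait() returns as soon as the event is set; the timeout is only an upper bound.
    signaled = exited.wait(timeout=timeout_s)
    return signaled, time.monotonic() - start


def simulate_host(teardown_s: float, timeout_s: float) -> Tuple[bool, float]:
    """Simulate a host whose teardown takes teardown_s seconds (a timer stands in for it)."""
    exited = threading.Event()
    timer = threading.Timer(teardown_s, exited.set)
    timer.start()
    try:
        return wait_for_exit(exited, timeout_s)
    finally:
        timer.cancel()  # no-op if the timer already fired


if __name__ == "__main__":
    # A 300 ms teardown misses a 100 ms timeout but fits easily within 30 s,
    # and the 30 s wait still returns after roughly the teardown time.
    print(simulate_host(teardown_s=0.3, timeout_s=0.1))
    print(simulate_host(teardown_s=0.3, timeout_s=30.0))
```

The wait ends when the exit is signaled, so the 30-second value is only paid in full when the host really is stuck.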

Additionally, this bug should track better user-facing logging in case that timeout does expire. See related: https://github.com/microsoft/vstest/issues/2952

It should be immediately obvious to the user that what happened isn’t that a particular test failed, but we timed out waiting for the host process to terminate.

Actually, the best thing would be to use procdump after the timeout in that case: forcibly grab a dump of the process and make sure it’s published as an AzDO artifact, so users can see what the process was doing that caused a 30-second timeout.

At the risk of repeating myself, I can’t underscore enough how important it is that we make this set of improvements to the dotnet test experience on CI. It will result in more stable pipelines, remove the flakiness and noise and make failures easier to diagnose. It’s hard to measure the amount of pain the current unfortunate constellation of defects is causing to engineers using dotnet test in CI.

@nohwnd if you feel like any of this work doesn’t fit in your current schedule, please ping me and I will ensure this gets escalated to whatever level is necessary to prioritize this as an absolute top priority.

2 reactions

MarcoRossignoli commented, Apr 4, 2020

@nohwnd @AbhitejJohn I have another proposal: can we increase the timeout to 30 seconds? If the timeout exists to prevent someone from freezing the host process (which is legitimate), why not use a more “standard” timeout and give the host process a realistic amount of time to clean up? The host usually runs user test classes; think, for instance, of a class subscribed to process exit that cleans something up and needs more than 100 ms. 100 ms is very little time, while 30 seconds is a standard value: if the process is blocked for more than 30 seconds, something is likely wrong with it, and the timeout protection will still work.
