java.util.concurrent.TimeoutException thrown at random netty read timeouts with RemoteWebDriver
See original GitHub issue🐛 Bug Report
Netty at random times gets a read timeout at. This happens at different selenium commands ( for example: WebDriver.switchTo().defaultContent, WebElement.click, WebDriver.switchTo().window, WebElement.sendKeys, WebDriver.get, Alert.accept ) and at random in a quite small percentage chance (<1% test cases).
To Reproduce
I don’t have specific steps to reproduce. When our CI runs our test suite of thousands of tests run, about 10 fails at random due to this timeout. I could not reproduce by doing a simple long loop with a few commands on my development workstation.
Timeout details
This timeout always occurs at:
Caused by: java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
at org.asynchttpclient.netty.NettyResponseFuture.get(NettyResponseFuture.java:206)
at org.openqa.selenium.remote.http.netty.NettyHttpHandler.makeCall(NettyHttpHandler.java:65)
I could confirm that it took 3 minutes there, confirming that it is due to the default 3 minutes read timeout the selenium configures the netty with. But the commands that are timing outs would normally run very fast, much less than one second.
Trying the code below in a method called probably thousands times by my test suite, it failed entering the catch. But after it called again driver.switchTo().defaultContent() at the end of the code below it worked. So it seems that although the read timeout happens in netty, it still works normally afterwards.
try
{
driver.switchTo().defaultContent();
}
catch (TimeoutException e)
{
// this should never happen, but started happening at random after updating to selenium 4
// output information to help troubleshoot
System.err.println("TimeoutException thrown while trying to go to defaultContent (stack below). Trying again...");
e.printStackTrace();
try
{
Thread.sleep(5000);
}
catch (InterruptedException e1)
{
}
driver.switchTo().defaultContent();
}
In this case, the stack trace got by the e.printStackTrace()
above was:
org.openqa.selenium.TimeoutException: java.util.concurrent.TimeoutException
Build info: version: '4.0.0-beta-3', revision: '5d108f9a67'
System info: host: '51e5404d333b', ip: '172.18.0.7', os.name: 'Linux', os.arch: 'amd64', os.version: '3.10.0-1127.19.1.el7.x86_64', java.version: '11.0.1'
Driver info: org.openqa.selenium.remote.RemoteWebDriver
Command: [a5e3bf25-ba72-4023-b219-76406cf58660, switchToFrame {id=null}]
Capabilities {acceptInsecureCerts: true, browserName: firefox, browserVersion: 88.0, javascriptEnabled: true, moz:accessibilityChecks: false, moz:buildID: 20210415204500, moz:debuggerAddress: localhost:46562, moz:geckodriverVersion: 0.29.0, moz:headless: false, moz:processID: 9286, moz:profile: /tmp/rust_mozprofileQJRwQP, moz:shutdownTimeout: 60000, moz:useNonSpecCompliantPointerOrigin: false, moz:webdriverClick: true, pageLoadStrategy: normal, platform: LINUX, platformName: LINUX, platformVersion: 3.10.0-1127.19.1.el7.x86_64, rotatable: false, se:cdp: ws://172.18.0.3:4444/sessio..., se:cdpVersion: 85, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify}
Session ID: a5e3bf25-ba72-4023-b219-76406cf58660
at org.openqa.selenium.remote.http.netty.NettyHttpHandler.makeCall(NettyHttpHandler.java:71)
at org.openqa.selenium.remote.http.AddSeleniumUserAgent.lambda$apply$0(AddSeleniumUserAgent.java:42)
at org.openqa.selenium.remote.http.Filter.lambda$andFinally$1(Filter.java:56)
at org.openqa.selenium.remote.http.netty.NettyHttpHandler.execute(NettyHttpHandler.java:51)
at org.openqa.selenium.remote.http.AddSeleniumUserAgent.lambda$apply$0(AddSeleniumUserAgent.java:42)
at org.openqa.selenium.remote.http.Filter.lambda$andFinally$1(Filter.java:56)
at org.openqa.selenium.remote.http.netty.NettyClient.execute(NettyClient.java:103)
at org.openqa.selenium.remote.tracing.TracedHttpClient.execute(TracedHttpClient.java:55)
at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:181)
at org.openqa.selenium.remote.TracedCommandExecutor.execute(TracedCommandExecutor.java:39)
at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:619)
at org.openqa.selenium.remote.RemoteWebDriver$RemoteTargetLocator.defaultContent(RemoteWebDriver.java:1097)
(...)
Caused by: java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
at org.asynchttpclient.netty.NettyResponseFuture.get(NettyResponseFuture.java:206)
at org.openqa.selenium.remote.http.netty.NettyHttpHandler.makeCall(NettyHttpHandler.java:65)
... 38 more
Environment
OS: Docker containers inside a CentOS Browser: RemoteWebDriver using Firefox in selenium/standalone-firefox:4.0.0-beta-3-20210426 docker image. Also tried the selenium/standalone-firefox:4.0.0-beta-4-prerelease-20210527 docker image, but the same thing happened. Browser Driver version: RemoteWebDriver from selenium-java 4.0.0-beta-3 Language Bindings version: Java 4.0.0-beta-3 The RemoteWebDriver runs in a container that is running in the same docker host as the browser container. So all network between them is only logical in the same machine. Previously we were using Selenium 2.52, in the same docker host, and never happened anything similar to such timeout.
Do you have any tips about what I can try to fix it or investigate more about this?
Issue Analytics
- State:
- Created 2 years ago
- Reactions:18
- Comments:164 (50 by maintainers)
Top GitHub Comments
Thank you @Cybermaxke for the workaround. This has been included already by @pujagani through #10220.
We are planning a 4.1.2 release for next week (which includes this fix).
@JulienBreton I was checking the attached log and I am confused about how the test is sending those commands. I followed the command for session
f24ae2d1ba0f0d398cdaa093ef912389
…POST '/session/f24ae2d1ba0f0d398cdaa093ef912389/timeouts'
which does not seem to get a response.POST /session/f24ae2d1ba0f0d398cdaa093ef912389/timeouts HTTP/1.1
which does not seem to get a response.DELETE /session/f24ae2d1ba0f0d398cdaa093ef912389 HTTP/1.1
which gets completed at 12:48:04.062DELETE /session/f24ae2d1ba0f0d398cdaa093ef912389 HTTP/1.1
which gets completed at 12:48:03.995.But then…
http://172.18.0.4:5555/session/f24ae2d1ba0f0d398cdaa093ef912389/timeouts
.POST /session/f24ae2d1ba0f0d398cdaa093ef912389/timeouts HTTP/1.1
It seems our HTTP client is doing some extra retries we were not aware of (
NettyRequestSender.retry
) which I will deactivate in a moment. In this case, those automatic retries are happening after the session has been deleted. And yes, we need to switch to another HTTP client soon.Deactivating those automatic retries should reduce or remove the amount of timeouts happening, and the let us rely on the configured
session-timeout
plus the retries that only happen between Hub and Node. I think we will do a new release in less than 1 week.