[🐛 Bug]: Selenium grid 4 hub becomes unresponsive to new sessions when Chrome does not start
What happened?
We have seen this issue regularly since we began using Selenium Grid 4 a few months ago, and only recently were we able to form a hypothesis about the cause. When Chrome fails to start on one of the nodes, the hub becomes unresponsive to new session requests. The hub continues to respond to its health checks and serves HTTP requests, but no new sessions begin on any of the nodes. We only see this behavior on our Selenium Grid that receives very frequent traffic and almost always has one or two sessions running on it.
Unfortunately, we have not been able to reproduce the issue outside of our Kubernetes cluster. We also have not been able to detect a pattern in when the issue occurs. Sometimes it happens within hours of the Selenium Grid being redeployed on the cluster, and sometimes we go days without running into it.
We are also unsure of what more can be done to debug the situation.
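Since the hub still answers health checks while no sessions start, one check is to compare the free slots the hub reports with what is waiting in the new-session queue. The sketch below is not part of the original report. It assumes the parsed Grid 4 `/status` payload exposes `value.ready` and `value.nodes[].slots[].session`, and that the queue contents (for example from `/se/grid/newsessionqueue/queue`) have already been fetched and parsed into a list. The function names and the sample data are hypothetical.

```python
def count_slots(status):
    """Return (free_slots, busy_slots) from a parsed Grid /status payload."""
    free = busy = 0
    for node in status.get("value", {}).get("nodes", []):
        for slot in node.get("slots", []):
            # Assumption: a slot without a running session has "session": null.
            if slot.get("session") is None:
                free += 1
            else:
                busy += 1
    return free, busy


def looks_stalled(status, queued_requests):
    """Heuristic: hub reports ready, slots are free, yet requests are still queued."""
    ready = bool(status.get("value", {}).get("ready", False))
    free, _ = count_slots(status)
    return ready and free > 0 and len(queued_requests) > 0


# Hypothetical sample: one node with two slots, one of them busy.
sample_status = {
    "value": {
        "ready": True,
        "nodes": [
            {"slots": [{"session": None}, {"session": {"sessionId": "abc"}}]}
        ],
    }
}
```

A single positive result is not proof of a hang, because a request can legitimately sit in the queue for a moment. Repeated positive results over a window longer than the session retry interval would be more telling.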
How can we reproduce the issue?
These are the environment variables that are supplied to the Chrome node containers:
SE_EVENT_BUS_HOST: se4-hub
SE_EVENT_BUS_SUBSCRIBE_PORT: 4443
SE_EVENT_BUS_PUBLISH_PORT: 4442
SE_NODE_MAX_SESSIONS: 2
NODE_HEARTBEAT_PERIOD: 30
NODE_OVERRIDE_MAX_SESSIONS: true
NODE_SESSION_TIMEOUT: 180
These are the environment variables supplied to the Hub:
SESSIONQUEUE_REQUEST_TIMEOUT: 120
SESSIONQUEUE_SESSION_RETRY_INTERVAL: 5
LOGGING_LOG_LEVEL: INFO
SERVER_MAX_THREADS: 64
DISTRIBUTOR_HEALTHCHECK_INTERVAL: 30
Relevant log output
2022-03-23 01:22:34.388 Starting ChromeDriver 99.0.4844.51 (d537ec02474b5afe23684e7963d538896c63ac77-refs/branch-heads/4844@{#875}) on port 52937
2022-03-23 01:25:52.427 08:25:52.426 WARN [SpanWrappedHttpHandler.execute] - Unable to execute request: java.net.ConnectException: Connection refused: localhost/127.0.0.1:52937
2022-03-23 01:25:52.427 java.io.UncheckedIOException: java.net.ConnectException: Connection refused: localhost/127.0.0.1:52937
2022-03-23 01:25:52.427 Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:52937
2022-03-23 01:25:52.427 Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:52937
2022-03-23 01:25:52.428 08:25:52.427 WARN [SeleniumSpanExporter$1.lambda$export$0] - {"traceId": "30203ffeee13a6b855c708470eac460f","eventTime": 1648023952425653053,"eventName": "exception","attributes": {"exception.message": "Unable to execute request: java.net.ConnectException: Connection refused: localhost\u002f127.0.0.1:52937","exception.stacktrace": "java.io.UncheckedIOException: java.net.ConnectException: Connection refused: localhost\u002f127.0.0.1:52937\n\tat org.openqa.selenium.remote.http.netty.NettyHttpHandler.makeCall(NettyHttpHandler.java:80)\n\tat org.openqa.selenium.remote.http.RetryRequest.lambda$apply$6(RetryRequest.java:80)\n\tat net.jodah.failsafe.Functions.lambda$get$0(Functions.java:48)\n\tat net.jodah.failsafe.RetryPolicyExecutor.lambda$supply$0(RetryPolicyExecutor.java:66)\n\tat net.jodah.failsafe.RetryPolicyExecutor.lambda$supply$0(RetryPolicyExecutor.java:66)\n\tat net.jodah.failsafe.RetryPolicyExecutor.lambda$supply$0(RetryPolicyExecutor.java:66)\n\tat net.jodah.failsafe.Execution.executeSync(Execution.java:128)\n\tat net.jodah.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:379)\n\tat net.jodah.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:68)\n\tat org.openqa.selenium.remote.http.RetryRequest.lambda$apply$7(RetryRequest.java:80)\n\tat org.openqa.selenium.remote.http.AddSeleniumUserAgent.lambda$apply$0(AddSeleniumUserAgent.java:42)\n\tat org.openqa.selenium.remote.http.Filter.lambda$andFinally$1(Filter.java:56)\n\tat org.openqa.selenium.remote.http.netty.NettyHttpHandler.execute(NettyHttpHandler.java:51)\n\tat org.openqa.selenium.remote.http.RetryRequest.lambda$apply$6(RetryRequest.java:80)\n\tat net.jodah.failsafe.Functions.lambda$get$0(Functions.java:48)\n\tat net.jodah.failsafe.RetryPolicyExecutor.lambda$supply$0(RetryPolicyExecutor.java:66)\n\tat net.jodah.failsafe.RetryPolicyExecutor.lambda$supply$0(RetryPolicyExecutor.java:66)\n\tat net.jodah.failsafe.RetryPolicyExecutor.lambda$supply$0(RetryPolicyExecutor.java:66)\n\tat 
net.jodah.failsafe.Execution.executeSync(Execution.java:128)\n\tat net.jodah.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:379)\n\tat net.jodah.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:68)\n\tat org.openqa.selenium.remote.http.RetryRequest.lambda$apply$7(RetryRequest.java:80)\n\tat org.openqa.selenium.remote.http.AddSeleniumUserAgent.lambda$apply$0(AddSeleniumUserAgent.java:42)\n\tat org.openqa.selenium.remote.http.Filter.lambda$andFinally$1(Filter.java:56)\n\tat org.openqa.selenium.remote.http.netty.NettyClient.execute(NettyClient.java:110)\n\tat org.openqa.selenium.remote.tracing.TracedHttpClient.execute(TracedHttpClient.java:55)\n\tat org.openqa.selenium.grid.web.ReverseProxyHandler.execute(ReverseProxyHandler.java:92)\n\tat org.openqa.selenium.grid.node.ProtocolConvertingSession.execute(ProtocolConvertingSession.java:75)\n\tat org.openqa.selenium.grid.node.local.SessionSlot.execute(SessionSlot.java:123)\n\tat org.openqa.selenium.grid.node.local.LocalNode.executeWebDriverCommand(LocalNode.java:393)\n\tat org.openqa.selenium.grid.node.ForwardWebDriverCommand.execute(ForwardWebDriverCommand.java:35)\n\tat org.openqa.selenium.remote.http.Route$PredicatedRoute.handle(Route.java:373)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:68)\n\tat org.openqa.selenium.remote.tracing.SpanWrappedHttpHandler.execute(SpanWrappedHttpHandler.java:86)\n\tat org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:64)\n\tat org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:336)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:68)\n\tat org.openqa.selenium.grid.node.Node.execute(Node.java:240)\n\tat org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:336)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:68)\n\tat org.openqa.selenium.remote.AddWebDriverSpecHeaders.lambda$apply$0(AddWebDriverSpecHeaders.java:35)\n\tat 
org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)\n\tat org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:64)\n\tat org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)\n\tat org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:64)\n\tat org.openqa.selenium.netty.server.SeleniumHandler.lambda$channelRead0$0(SeleniumHandler.java:44)\n\tat java.base\u002fjava.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat java.base\u002fjava.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base\u002fjava.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base\u002fjava.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base\u002fjava.lang.Thread.run(Thread.java:829)\nCaused by: java.net.ConnectException: Connection refused: localhost\u002f127.0.0.1:52937\n\tat org.asynchttpclient.netty.channel.NettyConnectListener.onFailure(NettyConnectListener.java:179)\n\tat org.asynchttpclient.netty.channel.NettyChannelConnector$1.onFailure(NettyChannelConnector.java:108)\n\tat org.asynchttpclient.netty.SimpleChannelFutureListener.operationComplete(SimpleChannelFutureListener.java:28)\n\tat org.asynchttpclient.netty.SimpleChannelFutureListener.operationComplete(SimpleChannelFutureListener.java:20)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)\n\tat io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)\n\tat io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)\n\tat io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)\n\tat 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321)\n\tat io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\t... 1 more\nCaused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost\u002f127.0.0.1:52937\nCaused by: java.net.ConnectException: Connection refused\n\tat java.base\u002fsun.nio.ch.SocketChannelImpl.checkConnect(Native Method)\n\tat java.base\u002fsun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)\n\tat io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)\n\tat io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat 
java.base\u002fjava.lang.Thread.run(Thread.java:829)\n","exception.type": "java.io.UncheckedIOException","http.flavor": 1,"http.handler_class": "org.openqa.selenium.remote.http.Route$PredicatedRoute","http.host": "se4-hub.k8s.tools.blend.com:4444","http.method": "POST","http.request_content_length": "120","http.scheme": "HTTP","http.target": "\u002fsession\u002f1fc868cc66914a5c0728bc6575e959e9\u002factions","http.user_agent": "webdriver\u002f7.7.4"}}
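To see how long a driver was alive before requests to it started failing, the start line and the first `Connection refused` line for the same port can be paired up. The following is a minimal sketch, assuming the line layout shown in the excerpt above (a leading collector timestamp followed by the message); the helper name `seconds_until_first_refusal` is hypothetical.

```python
import re
from datetime import datetime

# Both patterns capture the leading "date time" stamp and the driver port.
START_RE = re.compile(r"^(\S+ \S+) Starting ChromeDriver .* on port (\d+)")
REFUSED_RE = re.compile(
    r"^(\S+ \S+) .*Connection refused: localhost/127\.0\.0\.1:(\d+)"
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_until_first_refusal(lines):
    """Map each ChromeDriver port to the seconds between start and first refusal."""
    started = {}
    gaps = {}
    for line in lines:
        match = START_RE.match(line)
        if match:
            started[match.group(2)] = datetime.strptime(match.group(1), TS_FORMAT)
            continue
        match = REFUSED_RE.match(line)
        if match and match.group(2) in started and match.group(2) not in gaps:
            refused_at = datetime.strptime(match.group(1), TS_FORMAT)
            gaps[match.group(2)] = (refused_at - started[match.group(2)]).total_seconds()
    return gaps


# The first two lines of the log excerpt from this report.
excerpt = [
    "2022-03-23 01:22:34.388 Starting ChromeDriver 99.0.4844.51 "
    "(d537ec02474b5afe23684e7963d538896c63ac77-refs/branch-heads/4844@{#875}) "
    "on port 52937",
    "2022-03-23 01:25:52.427 08:25:52.426 WARN [SpanWrappedHttpHandler.execute] - "
    "Unable to execute request: java.net.ConnectException: "
    "Connection refused: localhost/127.0.0.1:52937",
]
gaps = seconds_until_first_refusal(excerpt)
```

For the excerpt above, the driver on port 52937 was started roughly 198 seconds before the first refused connection. The failing request was a `POST` to `/session/.../actions`, so a session already existed on that driver when the connection was refused.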
Operating System
Selenium Grid 4.1.2
Selenium version
Selenium Grid 4.1.2
What are the browser(s) and version(s) where you see this issue?
Chrome
What are the browser driver(s) and version(s) where you see this issue?
ChromeDriver 99.0.4844.51
Are you using Selenium Grid?
4.1.2
Issue Analytics
- Created a year ago
- Comments:12 (5 by maintainers)
Top GitHub Comments
I agree. In part, I was hoping to get some tips on further things that could be done to debug the situation. I also agree with your guess that Chrome is crashing.
My attempts to put together a minimal reproduction have not been successful. I am able to put together a grid locally with docker-compose and see the same Connection refused errors, but they [...] Connection refused happens.
Looking forward to 4.1.3. We’ll upgrade to it when it is available.
It is really hard to know what is going on with the information provided. A wild guess is that Chrome crashes, and the connection retry mechanism in 4.1.2 kicks in and keeps retrying for an extended time until it gives up on connecting.
We have changed the default so that this retry mechanism is now avoided; the change will be part of 4.1.3. I can leave this issue open until 4.1.3 is released so you can try again.
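To illustrate why a retry loop against a dead driver can hold a request for a long time, here is a generic sketch. It is not Selenium's actual retry policy; the attempt counts, timeouts, and function names are hypothetical and chosen only to show the arithmetic.

```python
def worst_case_blocking_seconds(attempts, connect_timeout_s, delay_s):
    """Seconds one caller is held if every attempt fails after its full timeout."""
    # Each attempt can take up to the timeout; a delay separates consecutive attempts.
    return attempts * connect_timeout_s + max(attempts - 1, 0) * delay_s


def call_with_retries(operation, attempts, on_wait=lambda attempt: None):
    """Call operation until it succeeds or the attempts are exhausted."""
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionRefusedError as err:
            last_error = err
            if attempt < attempts - 1:
                on_wait(attempt)  # a real client would sleep or back off here
    raise last_error


calls = []


def always_refused():
    """Stand-in for a request to a driver process that has already exited."""
    calls.append(1)
    raise ConnectionRefusedError("Connection refused: localhost/127.0.0.1:52937")


try:
    call_with_retries(always_refused, attempts=3)
    outcome = "connected"
except ConnectionRefusedError:
    outcome = "gave up"
```

With these made-up numbers, three attempts with a 10 second timeout and a 2 second delay hold the caller for up to `worst_case_blocking_seconds(3, 10, 2)` = 34 seconds. If each blocked request holds a thread from a bounded pool, a busy grid could run out of threads for new session requests, which would be consistent with the guess above.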