[🐛 Bug]: Selenium grid 4 hub becomes unresponsive to new sessions when Chrome does not start

What happened?

We have seen this issue regularly since we began using Selenium Grid 4 a few months ago, and only recently were we able to form a conjecture about the cause. When Chrome fails to start on one of the nodes, the hub stops handling new session requests. The hub continues to respond to its health checks and to serve HTTP requests, but no new sessions begin on any of the nodes. We only see this behavior on our Selenium Grid deployment that receives very frequent traffic and almost always has one or two sessions running on it.

Unfortunately, we have not been able to reproduce the issue outside of our Kubernetes cluster. We also have not been able to detect a pattern in when it occurs. Sometimes it happens within hours of the Selenium Grid being redeployed on the cluster, and sometimes we go days without running into it.

We are also unsure of what more can be done to debug the situation.
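One way to narrow down the symptom described above (the hub answers health checks but starts no sessions) is to compare what the hub reports about its nodes with what is actually happening. The following is a minimal sketch, not the reporter's tooling: it assumes the hub's `/status` endpoint returns JSON shaped like `{"value": {"ready": ..., "nodes": [{"uri": ..., "availability": ..., "slots": [{"session": ...}]}]}}`, and the helper names `fetch_status` and `summarize_status` are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_status(hub_url, timeout=5.0):
    """Fetch the hub's /status payload. hub_url is a base URL such as http://se4-hub:4444."""
    with urlopen(f"{hub_url.rstrip('/')}/status", timeout=timeout) as resp:
        return json.load(resp)


def summarize_status(payload):
    """Reduce a /status payload (assumed shape, see lead-in) to a few counters."""
    value = payload.get("value", {})
    nodes = value.get("nodes", [])
    total_slots = 0
    busy_slots = 0
    not_up = []
    for node in nodes:
        slots = node.get("slots", [])
        total_slots += len(slots)
        # A slot is treated as busy when it carries a non-empty session object.
        busy_slots += sum(1 for slot in slots if slot.get("session"))
        if node.get("availability") != "UP":
            not_up.append(node.get("uri"))
    return {
        "ready": bool(value.get("ready")),
        "nodes": len(nodes),
        "slots": total_slots,
        "busy": busy_slots,
        "free": total_slots - busy_slots,
        "not_up": not_up,
    }
```

If a summary like this shows `ready` as true and free slots on nodes that are up while new session requests still time out, that would point at the hub side (queue or distributor) rather than at a lack of capacity. Logging the summary periodically makes it easier to line up with the node logs.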

How can we reproduce the issue?

These are the environment variables that are supplied to the Chrome node containers:

      SE_EVENT_BUS_HOST:                  se4-hub
      SE_EVENT_BUS_SUBSCRIBE_PORT:        4443
      SE_EVENT_BUS_PUBLISH_PORT:          4442
      SE_NODE_MAX_SESSIONS:               2
      NODE_HEARTBEAT_PERIOD:              30
      NODE_OVERRIDE_MAX_SESSIONS:         true
      NODE_SESSION_TIMEOUT:               180

These are the environment variables supplied to the Hub:

      SESSIONQUEUE_REQUEST_TIMEOUT:         120
      SESSIONQUEUE_SESSION_RETRY_INTERVAL:  5
      LOGGING_LOG_LEVEL:                    INFO
      SERVER_MAX_THREADS:                   64
      DISTRIBUTOR_HEALTHCHECK_INTERVAL:     30

Relevant log output

2022-03-23 01:22:34.388	Starting ChromeDriver 99.0.4844.51 (d537ec02474b5afe23684e7963d538896c63ac77-refs/branch-heads/4844@{#875}) on port 52937
2022-03-23 01:25:52.427	08:25:52.426 WARN [SpanWrappedHttpHandler.execute] - Unable to execute request: java.net.ConnectException: Connection refused: localhost/127.0.0.1:52937
2022-03-23 01:25:52.427	java.io.UncheckedIOException: java.net.ConnectException: Connection refused: localhost/127.0.0.1:52937
2022-03-23 01:25:52.427	Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:52937
2022-03-23 01:25:52.427	Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:52937
2022-03-23 01:25:52.428	08:25:52.427 WARN [SeleniumSpanExporter$1.lambda$export$0] - {"traceId": "30203ffeee13a6b855c708470eac460f","eventTime": 1648023952425653053,"eventName": "exception","attributes": {"exception.message": "Unable to execute request: java.net.ConnectException: Connection refused: localhost\u002f127.0.0.1:52937","exception.stacktrace": "java.io.UncheckedIOException: java.net.ConnectException: Connection refused: localhost\u002f127.0.0.1:52937\n\tat org.openqa.selenium.remote.http.netty.NettyHttpHandler.makeCall(NettyHttpHandler.java:80)\n\tat org.openqa.selenium.remote.http.RetryRequest.lambda$apply$6(RetryRequest.java:80)\n\tat net.jodah.failsafe.Functions.lambda$get$0(Functions.java:48)\n\tat net.jodah.failsafe.RetryPolicyExecutor.lambda$supply$0(RetryPolicyExecutor.java:66)\n\tat net.jodah.failsafe.RetryPolicyExecutor.lambda$supply$0(RetryPolicyExecutor.java:66)\n\tat net.jodah.failsafe.RetryPolicyExecutor.lambda$supply$0(RetryPolicyExecutor.java:66)\n\tat net.jodah.failsafe.Execution.executeSync(Execution.java:128)\n\tat net.jodah.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:379)\n\tat net.jodah.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:68)\n\tat org.openqa.selenium.remote.http.RetryRequest.lambda$apply$7(RetryRequest.java:80)\n\tat org.openqa.selenium.remote.http.AddSeleniumUserAgent.lambda$apply$0(AddSeleniumUserAgent.java:42)\n\tat org.openqa.selenium.remote.http.Filter.lambda$andFinally$1(Filter.java:56)\n\tat org.openqa.selenium.remote.http.netty.NettyHttpHandler.execute(NettyHttpHandler.java:51)\n\tat org.openqa.selenium.remote.http.RetryRequest.lambda$apply$6(RetryRequest.java:80)\n\tat net.jodah.failsafe.Functions.lambda$get$0(Functions.java:48)\n\tat net.jodah.failsafe.RetryPolicyExecutor.lambda$supply$0(RetryPolicyExecutor.java:66)\n\tat net.jodah.failsafe.RetryPolicyExecutor.lambda$supply$0(RetryPolicyExecutor.java:66)\n\tat net.jodah.failsafe.RetryPolicyExecutor.lambda$supply$0(RetryPolicyExecutor.java:66)\n\tat 
net.jodah.failsafe.Execution.executeSync(Execution.java:128)\n\tat net.jodah.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:379)\n\tat net.jodah.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:68)\n\tat org.openqa.selenium.remote.http.RetryRequest.lambda$apply$7(RetryRequest.java:80)\n\tat org.openqa.selenium.remote.http.AddSeleniumUserAgent.lambda$apply$0(AddSeleniumUserAgent.java:42)\n\tat org.openqa.selenium.remote.http.Filter.lambda$andFinally$1(Filter.java:56)\n\tat org.openqa.selenium.remote.http.netty.NettyClient.execute(NettyClient.java:110)\n\tat org.openqa.selenium.remote.tracing.TracedHttpClient.execute(TracedHttpClient.java:55)\n\tat org.openqa.selenium.grid.web.ReverseProxyHandler.execute(ReverseProxyHandler.java:92)\n\tat org.openqa.selenium.grid.node.ProtocolConvertingSession.execute(ProtocolConvertingSession.java:75)\n\tat org.openqa.selenium.grid.node.local.SessionSlot.execute(SessionSlot.java:123)\n\tat org.openqa.selenium.grid.node.local.LocalNode.executeWebDriverCommand(LocalNode.java:393)\n\tat org.openqa.selenium.grid.node.ForwardWebDriverCommand.execute(ForwardWebDriverCommand.java:35)\n\tat org.openqa.selenium.remote.http.Route$PredicatedRoute.handle(Route.java:373)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:68)\n\tat org.openqa.selenium.remote.tracing.SpanWrappedHttpHandler.execute(SpanWrappedHttpHandler.java:86)\n\tat org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:64)\n\tat org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:336)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:68)\n\tat org.openqa.selenium.grid.node.Node.execute(Node.java:240)\n\tat org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:336)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:68)\n\tat org.openqa.selenium.remote.AddWebDriverSpecHeaders.lambda$apply$0(AddWebDriverSpecHeaders.java:35)\n\tat 
org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)\n\tat org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:64)\n\tat org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)\n\tat org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:64)\n\tat org.openqa.selenium.netty.server.SeleniumHandler.lambda$channelRead0$0(SeleniumHandler.java:44)\n\tat java.base\u002fjava.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat java.base\u002fjava.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base\u002fjava.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base\u002fjava.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base\u002fjava.lang.Thread.run(Thread.java:829)\nCaused by: java.net.ConnectException: Connection refused: localhost\u002f127.0.0.1:52937\n\tat org.asynchttpclient.netty.channel.NettyConnectListener.onFailure(NettyConnectListener.java:179)\n\tat org.asynchttpclient.netty.channel.NettyChannelConnector$1.onFailure(NettyChannelConnector.java:108)\n\tat org.asynchttpclient.netty.SimpleChannelFutureListener.operationComplete(SimpleChannelFutureListener.java:28)\n\tat org.asynchttpclient.netty.SimpleChannelFutureListener.operationComplete(SimpleChannelFutureListener.java:20)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)\n\tat io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)\n\tat io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)\n\tat io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)\n\tat 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321)\n\tat io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\t... 1 more\nCaused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost\u002f127.0.0.1:52937\nCaused by: java.net.ConnectException: Connection refused\n\tat java.base\u002fsun.nio.ch.SocketChannelImpl.checkConnect(Native Method)\n\tat java.base\u002fsun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)\n\tat io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)\n\tat io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat 
java.base\u002fjava.lang.Thread.run(Thread.java:829)\n","exception.type": "java.io.UncheckedIOException","http.flavor": 1,"http.handler_class": "org.openqa.selenium.remote.http.Route$PredicatedRoute","http.host": "se4-hub.k8s.tools.blend.com:4444","http.method": "POST","http.request_content_length": "120","http.scheme": "HTTP","http.target": "\u002fsession\u002f1fc868cc66914a5c0728bc6575e959e9\u002factions","http.user_agent": "webdriver\u002f7.7.4"}}
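To check whether the delay between ChromeDriver starting and the first refused connection is consistent across incidents, the node logs can be scanned mechanically. This is a minimal sketch that assumes each log line begins with a `YYYY-MM-DD HH:MM:SS.mmm` timestamp followed by whitespace, as in the excerpt above; the function name `refused_delays` is hypothetical.

```python
import re
from datetime import datetime

# Leading timestamp, then the rest of the log line.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+(.*)$")
START_RE = re.compile(r"Starting ChromeDriver .* on port (\d+)")
REFUSED_RE = re.compile(r"Connection refused: [^\s:]+:(\d+)")


def refused_delays(lines):
    """Map each driver port to the seconds between its start line and its
    first 'Connection refused' line."""
    started = {}
    delays = {}
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # continuation lines without a timestamp are skipped
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        message = match.group(2)
        start = START_RE.search(message)
        if start:
            started[start.group(1)] = stamp
            continue
        refused = REFUSED_RE.search(message)
        if refused:
            port = refused.group(1)
            # Only the first refusal per port is recorded.
            if port in started and port not in delays:
                delays[port] = (stamp - started[port]).total_seconds()
    return delays


SAMPLE = [
    "2022-03-23 01:22:34.388\tStarting ChromeDriver 99.0.4844.51 on port 52937",
    "2022-03-23 01:25:52.427\t08:25:52.426 WARN [SpanWrappedHttpHandler.execute] - "
    "Unable to execute request: java.net.ConnectException: "
    "Connection refused: localhost/127.0.0.1:52937",
]

if __name__ == "__main__":
    print(refused_delays(SAMPLE))
```

For the two timestamps in the excerpt above, the gap on port 52937 works out to roughly 198 seconds. Running the same scan over several incidents would show whether the delay clusters around a fixed value, which could hint at a timeout rather than a random crash.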

Operating System

Selenium Grid 4.1.2

Selenium version

Selenium Grid 4.1.2

What are the browser(s) and version(s) where you see this issue?

Chrome

What are the browser driver(s) and version(s) where you see this issue?

ChromeDriver 99.0.4844.51

Are you using Selenium Grid?

4.1.2

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 12 (5 by maintainers)

Top GitHub Comments

1 reaction
millerick commented, Mar 25, 2022

"Really hard to know what is going on with the information provided."

I agree. In part, I was hoping to get some tips on further things that could be done to debug the situation. I also agree with your guess that Chrome is crashing.

My attempts to put together a minimal reproduction have not been successful. I can put together a grid locally with docker-compose and see the same Connection refused errors, but they

  1. are not as delayed as they are on our Kubernetes cluster, where there are always about 110 seconds between the log line for Chrome starting and the Connection refused error.
  2. do not cause the hub to become unresponsive to new sessions.

Looking forward to 4.1.3. We’ll upgrade to it when it is available.

1 reaction
diemol commented, Mar 25, 2022

Really hard to know what is going on with the information provided. A wild guess might be that Chrome crashes and the mechanism in 4.1.2 for connection retries kicks in and stays there for an extended time until it realizes it cannot connect.

We have changed the default so that this retry mechanism is now avoided; the change will be part of 4.1.3. I can leave this issue open until 4.1.3 is released so you can try again.
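The guess above (Chrome or its driver dies, and the node keeps retrying a dead port) can be checked directly on a node while a session appears stuck. The sketch below is only an illustration of a fail-fast probe, not Selenium's own retry logic: it tries a single TCP connection to the driver port with a short timeout. The function name and the default timeout are assumptions.

```python
import socket


def port_accepts_connections(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Calling `port_accepts_connections("localhost", 52937)` on the node shortly after the `Starting ChromeDriver ... on port 52937` line would show whether the driver process ever listened on that port or went away before the first command reached it.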
