Client not able to reconnect instantly when a node gets down then up again with multi-node configuration

🐛 Bug Report

When the client is configured with multiple ES nodes and all nodes go offline (then come back online) at the same time, the client takes a long time (about 6 minutes in my tests) before it can reconnect successfully.

To Reproduce

Using a local 2-node ES cluster (ES version 8.2, not that it really matters):

es-node-1 config

cluster.name: test-cluster
node.name: node-1
xpack.security.enabled: false
http.port: 9201
cluster.initial_master_nodes: ["node-1", "node-2"]

es-node-2 config

cluster.name: test-cluster
node.name: node-2
xpack.security.enabled: false
http.port: 9202
cluster.initial_master_nodes: ["node-1", "node-2"]

test script

const elasticsearch = require('@elastic/elasticsearch');

const client = new elasticsearch.Client({
  sniffOnStart: false,
  sniffOnConnectionFault: false,
  sniffInterval: false,
  nodes: ['http://localhost:9201', 'http://localhost:9202'],
});

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const ping = async () => {
  try {
    await client.ping();
    console.log('ping successful');
  } catch (e) {
    console.log('error during ping: ', e.message);
  }
};

// Ping once per second, indefinitely.
const chainPing = async () => {
  await ping();
  await delay(1000);
  chainPing();
};

chainPing();

Scenarios

1. stop node 1, then restart node 1

All ok, all calls successful

2. stop node 2, then restart node 2

All ok, all calls successful

3. stop node 1, stop node 2, restart node 1, restart node 2

When both nodes are down, we see the error "error during ping: There are no living connections" (which is expected).
Once both nodes are up again, the errors persist for some time for all calls (which should not occur).
From my tests, this period seems to be approximately 6 minutes.

4. stop node 1, restart node 1, stop node 2, restart node 2

Stopping and restarting node 1 causes no problem. However, as soon as the second node is stopped, we observe the same behavior as in the previous scenario: the client does not retry connecting to node 1 right away, and needs approximately 6 minutes before doing so.
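To put a number on the delay reported in scenarios 3 and 4, a small tracker like the one below can be used. This is only a sketch: createOutageTracker is a hypothetical helper, not part of @elastic/elasticsearch, and it measures from the first failed ping to the first successful one, so the result includes the time the nodes were actually down.

```javascript
// Hypothetical helper (not part of @elastic/elasticsearch).
// Tracks how long a run of consecutive failures lasted.
function createOutageTracker(now = Date.now) {
  let failingSince = null; // timestamp of the first failure in the current run

  return {
    // Call with `false` after a failed ping and `true` after a successful one.
    // Returns the outage length in ms on the first success that follows one
    // or more failures, and null in every other case.
    record(ok) {
      if (!ok) {
        if (failingSince === null) failingSince = now();
        return null;
      }
      if (failingSince === null) return null;
      const elapsedMs = now() - failingSince;
      failingSince = null;
      return elapsedMs;
    },
  };
}
```

In the test script, this could be wired into ping by calling record(true) after a successful client.ping() and record(false) in the catch branch, logging the returned value whenever it is not null.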

Expected behavior

When nodes go down and come back up, the client should be able to reconnect to them as soon as they’re up and running again, especially if all nodes are considered down.

FWIW, this works as expected with a single-node configuration. When the client is configured with a single node and that node is down, we get a slightly different error (connect ECONNREFUSED 127.0.0.1:9201), and the client is able to communicate with ES again as soon as the node is back up.

I’m assuming this is because the client, when configured with multiple nodes, implements an eviction strategy with an eviction period for nodes that are down. I also assume the strategy doesn’t handle the case where all nodes are down at the same time, in which case it should try more actively to reconnect to at least one node.
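To make that assumption concrete, here is a minimal sketch of what such an eviction strategy could look like, including a fallback for the "all nodes evicted" case. It is an illustration only and not the actual @elastic/elasticsearch or elastic-transport code: SketchPool, baseTimeoutMs, and the doubling backoff are all hypothetical.

```javascript
// Illustrative sketch only: NOT the real client implementation.
// Every name below is hypothetical.
class SketchPool {
  constructor(urls, { baseTimeoutMs = 60000, now = Date.now } = {}) {
    this.now = now;
    this.baseTimeoutMs = baseTimeoutMs;
    // deadUntil === 0 means the node is considered alive.
    this.nodes = urls.map((url) => ({ url, deadUntil: 0, failures: 0 }));
  }

  // Evict a node after a failed request.
  markDead(url) {
    const node = this.nodes.find((n) => n.url === url);
    node.failures += 1;
    // Assumed backoff: each consecutive failure doubles the eviction period.
    node.deadUntil = this.now() + this.baseTimeoutMs * 2 ** (node.failures - 1);
  }

  // Reset a node after a successful request.
  markAlive(url) {
    const node = this.nodes.find((n) => n.url === url);
    node.failures = 0;
    node.deadUntil = 0;
  }

  // Choose the node for the next request.
  pick() {
    const current = this.now();
    const usable = this.nodes.filter((n) => n.deadUntil <= current);
    if (usable.length > 0) return usable[0];
    // Every node is evicted: rather than failing with "no living connections",
    // retry the node whose eviction period expires first.
    return this.nodes.reduce((a, b) => (a.deadUntil <= b.deadUntil ? a : b));
  }
}
```

With a fallback like the last branch of pick, a request issued while every node is marked dead would still be attempted against one node, so the client would notice a restarted node on the next call instead of waiting out the eviction period.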

Your Environment

node version: 16.14.2
@elastic/elasticsearch version: reproduced on 8.0.0 and elasticsearch-canary@8.2.0-canary.2
os: Mac

Top GitHub Comments

jodevsa commented, Jul 29, 2022

I’ve opened an issue describing the main problem and the root cause here: https://github.com/elastic/elastic-transport-js/issues/53

TinaHeiligers commented, Jun 16, 2022

I think we collected enough element for a maintainer to pick that issue up eventually.

Agreed.
