Avoid or reduce connection attempts to nodes in fail state

Hi,

Lettuce 4.3.0.Final. We have a Redis cluster with 3 nodes. Each node runs 10 processes: processes on ports 9000…9004 are masters, 9005…9009 are slaves. So we have 15 masters and 15 slaves.

Lettuce was configured to use the ReadFromSlave read implementation. I figured out that it’s a dangerous option, which led to the proposal in https://github.com/mp911de/lettuce/issues/452

Additionally, Lettuce is configured with a pretty small request timeout (5 ms) and a large connection timeout (about 1 s).
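For reference, that configuration looks roughly like this (a minimal sketch against the Lettuce 4.3 com.lambdaworks API; the seed URI, host name, and exact builder calls are illustrative assumptions, not our production code):

```java
import java.util.concurrent.TimeUnit;

import com.lambdaworks.redis.ReadFrom;
import com.lambdaworks.redis.RedisURI;
import com.lambdaworks.redis.SocketOptions;
import com.lambdaworks.redis.cluster.ClusterClientOptions;
import com.lambdaworks.redis.cluster.RedisClusterClient;
import com.lambdaworks.redis.cluster.api.StatefulRedisClusterConnection;

public class ClusterSetup {

    public static void main(String[] args) {
        // Illustrative seed node; any reachable cluster node works as a seed.
        RedisClusterClient client = RedisClusterClient.create(RedisURI.create("redis://node1:9000"));

        // Large connection timeout (~1 s), as described above.
        client.setOptions(ClusterClientOptions.builder()
                .socketOptions(SocketOptions.builder()
                        .connectTimeout(1, TimeUnit.SECONDS)
                        .build())
                .build());

        StatefulRedisClusterConnection<String, String> connection = client.connect();

        // Small request timeout (5 ms) and reads routed to slaves.
        connection.setTimeout(5, TimeUnit.MILLISECONDS);
        connection.setReadFrom(ReadFrom.SLAVE);
    }
}
```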

We lost one node of the cluster. ReadFromSlave forces reads from slaves, which were not available. This led to the exceptions provided in the gist; the full partitions state can be found in the PartitionsException.txt file there.

Problem: a lot of threads get stuck trying to set up connections, in a blocking manner, to nodes which:

  • already have a fail state in the Redis logs, and Lettuce is aware of it
  • constantly produce a ConnectTimeoutException when we try to connect to them

So Lettuce keeps allowing many incoming threads that call get() to try to establish a connection to a dead node, each blocking for the full, large connection timeout. In our case this brought the server down.

Proposal: if Lettuce sees that a node has a fail state in the Redis cluster and the node is actually unavailable judging by previous attempts, don’t allow a lot of threads to establish new connections and block on them. Quickly throw “RedisException: Cannot determine a partition to read” in nearly all such cases. For example, allow only one thread at a time to establish a connection to a node that is currently dead. Right now I see about one quick RedisException: connection timed out per one long blocking ConnectTimeoutException.
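To make the idea concrete, here is a rough sketch of such a gate (a hypothetical helper, not Lettuce internals; FailedNodeGate and connectOrFailFast are made-up names). The caller would only route through it for nodes the cluster topology already marks as failed:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

import com.lambdaworks.redis.RedisException;

// Hypothetical sketch: gate connection attempts to nodes known to be failed,
// so at most one thread at a time pays the long blocking connect timeout.
public class FailedNodeGate {

    // One flag per host:port that is currently marked FAIL by the cluster.
    private final ConcurrentHashMap<String, AtomicBoolean> inFlight = new ConcurrentHashMap<>();

    public <T> T connectOrFailFast(String hostAndPort, Supplier<T> connect) {
        AtomicBoolean gate = inFlight.computeIfAbsent(hostAndPort, k -> new AtomicBoolean());

        if (!gate.compareAndSet(false, true)) {
            // Another thread is already probing this dead node: fail fast
            // instead of blocking on the long connect timeout.
            throw new RedisException("Cannot determine a partition to read");
        }
        try {
            return connect.get(); // the single blocking attempt
        } finally {
            gate.set(false);
        }
    }
}
```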

WDYT?

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

1 reaction
mp911de commented, Jan 31, 2017

Removing the failed future must happen after completion exactly once. I think it’s a synchronization issue. Ok, then we’ve found a viable solution.

Let’s keep the quiet time after connection failure separate. That’s something that can be built on top of RedisClusterClient and requires slight adjustments to the visibility of the connectStateful and connectStatefulAsync methods. This way you can keep track of failed connections per host and cache the resulting future. ClusterNodeConnectionFactory.getOrCreateConnection(…) is per ConnectionKey, which also incorporates the connection intent (read/write), but in your case you want to group connections by host/port.
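A sketch of what such a quiet-time tracker could look like (a hypothetical type on top of RedisClusterClient, not part of Lettuce; the window length and host/port keying are assumptions):

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: after a connection failure, reject new attempts to the
// same host for a cool-down ("quiet time") window.
public class QuietTimeTracker {

    private final long quietTimeMillis;
    private final ConcurrentHashMap<String, Long> lastFailure = new ConcurrentHashMap<>();

    public QuietTimeTracker(long quietTimeMillis) {
        this.quietTimeMillis = quietTimeMillis;
    }

    // True while the host is still inside its quiet-time window.
    public boolean isQuiet(String hostAndPort) {
        Long failedAt = lastFailure.get(hostAndPort);
        return failedAt != null && System.currentTimeMillis() - failedAt < quietTimeMillis;
    }

    public void recordFailure(String hostAndPort) {
        lastFailure.put(hostAndPort, System.currentTimeMillis());
    }

    public void recordSuccess(String hostAndPort) {
        lastFailure.remove(hostAndPort);
    }
}
```

A subclass could consult isQuiet(…) before delegating to connectStatefulAsync and fail fast while the window is open.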

1 reaction
mp911de commented, Jan 31, 2017

Good catch, I created #460 for the mentioned bug. Thanks @Spikhalskiy for digging into the issue.

I think there are many approaches that would work. I like the approach of synchronizing with a CompletableFuture because it follows non-blocking connection initialization. I see two things here:

  1. I don’t like the blocking as it works today: neither the fact that one client is blocked while connecting nor that many threads are blocked.
  2. I don’t like that multiple requests are serialized, so the last thread pays the penalty of the longest waiting time.

These are different problems but still related. For now, I’d like to solve your issue by reducing multiple connection attempts to the same ConnectionKey to at most one.

I think the change isn’t huge:

  1. ClusterNodeConnectionFactory would return a CompletableFuture<StatefulRedisConnection<K, V>> so connecting happens asynchronously.
  2. A ConcurrentHashMap synchronizes on ConnectionKey to guarantee only one connection attempt.
  3. getOrCreateConnection() returns early with a future that is used to synchronize.
  4. If the connection succeeds: 🎉 🎊
  5. If the connection fails, it’s required to remove the failed ConnectionKey from the map exactly once and propagate the connection exception.
  6. Subsequent attempts start as if there had been no connection, so basically go to 1.

The ConcurrentHashMap should be encapsulated in its own type to make the underlying concept clearer.
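A sketch of such an encapsulating type (hypothetical; the key is left generic where Lettuce would use ConnectionKey, and the actual connect logic is passed in):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch: at most one in-flight connection attempt per key; a failed attempt
// is removed from the map exactly once.
public class AsyncConnectionMap<K, C> {

    private final ConcurrentHashMap<K, CompletableFuture<C>> connections = new ConcurrentHashMap<>();

    public CompletableFuture<C> getOrCreateConnection(K key, Function<K, CompletableFuture<C>> connector) {
        CompletableFuture<C> existing = connections.get(key);
        if (existing != null) {
            return existing; // return early with the future used to synchronize
        }

        CompletableFuture<C> attempt = new CompletableFuture<>();
        CompletableFuture<C> raced = connections.putIfAbsent(key, attempt);
        if (raced != null) {
            return raced; // another thread won the race; share its attempt
        }

        connector.apply(key).whenComplete((connection, throwable) -> {
            if (throwable != null) {
                // Remove the failed key exactly once, then propagate the
                // connection exception (steps 5 and 6 above).
                connections.remove(key, attempt);
                attempt.completeExceptionally(throwable);
            } else {
                attempt.complete(connection); // step 4: 🎉
            }
        });

        return attempt;
    }
}
```

Using putIfAbsent plus remove(key, attempt) keeps the “remove the failed future exactly once” invariant from the earlier comment without blocking any caller.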

Does this make sense?
