"Cluster is not reachable" but still responding

Environment

Crate 1.0.2
Amazon Linux AMI 2016.09.0.20160923 x86_64 HVM GP2
3 AWS t2.micro nodes

crate.yml:

cluster.name: OE-CRATE
node.name: v1.0 | us-west-2c | t2.micro | i-1f2ce4c2
node.zone: us-west-2c
path.data: /opt/crate/data
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: us-west-2a,us-west-2b,us-west-2c
gateway.expected_nodes: 3
gateway.recover_after_nodes: 2
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.type: ec2
column_policy: strict
psql.enabled: true

Symptoms

The following error is showing in each node’s Crate logs every few seconds:

[2017-02-07 22:02:35,751][WARN ][transport ] [v1.0 | us-west-2c | t2.micro | i-1f2ce4c2] Received response for a request that has timed out, sent [3451ms] ago, timed out [426ms] ago, action [crate/sql/sys/nodes], node [{v1.0 | us-west-2c | t2.micro | i-1f2ce4c2}{ilG0SOMZTs2NEn-g4vSPfQ}{172.31.9.215}{172.31.9.215:4300}{info.extended.type=sigar, zone=us-west-2c, http_address=172.31.9.215:4200}], id [3138422]
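
To see which transport actions time out and how late the responses arrive, warnings like the one above can be tallied with a short script. This is only a sketch: the regular expression is written against the single log line quoted in this issue, the function names `parse_timeout` and `count_by_action` are made up for illustration, and other Crate versions may format the message differently.

```python
import re
from collections import Counter

# Matches the timing and action fields of the transport WARN line quoted above.
# Assumption: the message layout is the one seen in this issue's log.
TIMEOUT_RE = re.compile(
    r"sent \[(?P<sent>\d+)ms\] ago, "
    r"timed out \[(?P<late>\d+)ms\] ago, "
    r"action \[(?P<action>[^\]]+)\]"
)


def parse_timeout(line):
    """Return the timing fields of a timed-out-response warning, or None."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    sent_ms = int(match.group("sent"))
    late_ms = int(match.group("late"))
    return {
        "action": match.group("action"),
        "sent_ms": sent_ms,              # age of the request when the response arrived
        "late_ms": late_ms,              # how long after the timeout the response arrived
        "waited_ms": sent_ms - late_ms,  # how long the request waited before timing out
    }


def count_by_action(lines):
    """Count timed-out responses per transport action."""
    counts = Counter()
    for line in lines:
        parsed = parse_timeout(line)
        if parsed is not None:
            counts[parsed["action"]] += 1
    return counts


# Fragment of the log line from this issue.
sample = (
    "Received response for a request that has timed out, "
    "sent [3451ms] ago, timed out [426ms] ago, action [crate/sql/sys/nodes]"
)
print(parse_timeout(sample))
```

For the line in this issue, the request had waited 3451 - 426 = 3025 ms when it timed out, and the action is `crate/sql/sys/nodes`, a query against the `sys.nodes` table.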

The error first appeared for node us-west-2b. Restarting the Crate service on 2b seemed to resolve the issue, but after a few hours the same error started showing up again, this time for 2c.

I attempted to look at the cluster via the web admin UI. It reports that the “Cluster is not reachable”. Regardless, I am still able to run queries against the cluster and my application is still functioning.

I enabled the jobs log to look for long-running jobs, but no queries from my application even show up among the 100 longest-running queries.

select ended - started, * from sys.jobs_log order by ended - started desc limit 100;

I also used dmesg to look for issues and there is nothing that corresponds.

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 18 (9 by maintainers)

Top GitHub Comments

christianbader commented, Feb 10, 2017

@tellezb It seems that you have too many shards distributed across your nodes. This causes a very high CPU load on each node and may lead to the ping timeout issue that you discovered. Can you share your table schema so we can see which column you partition on? The more shards you have, the more CPU it costs.

Also, if you have monitoring data on the cluster metrics from before and after the cluster update, it would be great if you could share it with us so we can check for a performance regression.

DLT1412 commented, Nov 16, 2017

Found the root cause: too many shards on each node, which caused the nodes to run out of heap memory. I think we can close this issue now and add some docs about that. Ref: https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
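
A rough way to see how partitioning multiplies the shard count: each partition of a partitioned table gets its own set of shards, and every replica adds another copy of each. The sketch below just does that arithmetic. The table schema was never posted in this thread, so the numbers (365 daily partitions, 6 shards, 1 replica) are purely hypothetical, as are the function names.

```python
def total_shards(partitions, shards_per_partition, replicas):
    """Primary plus replica shards for one partitioned table."""
    return partitions * shards_per_partition * (1 + replicas)


def shards_per_node(total, nodes):
    """Shards each node holds if they are spread evenly (rounded up)."""
    return -(-total // nodes)  # ceiling division


# Hypothetical table: one partition per day for a year, 6 shards, 1 replica.
total = total_shards(partitions=365, shards_per_partition=6, replicas=1)
per_node = shards_per_node(total, nodes=3)  # the cluster in this issue has 3 nodes
print(total, per_node)
```

With these made-up numbers, a single table already puts well over a thousand shards on each of three t2.micro nodes, which is the kind of load described in the comments above.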
