"Cluster is not reachable" but still responding
Environment

Crate 1.0.2
Amazon Linux AMI 2016.09.0.20160923 x86_64 HVM GP2
3 AWS t2.micro nodes

crate.yml:
cluster.name: OE-CRATE
node.name: v1.0 | us-west-2c | t2.micro | i-1f2ce4c2
node.zone: us-west-2c
path.data: /opt/crate/data
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: us-west-2a,us-west-2b,us-west-2c
gateway.expected_nodes: 3
gateway.recover_after_nodes: 2
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.type: ec2
column_policy: strict
psql.enabled: true
Symptoms

The following error appears in each node's Crate logs every few seconds:
[2017-02-07 22:02:35,751][WARN ][transport ] [v1.0 | us-west-2c | t2.micro | i-1f2ce4c2] Received response for a request that has timed out, sent [3451ms] ago, timed out [426ms] ago, action [crate/sql/sys/nodes], node [{v1.0 | us-west-2c | t2.micro | i-1f2ce4c2}{ilG0SOMZTs2NEn-g4vSPfQ}{172.31.9.215}{172.31.9.215:4300}{info.extended.type=sigar, zone=us-west-2c, http_address=172.31.9.215:4200}], id [3138422]
The error first appeared for node us-west-2b. After restarting the Crate service on 2b the issue seemed to be resolved, but after a few hours the same error started showing up again, this time for 2c.
I attempted to look at the cluster via the web admin UI, which reports "Cluster is not reachable". Regardless, I am still able to run queries against the cluster and my application is still functioning.
I enabled the jobs log to look for long-running jobs, but none of my application's queries even show up in the top 100 longest-running queries:
select ended - started, * from sys.jobs_log order by ended - started desc limit 100;
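(For reference, the jobs log is disabled by default. A minimal sketch of enabling it at runtime, assuming the stats settings documented for CrateDB 1.x; the buffer size below is only an illustrative value:)

SET GLOBAL PERSISTENT stats.enabled = true;          -- start recording jobs into sys.jobs / sys.jobs_log
SET GLOBAL PERSISTENT stats.jobs_log_size = 25000;   -- illustrative size of the finished-jobs ring buffer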
I also checked dmesg for issues, and nothing there corresponds to the errors.
Top GitHub Comments
@tellezb It seems that you have too many shards distributed across your nodes. This causes very high CPU load on each of your nodes and may lead to the ping timeouts that you discovered. Can you provide your configured table schema so we can see which column you partition on? The more shards you have, the more CPU they cost.
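For anyone hitting the same symptom, a quick sketch of checking shard distribution and CPU load from SQL, assuming the sys.shards and sys.nodes tables as documented for CrateDB 1.x:

-- Shards per node; a very high count per node is the suspect here.
SELECT node['name'] AS node_name, count(*) AS shard_count
FROM sys.shards
GROUP BY node['name']
ORDER BY shard_count DESC;

-- 1/5/15-minute load averages per node.
SELECT name, load['1'], load['5'], load['15'] FROM sys.nodes;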
Also, if you have monitoring data on the cluster metrics from before and after the cluster upgrade, it would be helpful if you could share it with us so we can check for a performance regression.
Found the root cause: too many shards on each node, which led to running out of heap memory. I think we can close this issue now and add some docs about that. Ref: https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
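A sketch of how that heap pressure can be spotted from SQL, assuming the heap columns of sys.nodes in CrateDB 1.x:

-- Heap usage per node; consistently sitting near heap['max'] is consistent with
-- too many shards, since each shard carries a fixed heap overhead.
SELECT name, heap['used'] AS heap_used, heap['max'] AS heap_max
FROM sys.nodes
ORDER BY heap['used'] DESC;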