"Cluster is not reachable" but still responding
Environment

Crate 1.0.2
Amazon Linux AMI 2016.09.0.20160923 x86_64 HVM GP2
3 AWS t2.micro nodes

crate.yml:
cluster.name: OE-CRATE
node.name: v1.0 | us-west-2c | t2.micro | i-1f2ce4c2
node.zone: us-west-2c
path.data: /opt/crate/data
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: us-west-2a,us-west-2b,us-west-2c
gateway.expected_nodes: 3
gateway.recover_after_nodes: 2
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.type: ec2
column_policy: strict
psql.enabled: true
Symptoms

The following error appears in each node's Crate logs every few seconds:
[2017-02-07 22:02:35,751][WARN ][transport ] [v1.0 | us-west-2c | t2.micro | i-1f2ce4c2] Received response for a request that has timed out, sent [3451ms] ago, timed out [426ms] ago, action [crate/sql/sys/nodes], node [{v1.0 | us-west-2c | t2.micro | i-1f2ce4c2}{ilG0SOMZTs2NEn-g4vSPfQ}{172.31.9.215}{172.31.9.215:4300}{info.extended.type=sigar, zone=us-west-2c, http_address=172.31.9.215:4200}], id [3138422]
The error first appeared for node us-west-2b. After restarting the Crate service on 2b the issue seemed to be resolved, but after a few hours the same error started showing up again, this time for 2c.
I attempted to look at the cluster via the web admin UI, which reports "Cluster is not reachable". Regardless, I am still able to run queries against the cluster and my application is still functioning.
I enabled the jobs log to look for long-running jobs, but none of my application's queries even show up in the top 100 longest-running queries:
select ended - started, * from sys.jobs_log order by ended - started desc limit 100;
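(For reference, the jobs log is disabled by default. A minimal sketch of enabling it at runtime, assuming the stats settings documented for CrateDB 1.x; the buffer size below is only an illustrative value:)

SET GLOBAL PERSISTENT stats.enabled = true;          -- start recording jobs into sys.jobs / sys.jobs_log
SET GLOBAL PERSISTENT stats.jobs_log_size = 25000;   -- illustrative size of the finished-jobs ring buffer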
I also checked dmesg for issues, and nothing there corresponds to the errors.
Top GitHub Comments
@tellezb It seems that you have too many shards distributed across your nodes. This causes very high CPU load on each of your nodes and may lead to the ping timeouts that you discovered. Can you provide your configured table schema so we can see which column you partition on? The more shards you have, the more CPU they cost.
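For anyone hitting the same symptom, a quick sketch of checking shard distribution and CPU load from SQL, assuming the sys.shards and sys.nodes tables as documented for CrateDB 1.x:

-- Shards per node; a very high count per node is the suspect here.
SELECT node['name'] AS node_name, count(*) AS shard_count
FROM sys.shards
GROUP BY node['name']
ORDER BY shard_count DESC;

-- 1/5/15-minute load averages per node.
SELECT name, load['1'], load['5'], load['15'] FROM sys.nodes;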
Also, if you have monitoring data on the cluster metrics from before and after the cluster upgrade, it would be helpful if you could share it with us so we can check for a performance regression.
Found the root cause: too many shards on each node, which led to running out of heap memory. I think we can close this issue now and add some docs about that. Ref: https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
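A sketch of how that heap pressure can be spotted from SQL, assuming the heap columns of sys.nodes in CrateDB 1.x:

-- Heap usage per node; consistently sitting near heap['max'] is consistent with
-- too many shards, since each shard carries a fixed heap overhead.
SELECT name, heap['used'] AS heap_used, heap['max'] AS heap_max
FROM sys.nodes
ORDER BY heap['used'] DESC;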