Nodes not able to discover themselves using aws discovery in 1.2.0
CrateDB version: 1.2.0
JVM version: 1.8.0_121-b13
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
OS version / environment description: Amazon Linux ami-920798f2 (The official Crate 1.2.0 AMI). We are running 3 m4.medium servers to test the upgrade from 1.1.1 to 1.2.0. These are all located in the same availability zone, in the same region. They share the same security group, IAM user, and a matching tag.
The only modifications to crate.yml are the following (for both 1.1.1 and 1.2.0):
discovery.type: ec2
discovery.ec2.groups: dev-crate-instance-group
discovery.ec2.tag.crate-env: dev
discovery.zen.minimum_master_nodes: 2
gateway:
  recover_after_nodes: 3
  recover_after_time: 5m
  expected_nodes: 3
Problem description:
When the nodes are started in version 1.2.0, they cannot complete the discovery process. We see the following warning in the startup logs:
[ec2-user@ip-172-30-1-180 ~]$ sudo tail -f /var/log/crate/crate.log
[2017-04-26T20:21:50,178][INFO ][i.c.rest ] [Montagne Durbonas] Elasticsearch HTTP REST API not enabled
[2017-04-26T20:21:50,195][INFO ][o.e.b.BootstrapProxy$1 ] [Montagne Durbonas] initialized
[2017-04-26T20:21:50,195][INFO ][o.e.n.Node ] [Montagne Durbonas] starting ...
[2017-04-26T20:21:50,247][INFO ][psql ] [Montagne Durbonas] publish_address {127.0.0.1:5432}, bound_addresses {127.0.0.1:5432}
[2017-04-26T20:21:50,247][INFO ][i.c.b.BlobService ] [Montagne Durbonas] BlobService.doStart() io.crate.blob.BlobService@1a8e44fe
[2017-04-26T20:21:50,272][INFO ][o.e.h.HttpServer ] [Montagne Durbonas] publish_address {127.0.0.1:4200}, bound_addresses {[::1]:4200}, {127.0.0.1:4200}
[2017-04-26T20:21:50,303][INFO ][o.e.t.TransportService ] [Montagne Durbonas] publish_address {127.0.0.1:4300}, bound_addresses {[::1]:4300}, {127.0.0.1:4300}
[2017-04-26T20:22:20,336][WARN ][o.e.n.Node ] [Montagne Durbonas] timed out while waiting for initial discovery state - timeout: 30s
[2017-04-26T20:22:20,336][INFO ][o.e.n.Node ] [Montagne Durbonas] started
This occurs on all nodes. We've double-checked that the nodes can reach each other on both ports 4200 and 4300.
If we attempt to curl the index page of one of the nodes we get a response like this:
{
  "ok" : false,
  "status" : 503,
  "name" : "Elm",
  "cluster_name" : "crate",
  "version" : {
    "number" : "1.2.0",
    "build_hash" : "af006fa24762e47da523e09181258a7a3cda5849",
    "build_timestamp" : "2017-04-24T11:58:22Z",
    "build_snapshot" : false,
    "es_version" : "5.0.2",
    "lucene_version" : "6.2.1"
  }
}
After we were unable to upgrade an existing cluster, we tried to roll out a brand new cluster with the AMI mentioned above and experienced the exact same issue. After rolling back to 1.1.1, the cluster came up fine and the nodes discovered each other.
Steps to reproduce:
- Launch 2 or more ami-920798f2 instances in the same region and subnet.
- Ensure that they have the same security group and one shared tag, and that the security group allows traffic on port 4300 between the instances. Also ensure they share an IAM user that has the ec2:DescribeInstances permission.
- Modify /etc/crate/crate.yml with the aforementioned configuration.
- Start the nodes. They will not be able to discover each other.
Thank you for the help and please let me know if I need to add more details anywhere.
Top GitHub Comments
hi @petreboy14 I’ve been looking at the relevant code and discovered that the automatic detection of the region has accidentally been removed with the 1.2 update 😦 we’ll provide a fix for that in the next 1.2.x release
@chaudum great news! Looking forward.
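Until a fixed 1.2.x release is available, one possible workaround is to set the region explicitly instead of relying on automatic detection. The sketch below repeats the discovery settings from the report and adds a region line. It assumes that CrateDB 1.2.0 honors the `cloud.aws.region` setting of the Elasticsearch 5.0 EC2 discovery plugin, which this thread does not confirm, and `us-west-2` is only a placeholder for the region the instances actually run in.

```yaml
# crate.yml (sketch): the EC2 discovery settings from the report, plus an explicit region.
discovery.type: ec2
discovery.ec2.groups: dev-crate-instance-group
discovery.ec2.tag.crate-env: dev
discovery.zen.minimum_master_nodes: 2
# Assumption: cloud.aws.region is the Elasticsearch 5.0 EC2 plugin setting;
# it is not confirmed in this thread that CrateDB 1.2.0 reads it.
# us-west-2 is a placeholder; replace it with the region of your instances.
cloud.aws.region: us-west-2
```

If the setting takes effect, the nodes should no longer log the "timed out while waiting for initial discovery state" warning after a restart.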