
Nodes not able to discover themselves using aws discovery in 1.2.0


CrateDB version: 1.2.0

JVM version: 1.8.0_121-b13

openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)

OS version / environment description: Amazon Linux ami-920798f2 (The official Crate 1.2.0 AMI). We are running 3 m4.medium servers to test the upgrade from 1.1.1 to 1.2.0. These are all located in the same availability zone, in the same region. They share the same security group, IAM user, and a matching tag.

The only modifications to crate.yml are the following (for both 1.1.1 and 1.2.0):

discovery.type: ec2
discovery.ec2.groups: dev-crate-instance-group
discovery.ec2.tag.crate-env: dev
discovery.zen.minimum_master_nodes: 2
gateway:
  recover_after_nodes: 3
  recover_after_time: 5m
  expected_nodes: 3
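
(With three master-eligible nodes, the usual quorum rule of n/2 + 1 (integer division) gives 3/2 + 1 = 2, so discovery.zen.minimum_master_nodes: 2 matches the cluster size.)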

Problem description:

When the nodes are started on version 1.2.0, they cannot complete the discovery process. We see the following error in the startup logs:

[ec2-user@ip-172-30-1-180 ~]$ sudo tail -f /var/log/crate/crate.log
[2017-04-26T20:21:50,178][INFO ][i.c.rest                 ] [Montagne Durbonas] Elasticsearch HTTP REST API not enabled
[2017-04-26T20:21:50,195][INFO ][o.e.b.BootstrapProxy$1   ] [Montagne Durbonas] initialized
[2017-04-26T20:21:50,195][INFO ][o.e.n.Node               ] [Montagne Durbonas] starting ...
[2017-04-26T20:21:50,247][INFO ][psql                     ] [Montagne Durbonas] publish_address {127.0.0.1:5432}, bound_addresses {127.0.0.1:5432}
[2017-04-26T20:21:50,247][INFO ][i.c.b.BlobService        ] [Montagne Durbonas] BlobService.doStart() io.crate.blob.BlobService@1a8e44fe
[2017-04-26T20:21:50,272][INFO ][o.e.h.HttpServer         ] [Montagne Durbonas] publish_address {127.0.0.1:4200}, bound_addresses {[::1]:4200}, {127.0.0.1:4200}
[2017-04-26T20:21:50,303][INFO ][o.e.t.TransportService   ] [Montagne Durbonas] publish_address {127.0.0.1:4300}, bound_addresses {[::1]:4300}, {127.0.0.1:4300}
[2017-04-26T20:22:20,336][WARN ][o.e.n.Node               ] [Montagne Durbonas] timed out while waiting for initial discovery state - timeout: 30s
[2017-04-26T20:22:20,336][INFO ][o.e.n.Node               ] [Montagne Durbonas] started

This occurs on all nodes. We've double-checked that the nodes can reach each other on both ports 4200 and 4300.
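
A check along these lines from each node confirms it (the peer address below is a placeholder):

# confirm a peer's transport (4300) and HTTP (4200) ports are reachable
nc -zv 172.30.1.181 4300
nc -zv 172.30.1.181 4200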

If we attempt to curl the index page of one of the nodes, we get a response like this:

{
  "ok" : false,
  "status" : 503,
  "name" : "Elm",
  "cluster_name" : "crate",
  "version" : {
    "number" : "1.2.0",
    "build_hash" : "af006fa24762e47da523e09181258a7a3cda5849",
    "build_timestamp" : "2017-04-24T11:58:22Z",
    "build_snapshot" : false,
    "es_version" : "5.0.2",
    "lucene_version" : "6.2.1"
  }
}
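
For reference, the response above comes from a plain GET against the HTTP port; since the logs show HTTP bound to 127.0.0.1, the request has to be made on the node itself:

# issued on the node, as HTTP is bound to loopback
curl http://127.0.0.1:4200/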

After we were unable to upgrade the existing cluster, we tried rolling out a brand-new cluster from the AMI mentioned above and hit exactly the same issue. After rolling back to 1.1.1, the cluster came up fine, including node discovery.

Steps to reproduce:

  1. Launch two or more ami-920798f2 instances in the same region and subnet.
  2. Ensure they share the same security group and at least one common tag, and that the security group allows traffic on port 4300 between them. Also ensure they share an IAM user that is allowed to call ec2:DescribeInstances (a minimal policy sketch follows this list).
  3. Modify /etc/crate/crate.yml with the configuration shown above.
  4. The nodes will fail to discover each other.
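
For step 2, a minimal IAM policy granting just the permission EC2 discovery needs might look like this (the broad resource scope is illustrative):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:DescribeInstances",
      "Resource": "*"
    }
  ]
}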

Thank you for the help, and please let me know if I need to add more details anywhere.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

1 reaction
chaudum commented, May 2, 2017

Hi @petreboy14, I've been looking at the relevant code and discovered that the automatic detection of the region was accidentally removed with the 1.2 update 😦 We'll provide a fix for that in the next 1.2.x release.
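
Until that release is out, one possible workaround (an untested sketch; it assumes the bundled EC2 discovery still honors the standard endpoint setting) is to pin the region's EC2 endpoint explicitly in crate.yml:

# assumption: the instances run in us-east-1; substitute your region's endpoint
discovery.ec2.endpoint: ec2.us-east-1.amazonaws.com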

0 reactions
petreboy14 commented, May 2, 2017

@chaudum great news! Looking forward.

Read more comments on GitHub >
