Crate Cluster not forming on Docker Swarm

CrateDB version:

3.0.5 (but it seems to affect any version > 2.3.11)

Environment description:

  • Official Crate Docker Image, e.g. crate:3.0.5
  • 2-node local swarm (also tried with 3 nodes, but 2 is faster to reproduce)
  • Docker version 18.06.1-ce, build e68fc7a
  • All replicas attached to a docker overlay network

Problem description:

The latest version of Crate (3.0.7) is not able to form a cluster when running in Docker Swarm mode, as it could up to version 2.3.11.

My previous approach was to use Docker’s dnsrr endpoint mode and set -Cdiscovery.zen.ping.unicast.hosts to the name of the Docker service, so that discovery would eventually gather all the actual container endpoints. I combined this with the -Cnetwork.host flag, which I used to set to 0.0.0.0 (neither of the other options, _local_ and _site_, works now).
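
As a diagnostic aid for the approach described above, here is a minimal sketch that lists every IPv4 address a service name resolves to. It is meant to be run from inside a container attached to the same overlay network. The function name resolve_all is hypothetical, and the default port 4300 is taken from the transport publish_address in the logs; neither comes from CrateDB itself. With dnsrr, one would expect one address per running task, though that is an assumption about the setup rather than something verified here.

```python
import socket


def resolve_all(hostname, port=4300):
    """Return the sorted, de-duplicated IPv4 addresses that `hostname` resolves to.

    Returns an empty list when the name does not resolve, which corresponds
    to the "failed to resolve host" warning in the node logs.
    """
    try:
        infos = socket.getaddrinfo(
            hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})
```

Calling resolve_all("crate") inside one of the service's containers shows whether the service name resolves at all, and how many endpoints discovery could learn about from it.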

The issue seems to be in the discovery process (see logs at the end).

I would like to understand what is going wrong under the hood, so that I can better judge:

  • whether there is something I can do, and
  • which options we have.

I also thought opening this issue would be a good way to stay informed about what is decided on the matter.

Thanks for your work!

Steps to reproduce:

Deploying the following docker-compose.yml reproduces the problem.

version: '3.3'
services:
  crate:
    image: crate:3.0.5
    command: ["crate",
        "-Clicense.enterprise=false",
        "-Cgateway.expected_nodes=2",
        "-Cgateway.recover_after_nodes=1",
        "-Cgateway.recover_after_time=5m",
        "-Cdiscovery.zen.minimum_master_nodes=1",
        "-Cdiscovery.zen.ping.unicast.hosts=crate",
        "-Cdiscovery.zen.ping_timeout=15s",
        "-Cnetwork.host=_local_",
        "-Chttp.cors.enabled=true",
        '-Chttp.cors.allow-origin="*"']
    environment:
      - MAX_MAP_COUNT=262144
      - ES_JAVA_OPTS="-Xms1g -Xmx1g"
      - CRATE_HEAP_SIZE=1g
    deploy:
      endpoint_mode: dnsrr
      mode: global
      labels:
        - "traefik.port=4200"
        - "traefik.frontend.rule=Host:crate.mydomain.com"
        - "traefik.backend.loadbalancer.sticky=true"
        - "traefik.backend=crate"
        - "traefik.backend.loadbalancer.swarm=false"
      update_config:
        parallelism: 1
        delay: 10s

    volumes:
      - cratedata:/data
    networks:
      - backend

volumes:
  cratedata:

networks:
  backend:
    driver: overlay

Additional Logs:

crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:35,812][INFO ][o.e.n.Node               ] [Tête du Clotonnet] initializing ...
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:36,037][INFO ][o.e.e.NodeEnvironment    ] [Tête du Clotonnet] using [1] data paths, mounts [[/data (/dev/sda1)]], net usable_space [15.3gb], net total_space [17.8gb], types [ext4]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:36,039][INFO ][o.e.e.NodeEnvironment    ] [Tête du Clotonnet] heap size [1015.6mb], compressed ordinary object pointers [true]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:37,385][INFO ][i.c.plugin               ] [Tête du Clotonnet] plugins loaded: [jmx-monitoring, hyperLogLog, lang-js]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
crate3_crate.0.t16mhtw8djlh@ms-worker0    | SLF4J: Defaulting to no-operation (NOP) logger implementation
crate3_crate.0.t16mhtw8djlh@ms-worker0    | SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:39,398][INFO ][o.e.p.PluginsService     ] [Tête du Clotonnet] no modules loaded
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:39,410][INFO ][o.e.p.PluginsService     ] [Tête du Clotonnet] loaded plugin [crate-azure-discovery]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:39,411][INFO ][o.e.p.PluginsService     ] [Tête du Clotonnet] loaded plugin [es-repository-hdfs]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:39,411][INFO ][o.e.p.PluginsService     ] [Tête du Clotonnet] loaded plugin [io.crate.plugin.BlobPlugin]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:39,411][INFO ][o.e.p.PluginsService     ] [Tête du Clotonnet] loaded plugin [io.crate.plugin.CrateCorePlugin]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:39,411][INFO ][o.e.p.PluginsService     ] [Tête du Clotonnet] loaded plugin [io.crate.plugin.HttpTransportPlugin]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:39,411][INFO ][o.e.p.PluginsService     ] [Tête du Clotonnet] loaded plugin [io.crate.plugin.PluginLoaderPlugin]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:39,411][INFO ][o.e.p.PluginsService     ] [Tête du Clotonnet] loaded plugin [io.crate.plugin.SrvPlugin]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:39,412][INFO ][o.e.p.PluginsService     ] [Tête du Clotonnet] loaded plugin [io.crate.udc.plugin.UDCPlugin]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:39,412][INFO ][o.e.p.PluginsService     ] [Tête du Clotonnet] loaded plugin [org.elasticsearch.analysis.common.CommonAnalysisPlugin]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:39,414][INFO ][o.e.p.PluginsService     ] [Tête du Clotonnet] loaded plugin [org.elasticsearch.discovery.ec2.Ec2DiscoveryPlugin]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:39,414][INFO ][o.e.p.PluginsService     ] [Tête du Clotonnet] loaded plugin [org.elasticsearch.plugin.repository.url.URLRepositoryPlugin]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:39,415][INFO ][o.e.p.PluginsService     ] [Tête du Clotonnet] loaded plugin [org.elasticsearch.repositories.s3.S3RepositoryPlugin]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:39,415][INFO ][o.e.p.PluginsService     ] [Tête du Clotonnet] loaded plugin [org.elasticsearch.transport.Netty4Plugin]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:39,480][INFO ][o.e.n.Node               ] [Tête du Clotonnet] node name [Tête du Clotonnet], node ID [2HtcoZYZT_m-W6Onr4sZJg]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:39,548][INFO ][o.e.n.Node               ] [Tête du Clotonnet] CrateDB version[3.0.5], pid[1], build[8970370/2018-07-31T06:18:44Z], OS[Linux/4.9.93-boot2docker/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_171/25.171-b11]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:39,549][INFO ][o.e.n.Node               ] [Tête du Clotonnet] JVM arguments [-Xms1g, -Xmx1g, -Djava.awt.headless=true, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Xloggc:/data/log/gc.log, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=16, -XX:GCLogFileSize=64m, -XX:+DisableExplicitGC, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/data/data, -XX:+UnlockExperimentalVMOptions, -XX:+UseCGroupMemoryLimitForHeap, -Des.cgroups.hierarchy.override=/]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:43,177][INFO ][o.e.d.DiscoveryModule    ] [Tête du Clotonnet] using discovery type [zen]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:45,824][INFO ][i.c.p.s.SslContextProvider] HTTP SSL support is disabled.
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:46,878][INFO ][o.e.n.Node               ] [Tête du Clotonnet] initialized
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:46,879][INFO ][o.e.n.Node               ] [Tête du Clotonnet] starting ...
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:47,057][INFO ][psql                     ] [Tête du Clotonnet] PSQL SSL support is disabled.
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:47,262][INFO ][psql                     ] [Tête du Clotonnet] publish_address {127.0.0.1:5432}, bound_addresses {127.0.0.1:5432}
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:47,315][INFO ][i.c.p.h.CrateNettyHttpServerTransport] [Tête du Clotonnet] publish_address {127.0.0.1:4200}, bound_addresses {127.0.0.1:4200}
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:47,352][INFO ][o.e.t.TransportService   ] [Tête du Clotonnet] publish_address {127.0.0.1:4300}, bound_addresses {127.0.0.1:4300}
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:03:50,145][WARN ][o.e.d.z.UnicastZenPing   ] [Tête du Clotonnet] failed to resolve host [crate]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | java.net.UnknownHostException: crate: Name does not resolve
crate3_crate.0.t16mhtw8djlh@ms-worker0    | 	at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) ~[?:1.8.0_171]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | 	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928) ~[?:1.8.0_171]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | 	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323) ~[?:1.8.0_171]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | 	at java.net.InetAddress.getAllByName0(InetAddress.java:1276) ~[?:1.8.0_171]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | 	at java.net.InetAddress.getAllByName(InetAddress.java:1192) ~[?:1.8.0_171]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | 	at java.net.InetAddress.getAllByName(InetAddress.java:1126) ~[?:1.8.0_171]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | 	at org.elasticsearch.transport.TcpTransport.parse(TcpTransport.java:917) ~[crate-app-3.0.5.jar:3.0.5]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | 	at org.elasticsearch.transport.TcpTransport.addressesFromString(TcpTransport.java:872) ~[crate-app-3.0.5.jar:3.0.5]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | 	at org.elasticsearch.transport.TransportService.addressesFromString(TransportService.java:699) ~[crate-app-3.0.5.jar:3.0.5]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | 	at org.elasticsearch.discovery.zen.UnicastZenPing.lambda$null$0(UnicastZenPing.java:213) ~[crate-app-3.0.5.jar:3.0.5]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_171]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:568) [crate-app-3.0.5.jar:3.0.5]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:04:05,192][INFO ][o.e.c.s.MasterService    ] [Tête du Clotonnet] zen-disco-elected-as-master ([0] nodes joined), reason: new_master {Tête du Clotonnet}{2HtcoZYZT_m-W6Onr4sZJg}{eX1oGcS7Q1CKRLQlpZKjow}{127.0.0.1}{127.0.0.1:4300}{http_address=127.0.0.1:4200}
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:04:05,208][INFO ][o.e.c.s.ClusterApplierService] [Tête du Clotonnet] new_master {Tête du Clotonnet}{2HtcoZYZT_m-W6Onr4sZJg}{eX1oGcS7Q1CKRLQlpZKjow}{127.0.0.1}{127.0.0.1:4300}{http_address=127.0.0.1:4200}, reason: apply cluster state (from master [master {Tête du Clotonnet}{2HtcoZYZT_m-W6Onr4sZJg}{eX1oGcS7Q1CKRLQlpZKjow}{127.0.0.1}{127.0.0.1:4300}{http_address=127.0.0.1:4200} committed version [1] source [zen-disco-elected-as-master ([0] nodes joined)]])
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:04:05,216][INFO ][o.e.g.GatewayService     ] [Tête du Clotonnet] delaying initial state recovery for [5m]. expecting [2] nodes, but only have [1]
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:04:05,223][INFO ][o.e.n.Node               ] [Tête du Clotonnet] started
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:05:04,564][INFO ][o.e.n.Node               ] [Tête du Clotonnet] stopping ...
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:05:04,719][INFO ][o.e.n.Node               ] [Tête du Clotonnet] stopped
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:05:04,720][INFO ][o.e.n.Node               ] [Tête du Clotonnet] closing ...
crate3_crate.0.t16mhtw8djlh@ms-worker0    | [2018-08-30T14:05:04,786][INFO ][o.e.n.Node               ] [Tête du Clotonnet] closed
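
The log excerpt above reports publish_address {127.0.0.1:4300}, meaning that the node advertises a loopback address, which other containers cannot connect to. The following is a minimal sketch for spotting this in captured logs. The function name loopback_publish_addresses and the regular expression are illustrative assumptions based only on the line format shown above.

```python
import ipaddress
import re

# Matches the first "{host:port}" group after "publish_address" in a log line.
PUBLISH_RE = re.compile(r"publish_address \{([^}]+)\}")


def loopback_publish_addresses(log_lines):
    """Return the publish addresses ("host:port") that point at a loopback IP."""
    found = []
    for line in log_lines:
        match = PUBLISH_RE.search(line)
        if not match:
            continue
        address = match.group(1)
        # Split off the port; IPv6 hosts are written in brackets, e.g. [::1]:4300.
        host = address.rpartition(":")[0].strip("[]")
        try:
            if ipaddress.ip_address(host).is_loopback:
                found.append(address)
        except ValueError:
            # Not a literal IP address (e.g. a hostname); skip it.
            continue
    return found
```

Feeding the service logs through this function line by line returns a non-empty list whenever a node publishes itself on a loopback address, as it does in the excerpt above with -Cnetwork.host=_local_.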

Update: bumped the latest working version (2.3.11) and the latest non-working version (3.0.7) on Docker Swarm.

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 1
  • Comments: 16 (6 by maintainers)

Top GitHub Comments

sonix07 commented, Oct 26, 2018 (1 reaction)

@taliaga I’m having the same issue, but we have enterprise support. I’ll let you know when we get to the bottom of this behavior.

taliaga commented, Oct 2, 2018 (1 reaction)

Thanks @quodt, that works with crate:2.3.11, so I’ll update the “highest working version”. I have just tried the same setting with 3.0.7, but the reported problem still persists.
