Lagom services do not exit on Akka cluster membership removal
Lagom Version (1.2.x / 1.3.x / etc)
1.3.7
API (Scala / Java / Neither / Both)
Scala and Java
Operating System (Ubuntu 15.10 / MacOS 10.10 / Windows 10)
Observed on OS X and Linux.
JDK (Oracle 1.8.0_112, OpenJDK 1.8.x, Azul Zing)
For OS X:
java version "1.8.0_131"
Java™ SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot™ 64-Bit Server VM (build 25.131-b11, mixed mode)
Library Dependencies
Akka SBR 1.0.3
Expected Behavior
When a network partition occurs, whether simulated or real, Akka SBR causes a service instance to leave its Akka cluster if it is deemed to be in the minority under a keep-majority strategy. Upon leaving the Akka cluster, we expect the Lagom instance to exit with a non-zero exit code so that its orchestrator can then schedule a new instance.
Actual Behavior
We see that Akka SBR makes the correct decision to leave the cluster but the Lagom service instance does not terminate.
Reproducible Test Case
Using ConductR with a license for 3 agents or more:
- Clone https://github.com/typesafehub/prod-suite-management-doc/pull/16
- `cd prod-suite-management-doc/guides/lagom-sbr-conductr/lagom-scala-sbt`
- `sbt`
- `project hello-lagom-impl`
- `bundle:dist`
- `sandbox run 2.1.5 --no-default-features -n 3` (requires a license for 3 nodes or more)
- `conduct load eslite`
- `conduct run eslite` - this enables logging
- `conduct load cassandra`
- `conduct run cassandra` - not strictly necessary, but reduces the log output considerably
- `conduct load <press tab key>` - this will load the `hello-lagom-impl` bundle
- `conduct run hello --scale 3` - 3 instances of the hello-lagom-impl will start up
- `conduct logs hello -f` - follow the progress of startup; you should see that all three members have joined the cluster (look for welcome messages)
- `conduct info hello` will show you the pids of running akka remote endpoints - pick one
- Use `pstree <pid>` to determine the JVM process that runs the hello service
- Use `kill -SIGSTOP <hello-pid>` to suspend the process
- You should eventually see messages like “Address is now gated for 5000 ms” and then “irrecoverably failed. Quarantining address” and then “is still unreachable or has not been restarted. Keeping it quarantined”. At this point, use `kill -SIGCONT <hello-pid>` to unpause the process.
- The previously suspended process will then start to issue messages like “Downing [akka.tcp://hello-lagom-impl-1@192.168.10.2:10003], including myself, [2] unreachable…”, “Shutting down myself” and “Shutting down”. I think that the last message will be “Remoting shut down”. The process should exit at this time but doesn’t.
- `ps <hello-pid>` can be used to verify that the process continues to run.
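
The suspend, resume and verify steps above can be tried in isolation with a small sketch like the one below. It is only an illustration under stated assumptions: a background shell loop stands in for the hello service JVM, and the names `log`, `pid`, `before`, `after`, `resumed` and `alive` are made up for this example. It shows that a process frozen with SIGSTOP makes no progress (which is why its peers see it as unreachable), and that it is still present after SIGCONT.

```shell
#!/bin/sh
# Stand-in for the hello service: a loop that writes one line per second.
# In the real test the target would be <hello-pid> instead.
log=$(mktemp)
( while :; do echo tick; sleep 1; done ) > "$log" &
pid=$!
sleep 2

# Suspend the process, as in the reproduction steps.
kill -STOP "$pid"
sleep 2                                   # let any in-flight work settle
before=$(wc -l < "$log" | tr -d ' ')
sleep 2
after=$(wc -l < "$log" | tr -d ' ')       # unchanged while suspended

# Resume the process; it should start making progress again.
kill -CONT "$pid"
sleep 3
resumed=$(wc -l < "$log" | tr -d ' ')

# kill -0 delivers no signal; it only checks that the process still exists.
if kill -0 "$pid" 2>/dev/null; then
  alive=yes
else
  alive=no
fi

echo "suspended: before=$before after=$after; resumed=$resumed; alive=$alive"

# Clean up the stand-in process and its log file.
kill "$pid"
rm -f "$log"
```

For the issue described here, the interesting check is the last one: after the service logs that it is shutting down, `kill -0 <hello-pid>` (or `ps <hello-pid>`) still reports the process as present.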
Issue Analytics
- State:
- Created 6 years ago
- Comments:7 (7 by maintainers)
Top GitHub Comments
Here’s a partial thread dump that shows the deadlock:
Closing: the scope of this issue is Lagom `1.3.x` and we already have #975 for its equivalent work on `master`.