
Lagom services do not exit on Akka cluster membership removal


Lagom Version (1.2.x / 1.3.x / etc)

1.3.7

API (Scala / Java / Neither / Both)

Scala and Java

Operating System (Ubuntu 15.10 / MacOS 10.10 / Windows 10)

Observed on OS X and Linux.

JDK (Oracle 1.8.0_112, OpenJDK 1.8.x, Azul Zing)

For OS X:

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

Library Dependencies

Akka SBR 1.0.3
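For reference, a hedged sketch of the kind of configuration that enables the keep-majority strategy discussed below. The provider class name and timeout value are assumptions based on the commercial SBR 1.x documentation and may differ for your version:

```hocon
# Assumed SBR 1.x wiring -- verify against your SBR version's docs
akka.cluster.downing-provider-class = "com.lightbend.akka.sbr.SplitBrainResolverProvider"

akka.cluster.split-brain-resolver {
  active-strategy = keep-majority
  stable-after = 10s
}
```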

Expected Behavior

When a network partition occurs (real or simulated), Akka SBR causes a service instance to leave its Akka cluster when that instance is deemed to be in the minority under a keep-majority strategy. Upon leaving the Akka cluster, we expect the Lagom instance to exit with a non-zero exit code so that its orchestrator can schedule a replacement instance.

Actual Behavior

We see that Akka SBR makes the correct decision to leave the cluster, but the Lagom service instance does not terminate.
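The expected flow can be modeled without Akka: on membership removal, the service should wait for actor-system termination and then exit, with a watchdog that forces the exit if termination hangs (the hang is exactly what this issue reports). A hedged sketch with illustrative names, not Lagom's actual code; the timeout fallback mirrors the pattern the Akka cluster documentation recommends around `registerOnMemberRemoved`:

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.util.Try

// Hypothetical helper: decide how to exit once this node has been removed
// from the cluster. `whenTerminated` stands in for ActorSystem#whenTerminated.
def exitDecision(whenTerminated: Future[Unit], timeout: FiniteDuration): String =
  if (Try(Await.ready(whenTerminated, timeout)).isSuccess)
    "clean-exit"  // termination completed; System.exit(-1) would run normally
  else
    "forced-exit" // termination hung (this issue); a watchdog must force the exit

// A never-completing future models the hang reported here:
println(exitDecision(Promise[Unit]().future, 500.millis)) // forced-exit
println(exitDecision(Future.successful(()), 500.millis))  // clean-exit
```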

Reproducible Test Case

Using ConductR with a license for 3 agents or more:

  1. Clone https://github.com/typesafehub/prod-suite-management-doc/pull/16
  2. cd prod-suite-management-doc/guides/lagom-sbr-conductr/lagom-scala-sbt
  3. sbt
  4. project hello-lagom-impl
  5. bundle:dist
  6. sandbox run 2.1.5 --no-default-features -n 3 (requires a license for 3 nodes or more)
  7. conduct load eslite
  8. conduct run eslite - this enables logging
  9. conduct load cassandra
  10. conduct run cassandra - not strictly necessary, but reduces the log output considerably
  11. conduct load <press tab key> - this will load the hello-lagom-impl bundle
  12. conduct run hello --scale 3 - 3 instances of the hello-lagom-impl will startup
  13. conduct logs hello -f - follow the progress of startup - you should see that all three members have joined the cluster (look for welcome messages)
  14. conduct info hello will show you the pids of running akka remote endpoints - pick one
  15. Use pstree <pid> to determine the JVM process that runs the hello service
  16. Use kill -SIGSTOP <hello-pid> to suspend the process
  17. You should eventually see messages like “Address is now gated for 5000 ms”, then “irrecoverably failed. Quarantining address”, and then “is still unreachable or has not been restarted. Keeping it quarantined”. At this point, use kill -SIGCONT <hello-pid> to resume the process.
  18. The previously suspended process will then start to issue messages like “Downing [akka.tcp://hello-lagom-impl-1@192.168.10.2:10003], including myself, [2] unreachable…”, “Shutting down myself” and “Shutting down”. I think that the last message will be “Remoting shut down”. The process should exit at this time but doesn’t. ps -p <hello-pid> can be used to verify that the process continues to run.

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

TimMoore commented, Aug 28, 2017 (2 reactions)

Here’s a partial thread dump that shows the deadlock:

2017-08-28 13:17:34
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode):

"Attach Listener" #132 daemon prio=9 os_prio=31 tid=0x00007fe934a13000 nid=0x9b07 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Thread-3" #91 prio=5 os_prio=31 tid=0x00007fe9361be800 nid=0x6907 waiting on condition [0x000070000362c000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000007be28aed0> (a scala.concurrent.impl.Promise$CompletionLatch)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
	at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
	at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
	at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
	at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
	at scala.concurrent.Await$.ready(package.scala:169)
	at play.api.Play$$anonfun$stop$1.apply(Play.scala:128)
	at play.utils.Threads$.withContextClassLoader(Threads.scala:21)
	at play.api.Play$.stop(Play.scala:126)
	at play.core.server.NettyServer$$anonfun$stop$1.apply(NettyServer.scala:247)
	at play.core.server.NettyServer$$anonfun$stop$1.apply(NettyServer.scala:247)
	at scala.Option.foreach(Option.scala:257)
	at play.core.server.NettyServer.stop(NettyServer.scala:247)
	at play.core.server.ProdServerStart$$anonfun$start$1.apply$mcV$sp(ProdServerStart.scala:55)
	at play.core.server.RealServerProcess$$anon$1.run(ServerProcess.scala:44)

"Thread-4" #119 daemon prio=5 os_prio=31 tid=0x00007fe934a39000 nid=0x7107 waiting for monitor entry [0x0000700002e14000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.Shutdown.exit(Shutdown.java:212)
	- waiting to lock <0x00000007b8691290> (a java.lang.Class for java.lang.Shutdown)
	at java.lang.Runtime.exit(Runtime.java:109)
	at java.lang.System.exit(System.java:971)
	at com.lightbend.lagom.internal.cluster.JoinClusterImpl$$anonfun$join$2$$anonfun$apply$1$$anon$1.run(JoinClusterImpl.scala:50)
	at java.lang.Thread.run(Thread.java:748)

"hello-lagom-impl-1-akka.actor.default-dispatcher-38" #115 prio=5 os_prio=31 tid=0x00007fe936a47800 nid=0xa107 in Object.wait() [0x000070000424f000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00000007bf294f18> (a play.core.server.RealServerProcess$$anon$1)
	at java.lang.Thread.join(Thread.java:1252)
	- locked <0x00000007bf294f18> (a play.core.server.RealServerProcess$$anon$1)
	at java.lang.Thread.join(Thread.java:1326)
	at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106)
	at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
	at java.lang.Shutdown.runHooks(Shutdown.java:123)
	at java.lang.Shutdown.sequence(Shutdown.java:167)
	at java.lang.Shutdown.exit(Shutdown.java:212)
	- locked <0x00000007b8691290> (a java.lang.Class for java.lang.Shutdown)
	at java.lang.Runtime.exit(Runtime.java:109)
	at java.lang.System.exit(System.java:971)
	at com.lightbend.lagom.internal.cluster.JoinClusterImpl$$anonfun$join$1.apply$mcV$sp(JoinClusterImpl.scala:34)
	at com.lightbend.lagom.internal.cluster.JoinClusterImpl$$anonfun$join$1.apply(JoinClusterImpl.scala:34)
	at com.lightbend.lagom.internal.cluster.JoinClusterImpl$$anonfun$join$1.apply(JoinClusterImpl.scala:34)
	at akka.actor.ActorSystemImpl$$anon$3.run(ActorSystem.scala:842)
	at akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$addRec$1$1.applyOrElse(ActorSystem.scala:1021)
	at akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$addRec$1$1.applyOrElse(ActorSystem.scala:1021)
	at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
	at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
	at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
	at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
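The dump shows the cycle: the Akka dispatcher thread entered System.exit from Lagom's registerOnMemberRemoved callback and holds the JVM-global java.lang.Shutdown class lock while joining the application shutdown hooks; Play's shutdown hook ("Thread-3") blocks in Play.stop awaiting a future that cannot complete during shutdown; and Lagom's dedicated exit thread ("Thread-4") blocks waiting for the same Shutdown lock. A minimal model of that lock/join cycle using plain JDK primitives (illustrative names; no Akka or Play involved, and timeouts stand in for the real code's unbounded waits):

```scala
import java.util.concurrent.TimeUnit
import java.util.concurrent.locks.ReentrantLock

object ShutdownDeadlockModel {
  // Returns true iff the "second System.exit" could acquire the lock.
  def secondExitCanProceed(): Boolean = {
    val shutdownLock = new ReentrantLock() // stands in for the java.lang.Shutdown class lock

    // Plays the role of Play's shutdown hook (Thread-3): it never finishes,
    // because it awaits work that cannot complete during shutdown.
    val hook = new Thread(() => Thread.sleep(60000))
    hook.setDaemon(true)

    // Plays the role of the dispatcher thread: it entered System.exit first,
    // so it holds the Shutdown lock while joining the shutdown hook.
    val dispatcher = new Thread(() => {
      shutdownLock.lock()
      try {
        hook.start()
        hook.join(2000) // the real runHooks() joins with no timeout -- forever
      } finally shutdownLock.unlock()
    })
    dispatcher.start()
    Thread.sleep(200) // let the dispatcher take the lock first

    // Plays the role of Lagom's exit thread (Thread-4): its System.exit
    // blocks on the already-held Shutdown lock and never returns.
    val acquired = shutdownLock.tryLock(500, TimeUnit.MILLISECONDS)
    if (acquired) shutdownLock.unlock()
    dispatcher.join()
    acquired
  }
}

// While the hook never finishes, the second exit attempt cannot proceed:
println(s"second exit can proceed: ${ShutdownDeadlockModel.secondExitCanProceed()}")
```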
ignasi35 commented, Aug 29, 2017 (0 reactions)

Closing: the scope of this issue is Lagom 1.3.x, and we already have #975 for its equivalent work on master.
