
Lagom services do not exit on Akka cluster membership removal


Lagom Version (1.2.x / 1.3.x / etc)

1.3.7

API (Scala / Java / Neither / Both)

Scala and Java

Operating System (Ubuntu 15.10 / MacOS 10.10 / Windows 10)

Observed on OS X and Linux.

JDK (Oracle 1.8.0_112, OpenJDK 1.8.x, Azul Zing)

For OS X:

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

Library Dependencies

Akka SBR 1.0.3
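For reference, a hedged sketch of the kind of configuration that enables the keep-majority strategy discussed below. The provider class name and timeout value are assumptions based on the commercial SBR 1.x documentation and may differ for your version:

```hocon
# Assumed SBR 1.x wiring -- verify against your SBR version's docs
akka.cluster.downing-provider-class = "com.lightbend.akka.sbr.SplitBrainResolverProvider"

akka.cluster.split-brain-resolver {
  active-strategy = keep-majority
  stable-after = 10s
}
```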

Expected Behavior

When a network partition occurs (real or simulated), Akka SBR causes a service instance to leave its Akka cluster when that instance is deemed to be in the minority under a keep-majority strategy. Upon leaving the Akka cluster, we expect the Lagom instance to exit with a non-zero exit code so that its orchestrator can schedule a replacement instance.

Actual Behavior

We see that Akka SBR makes the correct decision to leave the cluster, but the Lagom service instance does not terminate.
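The expected flow can be modeled without Akka: on membership removal, the service should wait for actor-system termination and then exit, with a watchdog that forces the exit if termination hangs (the hang is exactly what this issue reports). A hedged sketch with illustrative names, not Lagom's actual code; the timeout fallback mirrors the pattern the Akka cluster documentation recommends around `registerOnMemberRemoved`:

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.util.Try

// Hypothetical helper: decide how to exit once this node has been removed
// from the cluster. `whenTerminated` stands in for ActorSystem#whenTerminated.
def exitDecision(whenTerminated: Future[Unit], timeout: FiniteDuration): String =
  if (Try(Await.ready(whenTerminated, timeout)).isSuccess)
    "clean-exit"  // termination completed; System.exit(-1) would run normally
  else
    "forced-exit" // termination hung (this issue); a watchdog must force the exit

// A never-completing future models the hang reported here:
println(exitDecision(Promise[Unit]().future, 500.millis)) // forced-exit
println(exitDecision(Future.successful(()), 500.millis))  // clean-exit
```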

Reproducible Test Case

Using ConductR with a license for 3 agents or more:

  1. Clone https://github.com/typesafehub/prod-suite-management-doc/pull/16
  2. cd prod-suite-management-doc/guides/lagom-sbr-conductr/lagom-scala-sbt
  3. sbt
  4. project hello-lagom-impl
  5. bundle:dist
  6. sandbox run 2.1.5 --no-default-features -n 3 (requires a license for 3 nodes or more)
  7. conduct load eslite
  8. conduct run eslite - this enables logging
  9. conduct load cassandra
  10. conduct run cassandra - not strictly necessary, but reduces the log output considerably
  11. conduct load <press tab key> - this will load the hello-lagom-impl bundle
  12. conduct run hello --scale 3 - 3 instances of the hello-lagom-impl will startup
  13. conduct logs hello -f - follow the progress of startup - you should see that all three members have joined the cluster (look for welcome messages)
  14. conduct info hello will show you the pids of running akka remote endpoints - pick one
  15. Use pstree <pid> to determine the JVM process that runs the hello service
  16. Use kill -SIGSTOP <hello-pid> to suspend the process
  17. You should eventually see messages like “Address is now gated for 5000 ms”, then “irrecoverably failed. Quarantining address”, and then “is still unreachable or has not been restarted. Keeping it quarantined”. At this point, use kill -SIGCONT <hello-pid> to resume the process.
  18. The previously suspended process will then start to issue messages like “Downing [akka.tcp://hello-lagom-impl-1@192.168.10.2:10003], including myself, [2] unreachable…”, “Shutting down myself” and “Shutting down”. I think that the last message will be “Remoting shut down”. The process should exit at this time but doesn’t. ps -p <hello-pid> can be used to verify that the process continues to run.

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

TimMoore commented, Aug 28, 2017 (2 reactions)

Here’s a partial thread dump that shows the deadlock:

2017-08-28 13:17:34
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode):

"Attach Listener" #132 daemon prio=9 os_prio=31 tid=0x00007fe934a13000 nid=0x9b07 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Thread-3" #91 prio=5 os_prio=31 tid=0x00007fe9361be800 nid=0x6907 waiting on condition [0x000070000362c000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000007be28aed0> (a scala.concurrent.impl.Promise$CompletionLatch)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
	at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
	at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
	at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
	at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
	at scala.concurrent.Await$.ready(package.scala:169)
	at play.api.Play$$anonfun$stop$1.apply(Play.scala:128)
	at play.utils.Threads$.withContextClassLoader(Threads.scala:21)
	at play.api.Play$.stop(Play.scala:126)
	at play.core.server.NettyServer$$anonfun$stop$1.apply(NettyServer.scala:247)
	at play.core.server.NettyServer$$anonfun$stop$1.apply(NettyServer.scala:247)
	at scala.Option.foreach(Option.scala:257)
	at play.core.server.NettyServer.stop(NettyServer.scala:247)
	at play.core.server.ProdServerStart$$anonfun$start$1.apply$mcV$sp(ProdServerStart.scala:55)
	at play.core.server.RealServerProcess$$anon$1.run(ServerProcess.scala:44)

"Thread-4" #119 daemon prio=5 os_prio=31 tid=0x00007fe934a39000 nid=0x7107 waiting for monitor entry [0x0000700002e14000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.Shutdown.exit(Shutdown.java:212)
	- waiting to lock <0x00000007b8691290> (a java.lang.Class for java.lang.Shutdown)
	at java.lang.Runtime.exit(Runtime.java:109)
	at java.lang.System.exit(System.java:971)
	at com.lightbend.lagom.internal.cluster.JoinClusterImpl$$anonfun$join$2$$anonfun$apply$1$$anon$1.run(JoinClusterImpl.scala:50)
	at java.lang.Thread.run(Thread.java:748)

"hello-lagom-impl-1-akka.actor.default-dispatcher-38" #115 prio=5 os_prio=31 tid=0x00007fe936a47800 nid=0xa107 in Object.wait() [0x000070000424f000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00000007bf294f18> (a play.core.server.RealServerProcess$$anon$1)
	at java.lang.Thread.join(Thread.java:1252)
	- locked <0x00000007bf294f18> (a play.core.server.RealServerProcess$$anon$1)
	at java.lang.Thread.join(Thread.java:1326)
	at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106)
	at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
	at java.lang.Shutdown.runHooks(Shutdown.java:123)
	at java.lang.Shutdown.sequence(Shutdown.java:167)
	at java.lang.Shutdown.exit(Shutdown.java:212)
	- locked <0x00000007b8691290> (a java.lang.Class for java.lang.Shutdown)
	at java.lang.Runtime.exit(Runtime.java:109)
	at java.lang.System.exit(System.java:971)
	at com.lightbend.lagom.internal.cluster.JoinClusterImpl$$anonfun$join$1.apply$mcV$sp(JoinClusterImpl.scala:34)
	at com.lightbend.lagom.internal.cluster.JoinClusterImpl$$anonfun$join$1.apply(JoinClusterImpl.scala:34)
	at com.lightbend.lagom.internal.cluster.JoinClusterImpl$$anonfun$join$1.apply(JoinClusterImpl.scala:34)
	at akka.actor.ActorSystemImpl$$anon$3.run(ActorSystem.scala:842)
	at akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$addRec$1$1.applyOrElse(ActorSystem.scala:1021)
	at akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$addRec$1$1.applyOrElse(ActorSystem.scala:1021)
	at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
	at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
	at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
	at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
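The dump shows the cycle: the Akka dispatcher thread entered System.exit from Lagom's registerOnMemberRemoved callback and holds the JVM-global java.lang.Shutdown class lock while joining the application shutdown hooks; Play's shutdown hook ("Thread-3") blocks in Play.stop awaiting a future that cannot complete during shutdown; and Lagom's dedicated exit thread ("Thread-4") blocks waiting for the same Shutdown lock. A minimal model of that lock/join cycle using plain JDK primitives (illustrative names; no Akka or Play involved, and timeouts stand in for the real code's unbounded waits):

```scala
import java.util.concurrent.TimeUnit
import java.util.concurrent.locks.ReentrantLock

object ShutdownDeadlockModel {
  // Returns true iff the "second System.exit" could acquire the lock.
  def secondExitCanProceed(): Boolean = {
    val shutdownLock = new ReentrantLock() // stands in for the java.lang.Shutdown class lock

    // Plays the role of Play's shutdown hook (Thread-3): it never finishes,
    // because it awaits work that cannot complete during shutdown.
    val hook = new Thread(() => Thread.sleep(60000))
    hook.setDaemon(true)

    // Plays the role of the dispatcher thread: it entered System.exit first,
    // so it holds the Shutdown lock while joining the shutdown hook.
    val dispatcher = new Thread(() => {
      shutdownLock.lock()
      try {
        hook.start()
        hook.join(2000) // the real runHooks() joins with no timeout -- forever
      } finally shutdownLock.unlock()
    })
    dispatcher.start()
    Thread.sleep(200) // let the dispatcher take the lock first

    // Plays the role of Lagom's exit thread (Thread-4): its System.exit
    // blocks on the already-held Shutdown lock and never returns.
    val acquired = shutdownLock.tryLock(500, TimeUnit.MILLISECONDS)
    if (acquired) shutdownLock.unlock()
    dispatcher.join()
    acquired
  }
}

// While the hook never finishes, the second exit attempt cannot proceed:
println(s"second exit can proceed: ${ShutdownDeadlockModel.secondExitCanProceed()}")
```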
ignasi35 commented, Aug 29, 2017 (0 reactions)

Closing: the scope of this issue is Lagom 1.3.x, and we already have #975 for its equivalent work on master.
