Perpetual timeouts on slickOffsetStorePrepare cluster start task
Lagom 1.4.4 - Scala API
Ubuntu 16.04
Linux bd167f7a097e 4.13.0-36-generic #40-Ubuntu SMP Fri Feb 16 20:07:48 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
java version "1.8.0_152"
Java(TM) SE Runtime Environment (build 1.8.0_152-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.152-b16, mixed mode)
We run a 2-instance cluster of a Dockerised Lagom service on Marathon (DC/OS), using Marathon contact point discovery (see config below). The problem occurs either when starting both instances for the first time, or when rolling in a new version of the service (2 old -> 2 old + 1 new -> 1 old + 1 new -> 1 old + 2 new -> 2 new). It does not happen every time, though - probably somewhere in the region of 1-5% of the time.
When this issue does occur, the symptom is that one of the two instances gets stuck (doesn’t process any events), and keeps logging this exception indefinitely:
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://application/user/slickOffsetStorePrepare-singletonProxy#-965527677]] after [20000 ms]. Sender[null] sent message of type "com.lightbend.lagom.internal.persistence.cluster.ClusterStartupTaskActor$Execute$".
    at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:595)
    at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:605)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:140)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:866)
    at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109)
    at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:864)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
    at java.lang.Thread.run(Thread.java:748)
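A side note on the 20000 ms: it appears to correspond to lagom.persistence.read-side.global-prepare-timeout, which, if I'm reading Lagom's reference config correctly, defaults to 20s. Raising it would only space out the errors rather than fix whatever is blocking the task, but for anyone who wants to experiment, it would look something like this in application.conf (60s is an arbitrary example value, not a recommendation):

lagom.persistence.read-side {
  # Default appears to be 20s; 60s here is just an example.
  global-prepare-timeout = 60s
}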
Consider the simplest case, in which two new instances start up alongside each other (i.e. launched at the same time). Say Instance X is the one that hits the above error, and Instance Y is the one that starts up normally:
- X starts slightly before Y.
- X becomes aware of Y (via Marathon discovery) before Y even logs anything (we’re talking a few seconds gap though), and starts failing to talk to it (as you would expect).
- X doesn't join itself because Y has a lower IP address, so it keeps polling the contact points looking for a seed node.
- Y joins itself (because it’s the contact point with the lowest IP)
- Y starts its singleton actors
- Y executes cluster start tasks (like slickOffsetStorePrepare)
- X sees that Y is now a seed node and they both shake hands
- X starts throwing errors like the above one. Y couldn’t care less.
The above events are sequential, and cover approximately 25 seconds of elapsed time.
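To make the failure mode concrete: as far as I can tell, slickOffsetStorePrepare-singletonProxy is an Akka ClusterSingletonProxy, and the startup task is executed by asking through that proxy. The proxy buffers messages until it has identified the singleton (which runs on the oldest node, i.e. Y), and if it never does, every ask fails with exactly the AskTimeoutException above. A minimal sketch of that interaction - this is not Lagom's actual code, and Execute below is a stand-in for Lagom's internal ClusterStartupTaskActor message:

import akka.actor.ActorSystem
import akka.cluster.singleton.{ClusterSingletonProxy, ClusterSingletonProxySettings}
import akka.pattern.ask
import akka.util.Timeout

import scala.concurrent.duration._

object StartupTaskAskSketch extends App {
  val system = ActorSystem("application")
  import system.dispatcher

  // Stand-in for Lagom's internal ClusterStartupTaskActor$Execute$ message.
  case object Execute

  // The proxy buffers messages until it has identified the singleton,
  // which runs on the oldest cluster member (Y in our scenario).
  val proxy = system.actorOf(
    ClusterSingletonProxy.props(
      singletonManagerPath = "/user/slickOffsetStorePrepare-singleton",
      settings = ClusterSingletonProxySettings(system)
    ),
    name = "slickOffsetStorePrepare-singletonProxy"
  )

  implicit val timeout: Timeout = Timeout(20.seconds)

  // If the proxy never locates the singleton, every ask fails with the
  // AskTimeoutException shown above; a retry loop around this would
  // explain the endless stream of identical errors on X.
  (proxy ? Execute).failed.foreach { e =>
    system.log.error(e, "Cluster startup task ask failed")
  }
}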
I probably don't understand Akka and Lagom well enough to gauge whether this is an Akka issue or a Lagom issue, but I'm currently leaning towards the latter, and guessing the problem might be in com.lightbend.lagom.internal.persistence.jdbc.SlickOffsetStore. We think we have a probable workaround: manage the database tables with Liquibase instead of relying on auto table creation (sketched below). Still, it would be good if this could be addressed.
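For reference, the shape of that workaround would be: turn off Lagom's automatic table creation, then create the offset table ourselves via Liquibase. Both snippets below are assumptions on my part - the DDL is my reading of the default read_side_offsets layout, and the changeset author/id pair is made up - so they should be checked against the schema Lagom actually generates for your database. In application.conf:

lagom.persistence.jdbc.create-tables.auto = false

And a Liquibase changeset in formatted-SQL style:

--liquibase formatted sql
--changeset myteam:create-read-side-offsets
CREATE TABLE read_side_offsets (
  read_side_id     VARCHAR(255) NOT NULL,
  tag              VARCHAR(255) NOT NULL,
  sequence_offset  BIGINT,
  time_uuid_offset CHAR(36),
  PRIMARY KEY (read_side_id, tag)
);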
Here’s the relevant cluster management config (there’s no other non-default config of relevance).
akka {
  actor.provider = "cluster"
  cluster {
    auto-down-unreachable-after = 30s
  }
  discovery {
    marathon-api.app-port-name = "akkamgmthttp"
    method = "akka-dns"
    method = ${?AKKA_DISCOVERY_METHOD}
  }
  management {
    cluster.bootstrap.contact-point-discovery.required-contact-point-nr = 1
    http {
      hostname = ${?HOST}
      port = 19999
      port = ${?PORT_AKKAMGMTHTTP}
      bind-hostname = 0.0.0.0
      bind-port = 19999
    }
  }
  remote {
    netty.tcp {
      hostname = ${?HOST}
      port = 2551
      port = ${?PORT_AKKAREMOTE}
      bind-hostname = 0.0.0.0
      bind-port = 2551
    }
  }
}
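One possibly unrelated observation on the config above: required-contact-point-nr = 1 lets a node start forming a cluster as soon as it has discovered a single contact point. If I understand the Akka Management docs correctly, they advise setting it to the number of nodes expected at initial startup, which for our 2-instance deployment would be:

akka.management.cluster.bootstrap.contact-point-discovery {
  # We always launch 2 instances initially. Untested whether this has
  # any effect on the timeout problem described above.
  required-contact-point-nr = 2
}

I haven't verified whether this changes anything for this issue.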
Top GitHub Comments
Thanks @TimMoore. FWIW, I tried to replicate the issue on 1.4.5 and was unable to do so after numerous attempts.
Sure, have attached a CSV file that hopefully has everything needed. It's currently sorted in ascending timestamp order, interleaving messages from both nodes. Node X is host_ip 10.0.10.214, and node Y is host_ip 10.0.10.147. Let me know if you need anything else.
Attachment: lagom_1336_logs.csv.gz