
Perpetual timeouts on slickOffsetStorePrepare cluster start task


Lagom 1.4.4 - Scala API
Ubuntu 16.04
Linux bd167f7a097e 4.13.0-36-generic #40-Ubuntu SMP Fri Feb 16 20:07:48 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
java version "1.8.0_152"
Java(TM) SE Runtime Environment (build 1.8.0_152-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.152-b16, mixed mode)

We run a 2-instance cluster of a Dockerised Lagom service on Marathon (DC/OS), using Marathon contact-point discovery (see config below). The problem occurs either when starting both instances for the first time, or when rolling in a new version of the service (2 old -> 2 old + 1 new -> 1 old + 1 new -> 1 old + 2 new -> 2 new). It does not happen every time, though; probably somewhere in the region of 1-5% of the time.

When this issue does occur, the symptom is that one of the two instances gets stuck (doesn’t process any events), and keeps logging this exception indefinitely:

akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://application/user/slickOffsetStorePrepare-singletonProxy#-965527677]] after [20000 ms]. Sender[null] sent message of type "com.lightbend.lagom.internal.persistence.cluster.ClusterStartupTaskActor$Execute$".
	at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:595)
	at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:605)
	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:140)
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:866)
	at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109)
	at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:864)
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
	at java.lang.Thread.run(Thread.java:748)
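
For context, the exception comes from Akka’s ask pattern: the node asks the slickOffsetStorePrepare singleton (via its proxy) to run the startup task and waits for an acknowledgement within 20 seconds. The sketch below only illustrates that mechanism under assumed names (Execute stands in for the internal ClusterStartupTaskActor message); it is not Lagom’s actual implementation:

import akka.actor.ActorRef
import akka.pattern.ask
import akka.util.Timeout

import scala.concurrent.Future
import scala.concurrent.duration._

object StartupTaskAskSketch {
  // Hypothetical stand-in for the internal ClusterStartupTaskActor$Execute$ message.
  case object Execute

  // Ask the singleton proxy to run the startup task. If the singleton never
  // replies within the timeout, the returned Future fails with an
  // AskTimeoutException like the one logged above.
  def execute(singletonProxy: ActorRef): Future[Any] = {
    implicit val timeout: Timeout = Timeout(20.seconds) // matches the 20000 ms in the log
    singletonProxy ? Execute
  }
}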

Consider the simplest case, in which two new instances start up alongside each other (i.e. they are launched at the same time). Let’s say Instance X is the one that hits the above error, and Instance Y is the one that starts up normally:

  1. X starts slightly before Y.
  2. X becomes aware of Y (via Marathon discovery) before Y even logs anything (we’re talking a few seconds gap though), and starts failing to talk to it (as you would expect).
  3. X doesn’t join itself because Y has a lower IP address, so it keeps trying to contact Y to find a seed node.
  4. Y joins itself (because it’s the contact point with the lowest IP).
  5. Y starts its singleton actors.
  6. Y executes cluster start tasks (like slickOffsetStorePrepare).
  7. X sees that Y is now a seed node and they both shake hands.
  8. X starts throwing errors like the one above (the singleton-proxy wiring involved is sketched after this list). Y couldn’t care less.

The above events are sequential, and cover approximately 25 seconds of elapsed time.
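
The actor path in the error (…/user/slickOffsetStorePrepare-singletonProxy) suggests the startup task runs as a cluster singleton behind a ClusterSingletonProxy. As a rough sketch of that wiring using the plain Akka API (not Lagom’s exact code), the proxy resolves the current singleton somewhere in the cluster and forwards messages to it; if the singleton on Y never responds to X’s ask, X times out exactly as in step 8:

import akka.actor.{ActorRef, ActorSystem}
import akka.cluster.singleton.{ClusterSingletonProxy, ClusterSingletonProxySettings}

object SingletonProxySketch {
  // Create a proxy to a singleton expected to live at /user/<name> on the oldest node.
  // The proxy buffers messages until it has identified the singleton instance,
  // and asks against it time out if no reply arrives.
  def proxyFor(system: ActorSystem, name: String): ActorRef =
    system.actorOf(
      ClusterSingletonProxy.props(
        singletonManagerPath = s"/user/$name",
        settings = ClusterSingletonProxySettings(system)
      ),
      name = s"$name-singletonProxy"
    )
}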

I probably don’t understand Akka and Lagom well enough to gauge whether this is an Akka issue or a Lagom issue, but I’m currently going with the latter, and guessing the issue might be in com.lightbend.lagom.internal.persistence.jdbc.SlickOffsetStore. We have a probable workaround in that we plan to manage the database tables with Liquibase instead of relying on auto table creation, but it would be good if this could be addressed.
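
If the Liquibase route is taken, the automatic table creation performed by this startup task can presumably be switched off. Assuming the Lagom 1.4 setting name lagom.persistence.jdbc.create-tables.auto (worth verifying against the reference.conf of the version actually in use), the override would look roughly like this:

# Hypothetical override, assuming this is the relevant setting in the Lagom
# version in use: hand table management to Liquibase so the
# slickOffsetStorePrepare task no longer needs to create the offset table.
lagom.persistence.jdbc.create-tables.auto = false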

Here’s the relevant cluster management config (there’s no other non-default config of relevance).

akka {
  actor.provider = "cluster"

  cluster {
    auto-down-unreachable-after = 30s
  }

  discovery {
    marathon-api.app-port-name = "akkamgmthttp"

    method = "akka-dns"
    method = ${?AKKA_DISCOVERY_METHOD}
  }

  management {

    cluster.bootstrap.contact-point-discovery.required-contact-point-nr = 1

    http {
      hostname = ${?HOST}
      port = 19999
      port = ${?PORT_AKKAMGMTHTTP}

      bind-hostname = 0.0.0.0
      bind-port = 19999
    }
  }

  remote {
    netty.tcp {
      hostname = ${?HOST}
      port = 2551
      port = ${?PORT_AKKAREMOTE}

      bind-hostname = 0.0.0.0
      bind-port = 2551
    }
  }

}
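
The repeated keys above rely on standard Typesafe Config (HOCON) behaviour: the last definition of a key wins, and a ${?VAR} substitution only takes effect when the referenced environment variable is set. As a simplified illustration (not taken from the issue itself):

# Default used when the environment variable is absent (e.g. local runs).
port = 2551
# Wins only if Marathon injects PORT_AKKAREMOTE; otherwise this optional
# substitution is dropped and 2551 remains in effect.
port = ${?PORT_AKKAREMOTE}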

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

1 reaction
bpiper commented, May 17, 2018

Thanks @TimMoore. FWIW, I tried to replicate the issue on 1.4.5 and was unable to do so after numerous attempts.

1 reaction
bpiper commented, May 7, 2018

lagom_1336_logs.csv.gz

Sure, have attached a CSV file that hopefully has everything needed. It’s currently sorted in ascending timestamp order, interleaving messages from both nodes. Node X is host_ip 10.0.10.214, and node Y is host_ip 10.0.10.147. Let me know if you need anything else.
