
Discussion: Automatic retries


At the moment, the client does not automate any retries. This issue is to capture the current behaviour of the client library in various failure scenarios - taken from here.

Points of failure client-broker over gRPC

There are three points of contact (a rough code sketch follows the list):

  1. Client initiating an operation on the broker
  • failJob
  • publishMessage
  • createWorkflowInstance
  • resolveIncident
  2. Worker activating jobs
  • activateJobs
  3. Worker completing / failing job
  • completeJob
  • failJob
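
For orientation, here is a rough sketch of that surface as a client sees it. The interface below is a stand-in written for this discussion, not the actual zeebe-node API; only the operation names are taken from the list above.

```typescript
// Illustrative stand-in for the client-broker gRPC surface. This interface is
// written for this sketch only; the real zeebe-node client looks different,
// and only the operation names are taken from the list above.
interface ZeebeGrpcClient {
  // 1. Client initiating an operation on the broker
  //    (currently throws if the broker is not contactable)
  failJob(jobKey: string, errorMessage: string): Promise<void>;
  publishMessage(message: { name: string; correlationKey: string }): Promise<void>;
  createWorkflowInstance(bpmnProcessId: string, variables: object): Promise<unknown>;
  resolveIncident(incidentKey: string): Promise<void>;

  // 2. Worker activating jobs (the polling loop already behaves like a retry)
  activateJobs(jobType: string, maxJobsToActivate: number): Promise<unknown[]>;

  // 3. Worker completing / failing a job
  //    (currently no retry; can surface Error: 14 UNAVAILABLE mid-job)
  completeJob(jobKey: string, variables: object): Promise<void>;
}
```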

Current Failure Modes

Looking at the current behaviour of each one:

  1. Client initiating an operation on the broker.
  • If the client is started and the broker is not contactable, any operations will throw.
  • If the broker becomes available, operations succeed.
  • If the broker goes away, operations throw again.
  • No automatic retries.
  2. Worker activating jobs
  • If the worker is started and the broker is not contactable, an error is printed to the console.
  • If the broker becomes available, the worker activates jobs.
  • If the broker goes away, no error is thrown or printed.
  • If the broker comes back, jobs are activated.
  • In this case, the worker polling is an automatic retry.
  3. Worker completing jobs
  • If the broker goes away after the worker has taken a job, the worker throws Error: 14 UNAVAILABLE when it attempts to complete the job.
  • No automatic retry (a manual retry sketch follows this list).
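
As a concrete illustration of the third case, this is roughly what a hand-rolled retry around job completion could look like today. `completeJob` is a placeholder for whatever call the worker handler makes, and the attempt count and delay are arbitrary.

```typescript
// Hypothetical manual retry around job completion (not part of the current
// client). `completeJob` stands in for whatever call the worker handler makes;
// attempt count and delay are arbitrary.
async function completeWithRetry(
  completeJob: () => Promise<void>,
  maxAttempts = 3,
  delayMs = 500,
): Promise<void> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await completeJob();
    } catch (err) {
      // gRPC status code 14 is UNAVAILABLE - the transient case worth retrying.
      const retryable = (err as { code?: number }).code === 14;
      if (!retryable || attempt >= maxAttempts) {
        throw err;
      }
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```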

Ways the broker could go away / be unavailable:

  • Broker address misconfigured.
  • Transient network failure.
  • Broker under excessive load.

Broker address misconfigured
This is an unrecoverable hard failure. Retries will not fix this.

Transient network failure
Some temporary disruption in connectivity between worker and broker. This could include a broker restarting or (potentially) a change in DNS (have to test this). A retry will deal with this case if the transient network failure is fixed before the retries time out.

Broker is under excessive load and cannot respond
In this case, retries may actually make it worse. Zeebe is horizontally scalable, but I have driven it to failure on a single node by pumping in a massive number of workloads when it is memory starved (I can kill it with 2GB of memory, but haven't yet with 4GB) or when it runs out of disk space (a slow exporter with high throughput can do this). Automated retries will not recover any of these situations.

If the broker is experiencing excessive load because of a traffic spike, then automated retries may drive it to failure, whereas workers failing to complete tasks once and letting the broker reschedule them may allow the broker to recover.
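
If retries were added, this argues for bounding them and spreading them out rather than retrying immediately. A minimal sketch of capped exponential backoff with full jitter (all numbers illustrative):

```typescript
// Sketch of capped exponential backoff with full jitter: bounded attempts, and
// randomised delays so a fleet of workers does not retry in lock-step against
// a broker that is already struggling. All numbers are illustrative.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 200,
  maxDelayMs = 5000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      const delayMs = Math.random() * ceiling; // full jitter
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

The jitter is the point: workers that all lost the broker at the same moment will not all come back and retry at the same moment.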

Other failure modes not distinguished
The as-yet unknown unknowns. Any ideas?


Conclusions

I’m not yet sure that automatic retry is (a) necessary; (b) a good idea.

The transient network failure seems to be the only case where a retry helps. I'm not sure how much of an issue that is in a real system, whether it warrants complicating the code, or whether it is worth the potential downside of hammering a broker that is already under excessive load (retries will be ineffective if it is a hard failure, and could contribute to a hard failure if it is a spike).


I’m open to more data on this, but I don’t have a case for implementing retries yet.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
jwulf commented, Jun 3, 2019

Noting for future reference another Node gRPC library that provides a simple linear retry: grpc-caller

It uses the same interface as async.retry.

Could be a drop-in replacement for node-grpc-client, which would give zeebe-node simple retry for free.
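
For reference, the async.retry interface mentioned above looks like this. `someGrpcCall` is a placeholder for any callback-style gRPC operation; the exact retry options grpc-caller accepts should be checked against its README.

```typescript
import * as async from 'async';

// Placeholder for any callback-style gRPC operation that may fail transiently.
declare function someGrpcCall(
  request: { jobKey: number },
  callback: (err: Error | null, result?: unknown) => void,
): void;

// Retry up to 3 times, waiting 200 ms between attempts.
async.retry(
  { times: 3, interval: 200 },
  (callback) => someGrpcCall({ jobKey: 123 }, callback),
  (err, result) => {
    if (err) {
      // Still failing after 3 attempts - give up and surface the error.
      console.error('call failed after retries', err);
      return;
    }
    console.log('completed', result);
  },
);
```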

1 reaction
s3than commented, Jun 3, 2019

I'd also look at implementing a leaky bucket so that it's not a straight retry, which would resolve the problem of hammering the broker.

Depending on how you want to perform the load balancing, though, would introducing it client-side bring you back to complicating the code?
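
For illustration, a minimal client-side leaky bucket could look something like this; the capacity and leak rate are made-up numbers and this isn't tied to any particular library.

```typescript
// Minimal leaky-bucket sketch: retries drain out at a fixed rate, so a burst
// of failures never becomes a burst of retry traffic against the broker.
class LeakyBucket {
  private level = 0;
  private lastLeak = Date.now();

  constructor(
    private readonly capacity: number, // max retries that can be queued
    private readonly leakPerSecond: number, // rate at which retries are released
  ) {}

  // Returns true if the retry may proceed now, false if it should be dropped.
  tryAcquire(): boolean {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastLeak) / 1000;
    this.level = Math.max(0, this.level - elapsedSeconds * this.leakPerSecond);
    this.lastLeak = now;
    if (this.level + 1 > this.capacity) {
      return false;
    }
    this.level += 1;
    return true;
  }
}

// Usage: bursts of up to 10 retries are absorbed, draining at ~2 per second.
const retryBucket = new LeakyBucket(10, 2);
if (retryBucket.tryAcquire()) {
  // safe to retry the failed operation
}
```

Failures beyond the bucket's capacity are dropped rather than retried, which is the property that stops a burst of failures turning into a burst of retries.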
