Discussion: Automatic retries
At the moment, the client does not automate any retries. This issue is to capture the current behaviour of the client library in various failure scenarios - taken from here.
Points of failure: client-broker communication over gRPC
There are three points of contact (sketched in code below):
- Client initiating an operation on the broker
  - failJob
  - publishMessage
  - createWorkflowInstance
  - resolveIncident
- Worker activating jobs
  - activateJobs
- Worker completing / failing jobs
  - completeJob
  - failJob
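For orientation, here is a rough sketch of where each of these calls sits in application code, assuming the zeebe-node `ZBClient`. The broker address, process id, payloads, and worker handler shape are illustrative only, and exact signatures differ between client versions.

```typescript
import { ZBClient } from 'zeebe-node'

// Illustrative only - address, process id and payloads are placeholders.
const zbc = new ZBClient('localhost:26500')

async function main() {
  // 1. Client initiating operations on the broker
  await zbc.createWorkflowInstance('order-process', { orderId: '123' })
  await zbc.publishMessage({
    name: 'payment-received',
    correlationKey: '123',
    variables: { amount: 100 },
    timeToLive: 10000,
  })

  // 2. Worker activating jobs - activateJobs is driven by the worker's
  //    internal polling loop, not called directly by application code.
  // 3. Worker completing / failing jobs - issued from the task handler.
  zbc.createWorker('worker-1', 'payment-service', (job, complete) => {
    // ... do the work ...
    complete.success()
  })
}

main().catch(console.error)
```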
Current Failure Modes
Looking at the current behaviour of each one:
- Client initiating an operation on the broker
  - If the client is started and the broker is not contactable, any operation will throw.
  - If the broker becomes available, operations succeed.
  - If the broker goes away, operations throw again.
  - No automatic retries.
- Worker activating jobs
  - If the worker is started and the broker is not contactable, an error is printed to the console.
  - If the broker becomes available, the worker activates jobs.
  - If the broker goes away, no error is thrown or printed.
  - If the broker comes back, jobs are activated.
  - In this case, the worker polling is an automatic retry.
- Worker completing jobs
  - If the broker goes away after the worker has taken a job, the worker throws `Error: 14 UNAVAILABLE` when it attempts to complete the job (see the sketch after this list).
  - No automatic retry.
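To make the "no automatic retry" case concrete, this is roughly what calling code sees today. The error-code check assumes the thrown error carries the standard gRPC status code, which should be confirmed against the client version in use.

```typescript
import { ZBClient } from 'zeebe-node'

const zbc = new ZBClient('localhost:26500')

async function startInstance() {
  try {
    // Throws if the broker is not contactable - no retry happens inside the client.
    await zbc.createWorkflowInstance('order-process', { orderId: '123' })
  } catch (err: any) {
    // In the failure mode above, err.message starts with "14 UNAVAILABLE".
    if (err.code === 14) {
      // The caller has to decide: surface the error, retry manually, or drop the work.
    }
    throw err
  }
}
```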
Ways the broker could go away / be unavailable:
- Broker address misconfigured.
- Transient network failure.
- Broker under excessive load.
Broker address misconfigured
This is an unrecoverable hard failure. Retries will not fix this.
Transient network failure
Some temporary disruption in connectivity between worker and broker. This could include a broker restarting or (potentially) a change in DNS (this needs testing). A retry will deal with this case if the transient network failure is fixed before the retries time out.
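As an illustration of what would cover the transient case, here is a hand-rolled linear retry wrapper. The `withRetry` function, `maxAttempts`, and `delayMs` are made up for this sketch and are not part of the client.

```typescript
// Hypothetical wrapper - not part of zeebe-node.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 5,
  delayMs = 1000
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation()
    } catch (err) {
      lastError = err
      // Wait and try again - this only helps if the disruption clears
      // before the attempts are exhausted.
      await new Promise((resolve) => setTimeout(resolve, delayMs))
    }
  }
  throw lastError
}

// Usage: masks a broker restart, does nothing for a misconfigured address.
// await withRetry(() => zbc.createWorkflowInstance('order-process', { orderId: '123' }))
```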
Broker is under excessive load and cannot respond
In this case, retries may actually make things worse. Zeebe is horizontally scalable, but I have driven a single node to failure by pumping in a massive number of workloads when it is memory-starved (I can kill it with 2GB of memory, but haven't yet with 4GB) or when it runs out of disk space (a slow exporter with high throughput can do this). Automated retries will not recover any of these situations.
If the broker is experiencing excessive load because of a traffic spike, then automated retries may drive it to failure, whereas workers failing to complete tasks once and letting the broker reschedule them may allow the broker to recover.
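For contrast with client-side retries, here is a sketch of the "let the broker reschedule it" approach inside a worker handler. It assumes the older `(job, complete)` handler shape and a hypothetical `chargeCard` application function.

```typescript
import { ZBClient } from 'zeebe-node'

const zbc = new ZBClient('localhost:26500')

// chargeCard is a stand-in for whatever the worker actually does.
declare function chargeCard(variables: unknown): Promise<void>

zbc.createWorker('worker-1', 'charge-card', async (job, complete) => {
  try {
    await chargeCard(job.variables)
    complete.success()
  } catch (err) {
    // Failing the job hands it back to the broker, which decrements the job's
    // retries and re-activates it later, rather than the worker hammering a
    // possibly overloaded broker with immediate retries of its own.
    complete.failure((err as Error).message)
  }
})
```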
Other failure modes not distinguished
The as-yet unknown unknowns. Any ideas?
Conclusions
I’m not yet sure that automatic retry is (a) necessary; (b) a good idea.
The transient network failure seems to be the only case where retries help. I'm not sure how much of an issue that is in an actual system, whether it warrants complicating the code, or whether it justifies the potential downside of hammering a broker that is experiencing excessive load (retries will be ineffective if it is a hard failure, and could contribute to a hard failure if it is a spike).
I’m open to more data on this, but I don’t have a case for implementing retries yet.
Top GitHub Comments
Noting for future reference another Node gRPC library that provides a simple linear retry: grpc-caller
It uses the same interface as async.retry.
Could be a drop-in replacement for node-grpc-client, which would give zeebe-node simple retry for free.
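For reference, the async.retry interface mentioned above looks roughly like this when wrapped around an illustrative client call; whether grpc-caller exposes exactly these options should be checked against its documentation.

```typescript
import { retry } from 'async'
import { ZBClient } from 'zeebe-node'

const zbc = new ZBClient('localhost:26500')

// Linear retry: up to 5 attempts, with a fixed 200ms between attempts.
retry(
  { times: 5, interval: 200 },
  (callback) => {
    zbc
      .topology() // illustrative operation; any unary call fits here
      .then((res) => callback(null, res))
      .catch((err) => callback(err))
  },
  (err, result) => {
    if (err) {
      // All attempts failed - surface the error to the application.
      console.error('operation failed after retries', err)
      return
    }
    console.log('succeeded', result)
  }
)
```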
I'd also look at implementing a leaky bucket so that you can ensure it's not a straight retry, which would address the concern about hammering the broker.
Depending on how you want to perform load balancing, though, would introducing it client-side bring you back to complicating the code?
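A rough sketch of the leaky-bucket idea: retries are queued and drained at a fixed rate, so a burst of failures cannot become a burst of retries. The class, capacity, and drain interval are illustrative only, not an existing API.

```typescript
// Illustrative leaky bucket - not an existing API in the client.
class LeakyBucket {
  private queue: Array<() => void> = []

  constructor(drainIntervalMs: number, private capacity: number) {
    // Drain at a fixed rate regardless of how fast retries are scheduled.
    setInterval(() => {
      const next = this.queue.shift()
      if (next) next()
    }, drainIntervalMs)
  }

  schedule(task: () => void): boolean {
    if (this.queue.length >= this.capacity) {
      return false // bucket full: drop the retry rather than pile on
    }
    this.queue.push(task)
    return true
  }
}

// At most one retry every 500ms, at most 100 retries queued at once.
const retryBucket = new LeakyBucket(500, 100)
retryBucket.schedule(() => {
  // re-issue the failed gRPC operation here
})
```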