Discussion: Automatic retries
At the moment, the client does not automate any retries. This issue is to capture the current behaviour of the client library in various failure scenarios - taken from here.
Points of failure: client-broker communication over gRPC
There are three points of contact (sketched in code below):
- Client initiating an operation on the broker
  - failJob
  - publishMessage
  - createWorkflowInstance
  - resolveIncident
- Worker activating jobs
  - activateJobs
- Worker completing / failing jobs
  - completeJob
  - failJob
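For orientation, here is a rough sketch of where each of these calls sits in application code, assuming the zeebe-node `ZBClient`. The broker address, process id, payloads, and worker handler shape are illustrative only, and exact signatures differ between client versions.

```typescript
import { ZBClient } from 'zeebe-node'

// Illustrative only - address, process id and payloads are placeholders.
const zbc = new ZBClient('localhost:26500')

async function main() {
  // 1. Client initiating operations on the broker
  await zbc.createWorkflowInstance('order-process', { orderId: '123' })
  await zbc.publishMessage({
    name: 'payment-received',
    correlationKey: '123',
    variables: { amount: 100 },
    timeToLive: 10000,
  })

  // 2. Worker activating jobs - activateJobs is driven by the worker's
  //    internal polling loop, not called directly by application code.
  // 3. Worker completing / failing jobs - issued from the task handler.
  zbc.createWorker('worker-1', 'payment-service', (job, complete) => {
    // ... do the work ...
    complete.success()
  })
}

main().catch(console.error)
```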
Current Failure Modes
Looking at the current behaviour of each one:
- Client initiating an operation on the broker
  - If the client is started and the broker is not contactable, any operation will throw.
  - If the broker becomes available, operations succeed.
  - If the broker goes away, operations throw again.
  - No automatic retries.
- Worker activating jobs
  - If the worker is started and the broker is not contactable, an error is printed to the console.
  - If the broker becomes available, the worker activates jobs.
  - If the broker goes away, no error is thrown or printed.
  - If the broker comes back, jobs are activated.
  - In this case, the worker polling is an automatic retry.
- Worker completing jobs
  - If the broker goes away after the worker has taken a job, the worker throws `Error: 14 UNAVAILABLE` when it attempts to complete the job (see the sketch after this list).
  - No automatic retry.
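To make the "no automatic retry" case concrete, this is roughly what calling code sees today. The error-code check assumes the thrown error carries the standard gRPC status code, which should be confirmed against the client version in use.

```typescript
import { ZBClient } from 'zeebe-node'

const zbc = new ZBClient('localhost:26500')

async function startInstance() {
  try {
    // Throws if the broker is not contactable - no retry happens inside the client.
    await zbc.createWorkflowInstance('order-process', { orderId: '123' })
  } catch (err: any) {
    // In the failure mode above, err.message starts with "14 UNAVAILABLE".
    if (err.code === 14) {
      // The caller has to decide: surface the error, retry manually, or drop the work.
    }
    throw err
  }
}
```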
Ways the broker could go away / be unavailable:
- Broker address misconfigured.
- Transient network failure.
- Broker under excessive load.
Broker address misconfigured
This is an unrecoverable hard failure. Retries will not fix this.
Transient network failure
Some temporary disruption in connectivity between worker and broker. This could include a broker restarting or (potentially) a change in DNS (this needs testing). A retry will deal with this case if the transient network failure is fixed before the retries time out.
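As an illustration of what would cover the transient case, here is a hand-rolled linear retry wrapper. The `withRetry` function, `maxAttempts`, and `delayMs` are made up for this sketch and are not part of the client.

```typescript
// Hypothetical wrapper - not part of zeebe-node.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 5,
  delayMs = 1000
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation()
    } catch (err) {
      lastError = err
      // Wait and try again - this only helps if the disruption clears
      // before the attempts are exhausted.
      await new Promise((resolve) => setTimeout(resolve, delayMs))
    }
  }
  throw lastError
}

// Usage: masks a broker restart, does nothing for a misconfigured address.
// await withRetry(() => zbc.createWorkflowInstance('order-process', { orderId: '123' }))
```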
Broker is under excessive load and cannot respond
In this case, retries may actually make things worse. Zeebe is horizontally scalable, but I have driven a single node to failure by pumping in a massive number of workloads when it is memory-starved (I can kill it with 2GB of memory, but haven't yet with 4GB) or when it runs out of disk space (a slow exporter with high throughput can do this). Automated retries will not recover any of these situations.
If the broker is experiencing excessive load because of a traffic spike, then automated retries may drive it to failure, whereas workers failing to complete tasks once and letting the broker reschedule them may allow the broker to recover.
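For contrast with client-side retries, here is a sketch of the "let the broker reschedule it" approach inside a worker handler. It assumes the older `(job, complete)` handler shape and a hypothetical `chargeCard` application function.

```typescript
import { ZBClient } from 'zeebe-node'

const zbc = new ZBClient('localhost:26500')

// chargeCard is a stand-in for whatever the worker actually does.
declare function chargeCard(variables: unknown): Promise<void>

zbc.createWorker('worker-1', 'charge-card', async (job, complete) => {
  try {
    await chargeCard(job.variables)
    complete.success()
  } catch (err) {
    // Failing the job hands it back to the broker, which decrements the job's
    // retries and re-activates it later, rather than the worker hammering a
    // possibly overloaded broker with immediate retries of its own.
    complete.failure((err as Error).message)
  }
})
```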
Other failure modes not distinguished
The as-yet unknown unknowns. Any ideas?
Conclusions
I’m not yet sure that automatic retry is (a) necessary; (b) a good idea.
The transient network failure seems to be the only case where retries help. I'm not sure how much of an issue that is in an actual system, whether it warrants complicating the code, or whether it justifies the potential downside of hammering a broker that is experiencing excessive load (retries will be ineffective if it is a hard failure, and could contribute to a hard failure if it is a spike).
I’m open to more data on this, but I don’t have a case for implementing retries yet.
Top GitHub Comments
Noting for future reference another Node gRPC library that provides a simple linear retry: grpc-caller
It uses the same interface as async.retry.
Could be a drop-in replacement for node-grpc-client, which would give zeebe-node simple retry for free.
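For reference, the async.retry interface mentioned above looks roughly like this when wrapped around an illustrative client call; whether grpc-caller exposes exactly these options should be checked against its documentation.

```typescript
import { retry } from 'async'
import { ZBClient } from 'zeebe-node'

const zbc = new ZBClient('localhost:26500')

// Linear retry: up to 5 attempts, with a fixed 200ms between attempts.
retry(
  { times: 5, interval: 200 },
  (callback) => {
    zbc
      .topology() // illustrative operation; any unary call fits here
      .then((res) => callback(null, res))
      .catch((err) => callback(err))
  },
  (err, result) => {
    if (err) {
      // All attempts failed - surface the error to the application.
      console.error('operation failed after retries', err)
      return
    }
    console.log('succeeded', result)
  }
)
```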
I'd also look at implementing a leaky bucket so that you can ensure it's not a straight retry, which would address the concern about hammering the broker.
Depending on how you want to perform load balancing, though, would introducing it client-side bring you back to complicating the code?
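A rough sketch of the leaky-bucket idea: retries are queued and drained at a fixed rate, so a burst of failures cannot become a burst of retries. The class, capacity, and drain interval are illustrative only, not an existing API.

```typescript
// Illustrative leaky bucket - not an existing API in the client.
class LeakyBucket {
  private queue: Array<() => void> = []

  constructor(drainIntervalMs: number, private capacity: number) {
    // Drain at a fixed rate regardless of how fast retries are scheduled.
    setInterval(() => {
      const next = this.queue.shift()
      if (next) next()
    }, drainIntervalMs)
  }

  schedule(task: () => void): boolean {
    if (this.queue.length >= this.capacity) {
      return false // bucket full: drop the retry rather than pile on
    }
    this.queue.push(task)
    return true
  }
}

// At most one retry every 500ms, at most 100 retries queued at once.
const retryBucket = new LeakyBucket(500, 100)
retryBucket.schedule(() => {
  // re-issue the failed gRPC operation here
})
```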