question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

OPTION/PROPOSAL: Distributed Circuit Breaker

See original GitHub issue

Detailed architecture proposal for a Distributed Circuit Breaker. TL;DR summary will also follow.

Background: What is a Distributed Circuit Breaker?

In a distributed system with multiple instances of an application or service A, there could be an advantage in sharing information about circuit state between instances of upstream system A which all call downstream system B.

The intention is that if the circuit-breaker in node A1 governing calls to system B detects “system B is down” (or at least, breaks), that knowledge could be shared to nodes A2, A3 etc, such that they could also break their circuits governing calls to B if deemed appropriate.

The need for quorum decisions

@reisenberger (and others) observed that a simple implementation - in which all upstream nodes A1, A2, A3 etc simply share the same circuit state - risks a catastrophic failure mode.

The assumption of a simple implementation like this is that if the circuit in A1 governing calls to B breaks, this “means” system B is down. But that is only one of three broad causes for the circuit in A1 governing calls to B breaking. Possibilities include:

(1) node A1 has some internal problem of its own (eg resource starvation, possibly due to unrelated causes) (2) the network path between node A1 and B is faulting (perhaps only locally; other network paths to B may not be affected) (3) system B is down.

In scenarios (1) and (2), if A1 wrongly tells A2, A3, A4 etc that they cannot talk to B (when in fact they can), that would entirely negate the redundancy value of having those horizontally scaled nodes A2, A3 etc … a (self-induced) catastrophic failure.

As commented here, a quorum approach can mitigate this concern.

An extra dependency in the mix

Co-ordinating circuit state between upstream nodes A1, A2, A3 etc also introduces an extra network dependency, which may itself suffer failures or latency.

  • Implementations should minimise extra latency added to individual calls due to co-ordinating state.
  • Implementations must cope with state co-ordination failing.

(Note: I have since discovered the Hystrix team echo these concerns.)

Why then a Distributed Circuit Breaker?

The premise is that when state co-ordination works well - and where empirically in some system, the causes of such circuits breaking are predominantly cause (3) not (1) or (2) - then a distributed circuit breaker may add value.

Without a distributed circuit, upstream nodes A1, A2, A3 etc may each have to endure a number of slow timeouts (say), before they decide to break. With a distributed circuit, knowledge of failing to reach B can be shared, so that other upstream A-nodes may elect to break earlier and fail faster.

The classic application would be in horizontally-scaled microservices systems.


Proposed architecture for DistributedCircuitBreaker

Let us use the terminology DistributedCircuitBreaker (for the distributed collection of circuit breakers acting in concert) and DistributedCircuitBreakerNode for a local circuit-breaker node within that.

DistributedCircuitNodeState

A DistributedCircuitNodeState instance would hold the last-known state for an individual node:

DistributedCircuitNodeState
{
    string DistributedCircuitKey; // identifies which distributed circuit the node is a member of
    string DistributedCircuitNodeKey; // uniquely identifies the particular node
    CircuitState LastState; // last-known circuit state of the node
    DateTimeOffset LastContact; // when node last made contact to the state mediator - helps detect lapsed nodes 
    DateTimeOffset LastLocallyBrokenUntil; // for the last local break of this circuit, when that break was due to expire.  Protects against the case a node fails to make contact again (at all, or in a timely manner), after breaking. 
}

IDistributedCircuitStateMediator

An IDistributedCircuitStateMediator implementation would manage sharing knowledge of node states. Implementations would be in new nuget packages outside the main Polly package (eg Polly.DistributedCircuitStateMediator.Redis), to prevent the main Polly package taking dependencies.

IDistributedCircuitStateMediator
{
    IEnumerable<DistributedCircuitNodeState> GetDistributedNodeStates(string distributedCircuitKey, string requestingNodeKey); 
    void UpdateNodeState(string distributedCircuitKey, string nodeKey, CircuitState localState, DateTimeOffset timestamp, DateTimeOffset? brokenLocallyUntil);

    // And possibly
    Task<IEnumerable<DistributedCircuitNodeState>> GetDistributedNodeStatesAsync(string distributedCircuitKey, string requestingNodeKey); 
    Task UpdateNodeStateAsync(string distributedCircuitKey, string nodeKey, CircuitState localState, DateTimeOffset timestamp, DateTimeOffset? brokenLocallyUntil);
}

Note: have avoided calling this ‘state store’, as implementations may not always be a store (eg async messaging). However, thinking of a state store may be easier for quick grokking. Suggestions for other names welcome!

IDistributedCircuitStateArbiter

Given knowledge of the states of nodes active in the distributed circuit, an IDistributedCircuitStateArbiter would take the quorum decision on whether a ‘distributed break’ should occur (all nodes should break).

IDistributedCircuitStateArbiter
{
    bool ShouldDistributedBreak(IEnumerable<DistributedCircuitNodeState> localNodeStates); 
}

Implementations might be, for example:

SimpleCountDistributedCircuitArbiter : IDistributedCircuitStateArbiter // distributed-breaks if minimum N nodes locally broken

ProportionDistributedCircuitArbiter : IDistributedCircuitStateArbiter // distributed-breaks if proportion of nodes locally broken

Separation of concerns between IDistributedCircuitStateMediator and IDistributedCircuitStateArbiter should allow clean decoupled implementations and unit-testing.

DistributedCircuitController: How local nodes take account of distributed circuit state

Local nodes must govern calls both according to their own (local) breaker state/statistics, and distributed circuit state.

The CircuitController is the existing element which controls circuit behaviour (‘Should this call be allowed to proceed?’; ‘Does the result of this call mean the circuit needs to transition?’). Decorating the local CircuitController with a DistributedCircuitController fits our needs:

  • Where the CircuitController instance would check if a call through the circuit is allowed to proceed, the DistributedCircuitController would intercept and, additionally (each call; or at configurable intervals) consult the IDistributedCircuitStateMediator and IDistributedCircuitStateArbiter to see if the local node should break due to distributed state.

  • Where the local CircuitController transitions to open, due to local statistics, the DistributedCircuitController would inform the IDistributedCircuitStateMediator of its break, and when it is locally broken until.

  • Where the local CircuitController transitions back to Closed, the DistributedCircuitController would/could intercept that and also inform the IDistributedCircuitStateMediator.

  • (Question: Do other local circuit states/transitions need communicating to the central IDistributedCircuitStateMediator too, or not? For the purposes of controlling distributed breaking, we may only be interested in whether nodes are broken due to local causes and until when; we don’t care about other states. However, in a distributed circuit environment, the IDistributedCircuitStateMediator could make a useful central info source for node states, for dashboarding, if it tracks all states.)

Refactoring the existing CircuitController may be needed to expose methods which can be intercepted, for all these events.

Using the decorator pattern also meets other requirements:


Preventing self-perpetuating broken states

Code needs to distinguish whether a circuit is broken due to local events, or to distributed circuit state. This probably implies a new state, CircuitState.DistributedBroken.

Consider that without this, it would be possible to create a system which engendered self-perpetuating broken states. For example, a rule “break all nodes if 50% are broken” would, when triggered, lead to 100% broken. One node recovering might lead to (say) 90% broken, which is still >50%, so the node breaks again … looping back to (permanently) 100% broken.


Handling CircuitState.Isolated

Should CircuitState.Isolated influence distributed state? (cause other nodes to break?). Likely not - users have more control, if they can isolate individual nodes separately.


Enrolment syntax

A syntax for enrolling a local circuit-breaker in a distributed circuit is needed. Examples:

Static method:

DistributedCircuitBreaker distributedBreaker = Policy.DistributedCircuitBreaker(
    CircuitBreakerPolicy localCircuitBreaker, 
    string DistributedCircuitKey, 
    string DistributedCircuitNodeKey, 
    IDistributedCircuitStateMediator mediator, 
    IDistributedCircuitStateArbiter arbiter)

Fluent postfix:

DistributedCircuitBreaker distributedBreaker = localCircuitBreaker.InDistributedCircuit(
    string DistributedCircuitKey, 
    string DistributedCircuitNodeKey,  
    IDistributedCircuitStateMediator mediator, 
    IDistributedCircuitStateArbiter arbiter)

A further post on possible implementations/patterns for IDistributedCircuitStateMediator will follow…

Issue Analytics

  • State:closed
  • Created 6 years ago
  • Reactions:6
  • Comments:35 (17 by maintainers)

github_iconTop GitHub Comments

10reactions
reisenbergercommented, Aug 17, 2017

Thank you @mcquiggd for your comments. It looks as if elements of the proposal may have come across not as I intended; I’ll aim to clarify. The comments on resilience in Azure also are great perspectives to have; thank you. There is no intention that Polly should overlap with any of that; we see the role of Polly in architectures also that Polly only provides resilience primitives.

Polly users were asking about a quite specific scenario; I’ll aim to focus discussion back on that, if only to move it forward (but light shone from other angles of course always welcome!).

Summary

  • The DistributedCircuitBreaker users asked about is about multiple upstream callers sharing knowledge of circuit-state when calling a single downstream system.
  • If a DistributedCircuitBreaker models only a single shared state, it effectively treats any one upstream node breaking as cause enough to break all other nodes. This risks becoming a resilience anti-pattern. It risks inappropriately promoting an issue local only to one node/caller to a global/more widespread problem.
    • Circuits can break because of issues local only to the calling system or network path from that calling system; it’s rare, but it happens.
    • When it does happen, a DistributedCircuitBreaker which uses only a single shared state across all upstream callers would cause a catastrophic cascading failure. It’s catastrophic because a single quirk of one caller could (by inappropriately breaking the circuit) cut off all others.
  • The essence of the proposal is that, to avoid this, users can configure the number or proportion of callers breaking, among the set, that are deemed sufficient to cause a distributed break.
  • This can use very simple quorum logic:
    • distributed-break if >=N nodes broken independently; or
    • distributed-break if given proportion of nodes have broken independently.
  • The “one bad node poisons other good nodes” problem is a problem in principle for any DistributedCircuitBreaker which only models a single shared circuit state across upstream nodes. In practice for any given system, one might decide that experience across upstream nodes would be sufficiently consistent, to choose not to engineer for this. But as a library, we don’t have the luxury of knowing the user’s system.

What problem are we trying to solve?

Polly users asked here

I would like to share circuit breaker state among all instances of the application.

and here

i have multiple containers running and if one container has tried calling service and it has tripped the circuit then i don’t want other containers to call the service again which is tripped already by one container

and here

the type of things you are circuit breaking (external services) are highly likely to be used across processes. Unless I’ve completely missed it, I don’t think Polly handles this scenario

and I am understanding these requests to mean we are considering a scenario where multiple upstream services/apps/nodes all call a single downstream dependency.

Diagramatically:

multiple upstream systems calling one downstream system

(The downstream system may have redundancy/horiz-scaling too, but let’s consider that is hidden behind load-balancing/Azure Traffic Manager/whatever, so we are addressing via a single endpoint.)

(We could also be talking about multiple dissimilar upstream systems A, B, C, D etc all calling M.)

With current Polly, each upstream node/system would have an independent circuit-breaker:

multiple upstream nodes calling downstream with independent circuit breakers

I am understanding that Polly users are asking for some way for those circuit-breakers to share state or share knowledge of each other’s state, such that they would (or could choose whether to) break in common.

Why a single shared circuit state can be dangerous: the “one bad node poisons other good nodes” problem

One solution could use a single shared breaker state:

(the breakers exist as independent software elements in each upstream node/system; the solid box round them here is intended to illustrate they all consult/operate to the same shared state)

multiple upstream nodes using single shared circuit breaker state

and the benefit is that if one upstream caller “knows” the downstream system M is down, the other callers immediately share that knowledge (probably failing faster than if they were fully independent):

[*1]

multiple upstream nodes calling downstream with single shared circuit breaker downstream system is down

As however noted in the original proposal and linked threads (here, here, here) and others also noted, an approach with a single shared circuit state also risks (and by definition embeds) a catastrophic failure mode.

If the reason for the circuit-of-A1-governing-M breaking is not that downstream system M is down, but instead a problem local only to A1 (eg resource starvation in A1 affecting those calls) or to the path between A1 and M, then the single shared circuit state inappropriately cuts off all the other nodes/upstream systems from communicating with M:

[*2]

multiple upstream nodes calling downstream with single shared circuit breaker single bad upstream node

(M was healthy (green). Upstream A2, A3 etc were healthy (green) and had good paths (green) to M giving green local circuit statistics. Without the shared circuit state, the system would have had the redundancy benefit of A2, A3 etc still supporting healthy calls to M. However, because A1 has told the single shared circuit state to break, A2, A3 etc cannot communicate to M, and the redundancy value of A2, A3 etc is thrown away.)

This is a resilience anti-pattern: the single shared circuit state approach inappropriately promotes a localised problem to a global/more widespread one, causing a (horizontally) cascading failure.

I call this the “one bad node poisons other good nodes” problem.

Any approach taking the word of a single source as enough evidence that the downstream system is down, risks this: a failure at a single source will block all consumers.

It’s rare, but it can happen. Nodes go bad. We all know the fallacies of distributed computing. The problem is not that it happens very often, but that if/when it does, the single-shared-state design (yoking upstream callers together so tightly) has such a catastrophic effect.

It’s a bit like journalism: don’t trust a story from only one source; corroborate it with others before acting.

(So the simplification proposed in your point 3 @mcquiggd in principle embeds this risk. That would have been the original proposal, were it not for this risk. Of course the simplification may well be the more appropriate solution, if in a given system you judge the risk to be acceptable.)

Without the single shared circuit state, we would have continued to have the redundancy benefit of the horizontal scaling:

[*3]

multiple upstream nodes calling downstream with separate circuit breakers - problem only with one node

So: Can we fashion a solution which gives the benefits of [*1] and [*3], but avoids the failure of [*2]?


‘Crowd sourcing’ whether to break - some simple quorum logic

The essence of the proposal is that we let users configure the number or proportion of nodes breaking, among the set, that are deemed sufficient to cause a distributed break. This creates a non-blunt instrument. Consistency of experience across callers gives enough confidence that the real-world happening is a downstream system failure, not something local to a particular caller.

Rather than seeking to impose a single shared truth model, the approach models the real-world complexity of multiple truths (that different callers might have different experiences of their call success to M). And provides a user-configurable way to negotiate that.

Nodes tell the mediator when their local circuit transitions state. And nodes can ask the mediator the state of all nodes in the set.

Based on this, the arbiter then embeds some extremely simple quorum logic (only requires addition and division):

  • distributed-break if minimum N callers broken independently; OR
  • distributed-break if given proportion of callers have broken independently

This ‘crowd sources’ the wisdom of whether to distributed-break or not, in a non-blunt way.

The solution correctly negotiates the scenario which for the simpler implementation induced the anti-pattern [*2]:

multiple upstream nodes calling downstream with quorum-collaborating circuit breakers - problem only with one node

But also provides the [*1] benefit - if M is down, the quorum logic soon detects this, and breaks:

multiple upstream nodes calling downstream with separate circuit breakers - quorum logic

Mechanisms for sharing state

The mechanisms for sharing state information between nodes are intended exactly to be existing distributed cache technologies such as Redis, NCache etc. (As you say @mcquiggd, all of these with their dual in-memory/remote-syncd caching are great fits.) It also doesn’t have to involve a store: if users already have some asynchronous messaging tech in the mix - Azure Service Bus Queues, Amazon SQS, RabbitMQ, whatever - these are alternatives too.

Polly just provides a primitive

This doesn’t make Polly embed any knowledge about the system being governed. As ever, Polly just provides a resilience primitive. Users model it to their own system by choosing which upstream systems to group into which distributed circuits, what quorum thresholds to configure, etc.

9reactions
heneryvillecommented, Oct 17, 2017

The serverless compute model makes sharing state like this for circuit breakers absolutely necessary.

Right now local-circuit breaking requires each node to learn on it’s own that a given dependency is down. If your instances are long-lived with persistent in-memory state, that’s OK. E.g. on the first 5 requests an instance sees failure, and then circuit breaks the dependency. Those 5 requests are essentially an learning period.

However, with serverless compute models (e.g. AWS Lambda or Azure Cloud Functions) each instance’s internal memory/state is highly transitory. In the extreme, each instance only survives long enough to make a single request to the dependency, and thus no node would survive long enough to learn the dependency’s condition. The system would become equivalent to one with no circuit breaking at all.

Sharing state would become necessary so that all of the nodes contribute their learnings together.

Read more comments on GitHub >

github_iconTop Results From Across the Web

Circuit breaker
A new instance of CircuitBreaker is created using of ; there, we specify the different options. Deprecation in Arrow 1.2. The of constructor...
Read more >
SDV7 and SDV-R outdoor distribution circuit breakers
SDV7 and SDV-R arc-resistant and non-arc-resistant, outdoor, distribution circuit breakers are offered with voltage ratings from 15.5 - 38 kV, ...
Read more >
Spring Cloud Circuit Breaker, Adding Fault Tolerance to ...
Learn how to add fault tolerance to your applications with Spring Cloud Circuit Breaker.
Read more >
PDC3JD20 - Circuit breaker accessory, PowerPacT J, ...
Schneider Electric USA. PDC3JD20 - Circuit breaker accessory, PowerPacT J, power distribution connectors, 150A to 250A, qty 3.
Read more >
CAPAROC electronic circuit breaker system
CAPAROC is your customizable electronic circuit breaker system for overcurrent protection in 12 V to 24 V DC applications. The standard versions offer...
Read more >

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found