In extreme concurrent situations, permittedNumberOfCallsInHalfOpenState isn't working properly


Resilience4j version: 1.5.0

Java version: 11

I created a circuit breaker with permittedNumberOfCallsInHalfOpenState(1), but under high load with timeouts, the HALF_OPEN state lets more than one call through to the backend.
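For readers who have not opened the linked repository, a configuration of this kind might look roughly like the fragment below. It is only a sketch and needs the resilience4j-circuitbreaker dependency on the classpath. Only permittedNumberOfCallsInHalfOpenState(1) and the 10 second wait in the open state (mentioned later in this thread) come from the issue; the breaker name, sliding window size, and failure rate threshold are assumptions made for illustration.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;

class BreakerConfigSketch {

    static CircuitBreaker backendBreaker() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .permittedNumberOfCallsInHalfOpenState(1)        // the setting under discussion
            .waitDurationInOpenState(Duration.ofSeconds(10)) // reported later in the thread
            .slidingWindowSize(10)                           // assumption, not from the issue
            .failureRateThreshold(50)                        // assumption, not from the issue
            .build();
        // "backend" is a hypothetical breaker name.
        return CircuitBreaker.of("backend", config);
    }
}
```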

How to reproduce:

  1. Codebase: https://github.com/rejuan/Resilience4jPractice
  2. Create a mock URL with SoapUI or any other tool (URL: http://localhost:8089/mock) and set a response delay of 2 seconds to produce a timeout scenario
  3. Generate load with Apache JMeter or any other tool (Number of Threads: 15, Loop Count: infinite)
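If SoapUI is not at hand, step 2 can be approximated with the JDK's built-in HTTP server. The sketch below is a stand-in, not part of the linked repository: the class name SlowMockServer and the response body are made up for illustration, while the port, path, and 2 second delay follow the steps above.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;

class SlowMockServer {

    /** Starts an HTTP server whose /mock endpoint answers only after delayMillis. */
    static HttpServer start(int port, long delayMillis) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/mock", exchange -> {
            try {
                Thread.sleep(delayMillis); // hold the response so the client times out first
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            byte[] body = "ok".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        // One daemon thread per request, so concurrent load-test calls are delayed independently.
        server.setExecutor(Executors.newCachedThreadPool(runnable -> {
            Thread thread = new Thread(runnable);
            thread.setDaemon(true);
            return thread;
        }));
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        start(8089, 2000); // port and delay taken from the reproduction steps
        System.out.println("Mock listening on http://localhost:8089/mock");
    }
}
```

Any client whose timeout is shorter than the delay, such as the 1 second timeout mentioned later in this thread, should then fail on every call.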

How to observe: The log shows multiple timeouts only a couple of milliseconds apart.

2020-07-21 20:43:28,958 ERROR com.shortandprecise.resilience4jpractice.service.WebClientService [parallel-3] timeout happened
2020-07-21 20:43:28,959 ERROR com.shortandprecise.resilience4jpractice.service.WebClientService [parallel-4] timeout happened

Top GitHub Comments

2 reactions
AlirezaHakamian commented, Dec 22, 2020

Hello @RobWin,

I am a research assistant at the University of Stuttgart.

I read the CircuitBreaker documentation at https://resilience4j.readme.io/docs/circuitbreaker#

I formally specified the documented CircuitBreaker behavioral design (sliding window type) in TLA+.

I could verify various invariant and liveness properties of the specified design, except one: that the number of executions while the CircuitBreaker is in the half-open state is at most a predefined constant (permittedNumberOfCallsInHalf). The model checker tells me that the system can grant permission to some of the available threads in the closed state before it moves from closed to open and from open to half-open. Therefore, I can never assert that the number of executions in the half-open state is at most that constant. What I can assert is that the number of executions in the half-open state is less than or equal to permittedNumberOfCallsInHalf + the number of available calling threads.

Suppose more than 20 threads obtain permission before the CircuitBreaker switches to the half-open state. Those 20 calls plus permittedNumberOfCallsInHalf could put pressure on a service that is already down, although one could argue that the bulkhead pattern resolves this.
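The bound described above can be illustrated with a small, deterministic toy model. This is a sketch of the interleaving only, not Resilience4j's implementation and not the TLA+ specification: ToyBreaker and all of its members are hypothetical names, and the model assumes that calls permitted in the closed state are still running when the breaker reaches the half-open state.

```java
import java.util.concurrent.atomic.AtomicInteger;

class HalfOpenBoundSketch {

    enum State { CLOSED, OPEN, HALF_OPEN }

    /** Toy breaker: permission is checked when a call starts, not while it runs. */
    static final class ToyBreaker {
        private final int permittedInHalfOpen;
        private final AtomicInteger halfOpenPermits = new AtomicInteger();
        private State state = State.CLOSED;

        ToyBreaker(int permittedInHalfOpen) {
            this.permittedInHalfOpen = permittedInHalfOpen;
        }

        boolean tryAcquirePermission() {
            switch (state) {
                case CLOSED:
                    return true;
                case OPEN:
                    return false;
                default:
                    // HALF_OPEN: hand out a permit only while the counter is above zero.
                    return halfOpenPermits.getAndUpdate(p -> p > 0 ? p - 1 : 0) > 0;
            }
        }

        void transitionTo(State next) {
            state = next;
            if (next == State.HALF_OPEN) {
                halfOpenPermits.set(permittedInHalfOpen);
            }
        }
    }

    /** Returns how many calls are in flight while the toy breaker is HALF_OPEN. */
    static int callsInFlightDuringHalfOpen(int callingThreads, int permittedInHalfOpen) {
        ToyBreaker breaker = new ToyBreaker(permittedInHalfOpen);
        int inFlight = 0;
        // 1. Every thread obtains permission while the breaker is still CLOSED.
        for (int i = 0; i < callingThreads; i++) {
            if (breaker.tryAcquirePermission()) inFlight++;
        }
        // 2. The breaker opens and then half-opens before those calls finish (model assumption).
        breaker.transitionTo(State.OPEN);
        breaker.transitionTo(State.HALF_OPEN);
        // 3. New attempts arrive in HALF_OPEN; only the configured number is admitted.
        for (int i = 0; i < callingThreads; i++) {
            if (breaker.tryAcquirePermission()) inFlight++;
        }
        return inFlight;
    }

    public static void main(String[] args) {
        System.out.println(callsInFlightDuringHalfOpen(20, 1)); // prints 21
    }
}
```

With 20 calling threads and one permitted half-open call, the model reports 21 calls in flight, matching the bound of permittedNumberOfCallsInHalf plus the number of calling threads.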

My question is: how is this situation handled in the code? Is it a design issue in the CircuitBreaker at all?

Sorry for taking your time, and Merry Christmas in advance. Sincerely, Alireza

1 reaction
rejuan commented, Jul 23, 2020

Hi @RobWin,

  1. Request R1 is permitted when the CircuitBreaker is CLOSED, and it takes 1 second to time out.
  2. While request R1 is being processed, a previous request R0 times out and the CircuitBreaker transitions to HALF_OPEN.
  3. A test request R2 is permitted shortly after R1 and also times out.

Actually, the case above isn't possible, because the HTTP call timeout is 1 second whereas the wait duration in the open state is 10 seconds. So the WebClient calls must have been initiated in the HALF_OPEN state.

I have updated the sample code with a state transition log: https://github.com/rejuan/Resilience4jPractice/blob/bf12f056eceaa9c572228b675f72ce6b0441e472/src/main/java/com/shortandprecise/resilience4jpractice/config/ApplicationStartUp.java#L41-L42

I also added some more logging. Hopefully it helps.

2020-07-23 14:04:08,925 [parallel-1] timeout happened. actual time: 1022
2020-07-23 14:04:09,179 [parallel-2] timeout happened. actual time: 1003
2020-07-23 14:04:09,510 [parallel-3] timeout happened. actual time: 1002
2020-07-23 14:04:09,845 [parallel-4] timeout happened. actual time: 1002
2020-07-23 14:04:09,964 [parallel-5] timeout happened. actual time: 1001
2020-07-23 14:04:10,176 [parallel-6] timeout happened. actual time: 1002
2020-07-23 14:04:10,210 [parallel-7] timeout happened. actual time: 1002
2020-07-23 14:04:10,512 [parallel-8] timeout happened. actual time: 1003
2020-07-23 14:04:10,531 [parallel-1] timeout happened. actual time: 1001
2020-07-23 14:04:10,840 [parallel-2] timeout happened. actual time: 1002
2020-07-23 14:04:10,846 [parallel-2] State transition from CLOSED to OPEN
2020-07-23 14:04:10,867 [parallel-3] timeout happened. actual time: 1001
2020-07-23 14:04:10,990 [parallel-4] timeout happened. actual time: 1001
2020-07-23 14:04:11,154 [parallel-5] timeout happened. actual time: 1002
2020-07-23 14:04:11,204 [parallel-6] timeout happened. actual time: 1003
2020-07-23 14:04:11,224 [parallel-7] timeout happened. actual time: 1002
2020-07-23 14:04:11,488 [parallel-8] timeout happened. actual time: 1002
2020-07-23 14:04:11,531 [parallel-1] timeout happened. actual time: 1002
2020-07-23 14:04:11,537 [parallel-2] timeout happened. actual time: 1002
2020-07-23 14:04:11,820 [parallel-3] timeout happened. actual time: 1002
2020-07-23 14:04:20,847 [reactor-http-epoll-3] State transition from OPEN to HALF_OPEN
2020-07-23 14:04:21,846 [parallel-6] timeout happened. actual time: 1000
2020-07-23 14:04:21,847 [parallel-6] State transition from HALF_OPEN to OPEN
2020-07-23 14:04:21,846 [parallel-7] timeout happened. actual time: 1000
2020-07-23 14:04:21,846 [parallel-5] timeout happened. actual time: 1000
2020-07-23 14:04:21,846 [parallel-4] timeout happened. actual time: 1000
2020-07-23 14:04:21,847 [parallel-4] timeout happened. actual time: 1001
2020-07-23 14:04:21,848 [parallel-8] timeout happened. actual time: 1001
2020-07-23 14:04:21,848 [parallel-4] timeout happened. actual time: 1001
2020-07-23 14:04:21,850 [parallel-4] timeout happened. actual time: 1004
2020-07-23 14:04:21,851 [parallel-1] timeout happened. actual time: 1002
2020-07-23 14:04:31,847 [reactor-http-epoll-5] State transition from OPEN to HALF_OPEN
2020-07-23 14:04:32,847 [parallel-3] timeout happened. actual time: 1000
2020-07-23 14:04:32,847 [parallel-2] timeout happened. actual time: 1000
2020-07-23 14:04:32,848 [parallel-4] timeout happened. actual time: 1000
2020-07-23 14:04:32,848 [parallel-2] timeout happened. actual time: 1001
2020-07-23 14:04:32,848 [parallel-3] State transition from HALF_OPEN to OPEN
2020-07-23 14:04:32,848 [parallel-2] timeout happened. actual time: 1001
2020-07-23 14:04:32,849 [parallel-2] timeout happened. actual time: 1001
2020-07-23 14:04:42,848 [reactor-http-epoll-8] State transition from OPEN to HALF_OPEN
2020-07-23 14:04:43,850 [parallel-5] timeout happened. actual time: 1002
2020-07-23 14:04:43,851 [parallel-5] State transition from HALF_OPEN to OPEN
2020-07-23 14:04:43,853 [parallel-6] timeout happened. actual time: 1005
2020-07-23 14:04:53,851 [reactor-http-epoll-1] State transition from OPEN to HALF_OPEN
2020-07-23 14:04:54,855 [parallel-8] timeout happened. actual time: 1004
2020-07-23 14:04:54,855 [parallel-7] timeout happened. actual time: 1004
2020-07-23 14:04:54,855 [parallel-8] State transition from HALF_OPEN to OPEN
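To make the excerpt easier to read, the timeouts can be tallied per half-open window. The helper below is a hypothetical sketch, not part of the sample repository: it counts the "timeout happened" lines logged after each transition to HALF_OPEN and before the next one, relying only on the two message texts visible in the log above.

```java
import java.util.ArrayList;
import java.util.List;

class HalfOpenLogCounter {

    /** Counts timeout lines logged after each transition to HALF_OPEN, up to the next such transition. */
    static List<Integer> timeoutsPerHalfOpenWindow(List<String> logLines) {
        List<Integer> counts = new ArrayList<>();
        int current = -1; // -1 means no HALF_OPEN transition has been seen yet
        for (String line : logLines) {
            if (line.contains("to HALF_OPEN")) {
                if (current >= 0) counts.add(current); // close the previous window
                current = 0;
            } else if (current >= 0 && line.contains("timeout happened")) {
                current++;
            }
        }
        if (current >= 0) counts.add(current); // close the last window
        return counts;
    }

    public static void main(String[] args) {
        // The last two half-open windows, copied from the log above.
        List<String> sample = List.of(
            "2020-07-23 14:04:42,848 [reactor-http-epoll-8] State transition from OPEN to HALF_OPEN",
            "2020-07-23 14:04:43,850 [parallel-5] timeout happened. actual time: 1002",
            "2020-07-23 14:04:43,851 [parallel-5] State transition from HALF_OPEN to OPEN",
            "2020-07-23 14:04:43,853 [parallel-6] timeout happened. actual time: 1005",
            "2020-07-23 14:04:53,851 [reactor-http-epoll-1] State transition from OPEN to HALF_OPEN",
            "2020-07-23 14:04:54,855 [parallel-8] timeout happened. actual time: 1004",
            "2020-07-23 14:04:54,855 [parallel-7] timeout happened. actual time: 1004",
            "2020-07-23 14:04:54,855 [parallel-8] State transition from HALF_OPEN to OPEN");
        System.out.println(timeoutsPerHalfOpenWindow(sample)); // prints [2, 2]
    }
}
```

Applied by hand to the full excerpt, the same rule gives 9, 6, 2, and 2 timeouts for the four half-open windows, each above the configured limit of 1.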