Rare case of collision when doing multiple request/reply
See original GitHub issue
Hi!
We are currently using the library (and thanks to the contributors of the project for their work!), and it seems that when we do multiple REQ/REPLY at the same time from the client, similar to:
async def a(nc):
    msg = await nc.request(subject="foo", payload=b"")
    # Do something with msg

async def b(nc):
    msg = await nc.request(subject="bar", payload=b"")
    # Do something with msg

await asyncio.gather(
    a(nc),
    b(nc),
)
in a really rare case, we could send the same reply inbox ID to the server for both requests, which makes function b receive the response intended for request a.
Here is what we see in our server-side logs:
2020-11-09 14:17:19 ... Message(subject='foo', reply='_INBOX.Kjq4GobYoPqdOsTvlGKcbf.Kjq4GobYoPqdOuTvlGKcbf', ...)
2020-11-09 14:17:19 ... Message(subject='bar', reply='_INBOX.Kjq4GobYoPqdOsTvlGKcbf.Kjq4GobYoPqdOuTvlGKcbf', ...)
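For reference, a fuller self-contained sketch of this scenario might look like the following. This is only an illustration, not the reporter's actual code: it assumes a local NATS server on the default port and the 2020-era asyncio-nats-client API (payloads are bytes and request() is awaited); the echo responder is added here just so both requests receive a reply.

import asyncio
from nats.aio.client import Client as NATS

async def main():
    nc = NATS()
    await nc.connect(servers=["nats://127.0.0.1:4222"])

    # Minimal responder so that both subjects answer their requests.
    async def echo(msg):
        await nc.publish(msg.reply, b"reply for " + msg.subject.encode())

    await nc.subscribe("foo", cb=echo)
    await nc.subscribe("bar", cb=echo)

    async def a():
        msg = await nc.request(subject="foo", payload=b"", timeout=1)
        # Do something with msg
        print("a received:", msg.data)

    async def b():
        msg = await nc.request(subject="bar", payload=b"", timeout=1)
        # Do something with msg
        print("b received:", msg.data)

    # Two concurrent requests from the same client, as in the report above.
    await asyncio.gather(a(), b())
    await nc.close()

if __name__ == "__main__":
    asyncio.run(main())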
Concerning the environment in which we found the ID collision: it happened during an internal benchmark of our application, where our API was connected to a NATS server both as a replier and as a publisher, while another server spawned one new client per second, each making two requests at startup and receiving messages continuously.
Top GitHub Comments
Thanks @YiuRULE for noting the issue and producing code that reproduces the bug. I have run into the same issue and checked here to see if someone else had saved me the work, and the bug reproduces cleanly for me after a little wait ❤️
I have run into this issue as well, but rather than spawning multiple clients rapidly it was a single client with multiple in-flight requests. The issue appeared when hitting ~100 concurrent requests but not with ~50.
Happy to test any potential bug fixes or do bug chasing if that’s helpful. Thanks for checking the issue out @charliestrawn =]
Out of curiosity I ran nuid.py in a loop to see if it’d spit out a duplicate, and that doesn’t seem to be the case. That was likely already checked before, but I just wanted to run the experiment to make my laptop warm ^_^
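For illustration, a duplicate check along the lines of that experiment could be written as below. The import path is an assumption (older asyncio-nats-client ships the class as nats.aio.nuid.NUID, newer nats-py as nats.nuid.NUID), and next() is assumed to return a bytearray.

from nats.aio.nuid import NUID  # or: from nats.nuid import NUID

def find_nuid_duplicate(n=1_000_000):
    nuid = NUID()
    seen = set()
    for i in range(n):
        value = bytes(nuid.next())
        if value in seen:
            return i, value  # position and value of the first duplicate
        seen.add(value)
    return None  # no duplicate within n draws

if __name__ == "__main__":
    print(find_nuid_duplicate())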
The issue may be an underlying race condition between the three del self._resp_map[token] calls (two in wait_for_msgs and one at the end of request), specifically that the request’s asyncio.TimeoutError also triggers wait_for_msgs’s asyncio.CancelledError or similar. Commenting out the two del calls in wait_for_msgs prevents reproduction of the bug. I can’t work out why that would be, however.
Potential fix: adding an if token in self._resp_map guard before each del prevents the issue from occurring in my local setup. concurrent.futures._base.TimeoutError and nats.aio.errors.ErrTimeout exceptions occur at about the same frequency as the KeyError would have.
I don’t think that’d result in any larger issues either, as each call is essentially performing garbage collection. If an if check fails, it just means the garbage has already been collected by someone else.
Happy to submit a pull request with the given fix. My only concern is that I don’t understand the rest of the codebase enough to be fully confident in my “garbage collection cleanup” assumption in the previous paragraph.
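As a standalone sketch of that idea (not the library’s actual code; the shared dict and token name below are stand-ins for the client’s self._resp_map), the guarded cleanup amounts to making the deletion idempotent, so whichever of the competing paths runs last simply does nothing:

import asyncio

resp_map = {}  # stand-in for the client's self._resp_map

async def cleanup(token):
    # Guarded delete as proposed above: only remove the entry if it is
    # still present, so a second cleanup does not raise KeyError.
    if token in resp_map:
        del resp_map[token]
    # An equivalent one-liner would be: resp_map.pop(token, None)

async def main():
    resp_map["token-1"] = "pending future"
    # Both the timeout path and the message-delivery path try to clean up
    # the same entry; the second call is now a harmless no-op.
    await asyncio.gather(cleanup("token-1"), cleanup("token-1"))
    print(resp_map)  # -> {}

asyncio.run(main())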
Thanks @mriedem for the report. Let’s open a new issue, since that may be a separate race to address; the original issue here was a wrong division that sometimes produced duplicated results.