Rare case of collision when doing multiple request/reply calls

See original GitHub issue

Hi!

We are currently using the library (and thanks to the project contributors for their work!), and it seems that when we make multiple request/reply calls at the same time from the client, similar to:

import asyncio

async def a(nc):
    # nc.request is a coroutine and the payload must be bytes
    msg = await nc.request(subject="foo", payload=b'')
    # Do something with msg

async def b(nc):
    msg = await nc.request(subject="bar", payload=b'')
    # Do something with msg

await asyncio.gather(
    a(nc),
    b(nc),
)

In a really rare case, we could send the same reply inbox ID to the server for both requests, which makes function b receive the response intended for request a.

From the server-side logs:

2020-11-09 14:17:19 ... Message(subject='foo', reply='_INBOX.Kjq4GobYoPqdOsTvlGKcbf.Kjq4GobYoPqdOuTvlGKcbf', ...)
2020-11-09 14:17:19 ... Message(subject='bar', reply='_INBOX.Kjq4GobYoPqdOsTvlGKcbf.Kjq4GobYoPqdOuTvlGKcbf', ...)

Concerning the environment where we found the ID collision: it happened during an internal benchmark of our application, where our API was connected to a NATS server as both a replier and a publisher, and another server spawned one new client each second, each making two requests at startup while receiving messages continuously. A rough sketch of that setup is below.
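For reference, a minimal sketch of that benchmark shape, assuming a local NATS server at nats://127.0.0.1:4222 and illustrative subject names (foo, bar) with a responder already subscribed to them; this is not the original benchmark code:

import asyncio
from nats.aio.client import Client as NATS

async def spawn_client(servers):
    # One short-lived client: connect, make two concurrent requests at startup, close.
    nc = NATS()
    await nc.connect(servers=servers)
    try:
        await asyncio.gather(
            nc.request("foo", b'', timeout=1),
            nc.request("bar", b'', timeout=1),
        )
    finally:
        await nc.close()

async def main():
    servers = ["nats://127.0.0.1:4222"]
    while True:
        asyncio.ensure_future(spawn_client(servers))  # fire and forget
        await asyncio.sleep(1)                        # one new client each second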

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 16 (6 by maintainers)

Top GitHub Comments

3 reactions
Smerity commented, Dec 19, 2020

Thanks @YiuRULE for noting the issue and producing code that reproduces the bug. I ran into the same issue and checked here to see if someone else had already saved me the work; the bug reproduces cleanly for me after a little wait ❤️

I have run into this issue as well, but rather than spawning multiple clients rapidly, it was a single client with multiple in-flight requests. The issue appeared when hitting ~100 concurrent requests but not with ~50.
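For illustration, a minimal sketch of that single-client scenario, assuming a subject name of foo with a responder listening on it:

import asyncio

async def burst(nc, n=100):
    # Fire n concurrent requests on one connection; with the buggy inbox
    # token generation, a reply can occasionally be delivered to the wrong
    # awaiting request.
    return await asyncio.gather(
        *(nc.request("foo", b'', timeout=2) for _ in range(n))
    )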

Happy to test any potential bug fixes or do bug chasing if that’s helpful. Thanks for checking the issue out @charliestrawn =]


Out of curiosity I ran nuid.py in a loop to see if it’d spit out a duplicate, and that doesn’t seem to be the case. That was likely already checked before, but I just wanted to run the experiment to make my laptop warm ^_^
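For what it’s worth, a sketch of that experiment; the import path assumes the nats.aio.nuid module of the client version from that era (it may live elsewhere in newer releases):

from nats.aio.nuid import NUID

def find_duplicate(n=1_000_000):
    # Draw n tokens from a single NUID instance and report the first repeat, if any.
    nuid = NUID()
    seen = set()
    for _ in range(n):
        token = bytes(nuid.next())
        if token in seen:
            return token
        seen.add(token)
    return None  # None means no duplicate was observed

print(find_duplicate())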

The issue may be an underlying race condition between the three del self._resp_map[token] calls (two in wait_for_msgs and one at the end of request), specifically that the request’s asyncio.TimeoutError also triggers wait_for_msgs’ asyncio.CancelledError or similar. Commenting out the two del calls in wait_for_msgs prevents reproduction of the bug. I can’t work out why that would be, however.

Potential fix: Adding an if token in self._resp_map guard before each del prevents the issue from occurring in my local setup. concurrent.futures._base.TimeoutError and nats.aio.errors.ErrTimeout exceptions occur at about the same frequency as the KeyError would have.
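As a self-contained illustration of that guard (the token / _resp_map names follow the description above; this is not a verbatim patch against the nats.py source):

_resp_map = {"token-1": "pending-future"}

def cleanup(token):
    # Make the cleanup idempotent: whichever code path runs second becomes
    # a harmless no-op instead of raising KeyError.
    if token in _resp_map:
        del _resp_map[token]
    # equivalently: _resp_map.pop(token, None)

cleanup("token-1")
cleanup("token-1")  # second deletion no longer raises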

I don’t think that’d result in any larger issues either, as each call is essentially performing garbage collection. If one of the if checks fails, it just means the garbage has already been collected by someone else.

Happy to submit a pull request with the given fix. My only concern is that I don’t understand the rest of the codebase enough to be fully confident in my “garbage collection cleanup” assumption in the previous paragraph.

1 reaction
wallyqs commented, May 17, 2022

Thanks @mriedem for the report; let’s open a new issue, since that might be a separate race to address, whereas the original issue here was a wrong division that sometimes produced duplicated results.
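For context, a small illustration of why the kind of division matters at NUID scale (this is not the original patch; the value is just chosen to sit above the range where floats can represent every integer):

# Float division loses low-order bits once values exceed 2**53, so digits
# derived from successive float divisions can come out wrong, which is how
# two distinct sequence numbers can encode to the same inbox token.
n = 62 ** 10 - 1        # near the top of a 10-digit base-62 sequence space
print(n // 62)          # 13537086546263551  (exact integer division)
print(int(n / 62))      # 13537086546263552  (float division, off by one here)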
