Infinite lock issue with RedisLockExtension
What does the bug do?
Produces an infinite lock on resource
How can it be reproduced?
Aha, I’m struggling to build a reproduction, but the failure mode can be described in theory.
Setup
- CacheStack
- Redis layer only
- RedisLockExtension used
What happened?
We started using background refresh in production and found we were losing MySQL connections - they effectively got taken from the pool and never returned. This happens slowly: a couple every few hundred thousand requests.
Initially we turned off background refreshing, but the issue was still present. We then just used the Get/Set functions (as we did previously) and the issue was not there. Finally we removed the RedisLockExtension without background refreshing, and the issue was not there; then we added background refreshing back in, and the issue was still not there. This effectively points the finger at RedisLockExtension. We have been stable in production for 48 hours without it (no lost connections from the pool), whereas prior to this we were losing connections every hour or two.
What’s going on?
When we build a context via our IoC container, we inject a DB session - under the hood this grabs a MySQL connection. Most web requests also grab a DB session (and therefore a MySQL connection).
Infinite lock issue
It’s taken me a while to spot:
- Try to get a lock: https://github.com/TurnerSoftware/CacheTower/blob/main/src/CacheTower.Extensions.Redis/RedisLockExtension.cs#L73
- If we can’t get a lock, we wait on the completion source: https://github.com/TurnerSoftware/CacheTower/blob/main/src/CacheTower.Extensions.Redis/RedisLockExtension.cs#L101
- Once the owner of the lock has refreshed a value, it publishes that via Redis: https://github.com/TurnerSoftware/CacheTower/blob/main/src/CacheTower.Extensions.Redis/RedisLockExtension.cs#L80
- The subscribers try to clear down that completion source by setting the result: https://github.com/TurnerSoftware/CacheTower/blob/main/src/CacheTower.Extensions.Redis/RedisLockExtension.cs#L50
But what happens if there is a disruption to the Redis connection, and the sub never receives the lock release? Well, in this case you have an infinite lock, and there is no way to release it.
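To make the failure concrete, here is a minimal sketch of that wait pattern (illustrative names and shapes, not the actual RedisLockExtension source): a waiter parks on a TaskCompletionSource that is only ever completed by the Redis subscriber callback.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

class LockWaitSketch
{
    private readonly ConcurrentDictionary<string, TaskCompletionSource<bool>> waiters = new();

    // Called by an instance that failed to take the distributed lock.
    public Task WaitForReleaseAsync(string cacheKey)
    {
        var waiter = waiters.GetOrAdd(cacheKey,
            _ => new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously));

        // If the pub/sub "released" message is lost (e.g. a Redis connection
        // blip or dropped subscription), nothing ever calls TrySetResult and
        // this task never completes - the waiter is stuck forever.
        return waiter.Task;
    }

    // Invoked from the Redis subscription when the lock owner publishes the release.
    public void OnLockReleased(string cacheKey)
    {
        if (waiters.TryRemove(cacheKey, out var waiter))
        {
            waiter.TrySetResult(true);
        }
    }
}
```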
A similar process is used in the main CacheStack - https://github.com/TurnerSoftware/CacheTower/blob/main/src/CacheTower/CacheStack.cs#L313 - but there we aren’t relying on an external dependency to transmit a message to make sure we unlock; the release happens in the context that performed the update, so this issue cannot occur.
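Roughly, the in-process path has the shape below (a simplified sketch, not the actual CacheStack source): because the release happens in a finally block in the same context that performed the refresh, a lost message can never leave waiters stranded.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class InProcessLockSketch
{
    private readonly ConcurrentDictionary<string, TaskCompletionSource<bool>> waiters = new();

    public async Task<T> RefreshAsync<T>(string cacheKey, Func<Task<T>> valueFactory)
    {
        var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        waiters[cacheKey] = tcs;
        try
        {
            // Build the new value (and, in the real stack, write it to the cache layers).
            return await valueFactory();
        }
        finally
        {
            // Runs in the same context that performed the refresh, even if
            // valueFactory throws, so waiters can never be left hanging on a
            // message that has to cross the network.
            waiters.TryRemove(cacheKey, out _);
            tcs.TrySetResult(true);
        }
    }
}
```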
Thoughts:
- We could use a timeout on the completion source, so the lock always releases even if no sub ever receives the release message (see the sketch below this list).
- I wonder if a busy-wait approach on LockTakeAsync/LockReleaseAsync would be better, rather than relying on pub/sub?
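One possible shape for the timeout idea (a hypothetical helper, not an existing CacheTower API): race the release signal against a delay and let the caller fall back when the delay wins.

```csharp
using System;
using System.Threading.Tasks;

static class TimeoutWaitSketch
{
    // Wait for the pub/sub release signal, but never longer than maxWait.
    // Returns true if the release actually arrived; false means we gave up
    // and the caller should re-check the cache or retry LockTakeAsync itself
    // rather than stay parked forever.
    public static async Task<bool> WaitForReleaseAsync(Task releaseSignal, TimeSpan maxWait)
    {
        var completed = await Task.WhenAny(releaseSignal, Task.Delay(maxWait));
        return completed == releaseSignal;
    }
}
```

The busy-wait alternative would skip the pub/sub subscription entirely and instead poll LockTakeAsync with a short delay until the lock is acquired or the refreshed value shows up in the cache.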
Hey @Turnerj,
Interesting - in this particular API we run 4 instances in a K8s cluster, but we use this cache on a hot path of requests (tens per second), with a relatively expensive cache value to build.
We currently use this happily with Redis, just without this lock extension - which means multiple instances can incur the expense of building the cache simultaneously, but that trade-off is preferable to thread starvation. In fact, we don’t really notice a performance hit when we occasionally build the cache multiple times on different instances.
Of course, it would always be great to know the answer, so it would be interesting to patch this, and I’m happy to spin it into production to see if it fixes it.
I finally had a chance to put it all back into production with the spin lock etc. enabled, but we’re still suffering this issue. The crazy thing is that I’m seeing connections which have been “lost” for minutes/hours.
I still point the finger at the RedisLockExtension, because if I remove it everything is fine under our production load and we’ve never seen a lost connection, but within a few hours of running with this extension enabled, connections get lost. Obviously this is still anecdotal, and at best an educated guess. I need to do some more digging and actually check whether these lock keys are being removed from Redis (the fail-safe should always work if they are?).
Otherwise, there is something else afoot.
I will try to find some time to dig more over the next few days/weeks!
Everything is rock solid without this extension, utilising background refresh and our SimpleInjector cache activator. It isn’t critical for us to have this distributed lock feature (our lock contention is low, and we don’t run huge numbers of pods in our farm), however it is a nice-to-have, so I will continue to investigate to see if I can find a root cause.