Infinite lock issue with RedisLockExtension
What does the bug do?
Produces an infinite lock on resource
How can it be reproduced?
Aha, I’m struggling to build a reproduction, but the failure mode can be described in theory.
Setup
- CacheStack
- Redis layer only
- RedisLockExtension used
What happened?
We started using background refresh in production and found we were losing MySQL connections - they effectively got taken from the pool and never returned. This happens slowly: a couple every few hundred thousand requests.
Initially we turned off background refreshing, but the issue was still present. We then just used the Get/Set functions (as we did previously) and the issue was not there. Finally we removed the RedisLockExtension without background refreshing, and the issue was not there; then we added background refreshing back in, and the issue was still not there. This effectively points the finger at RedisLockExtension. We have been stable in production for 48 hours without it (no lost connections from the pool), whereas prior to this we were losing connections every hour or two.
What’s going on?
When we build a context via our IoC container, we inject a DB session - under the hood this grabs a MySQL connection. Most web requests also grab a DB session (and therefore a MySQL connection).
Infinite lock issue
It’s taken me a while to spot:
- Try to get a lock: https://github.com/TurnerSoftware/CacheTower/blob/main/src/CacheTower.Extensions.Redis/RedisLockExtension.cs#L73
- If we can’t get a lock, we wait on the completion source: https://github.com/TurnerSoftware/CacheTower/blob/main/src/CacheTower.Extensions.Redis/RedisLockExtension.cs#L101
- Once the owner of the lock has refreshed a value, it publishes that via Redis: https://github.com/TurnerSoftware/CacheTower/blob/main/src/CacheTower.Extensions.Redis/RedisLockExtension.cs#L80
- The subscribers try to clear down that completion source by setting the result: https://github.com/TurnerSoftware/CacheTower/blob/main/src/CacheTower.Extensions.Redis/RedisLockExtension.cs#L50
But what happens if there is a disruption to the Redis connection, and the sub never receives the lock release? Well, in this case you have an infinite lock, and there is no way to release it.
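To make the failure concrete, here is a minimal sketch of that wait pattern (illustrative names and shapes, not the actual RedisLockExtension source): a waiter parks on a TaskCompletionSource that is only ever completed by the Redis subscriber callback.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

class LockWaitSketch
{
    private readonly ConcurrentDictionary<string, TaskCompletionSource<bool>> waiters = new();

    // Called by an instance that failed to take the distributed lock.
    public Task WaitForReleaseAsync(string cacheKey)
    {
        var waiter = waiters.GetOrAdd(cacheKey,
            _ => new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously));

        // If the pub/sub "released" message is lost (e.g. a Redis connection
        // blip or dropped subscription), nothing ever calls TrySetResult and
        // this task never completes - the waiter is stuck forever.
        return waiter.Task;
    }

    // Invoked from the Redis subscription when the lock owner publishes the release.
    public void OnLockReleased(string cacheKey)
    {
        if (waiters.TryRemove(cacheKey, out var waiter))
        {
            waiter.TrySetResult(true);
        }
    }
}
```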
A similar process is used in the main CacheStack - https://github.com/TurnerSoftware/CacheTower/blob/main/src/CacheTower/CacheStack.cs#L313 - but there we aren’t relying on an external dependency to transmit a message to make sure we unlock; the release happens in the context that performed the update, so this issue cannot occur.
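Roughly, the in-process path has the shape below (a simplified sketch, not the actual CacheStack source): because the release happens in a finally block in the same context that performed the refresh, a lost message can never leave waiters stranded.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class InProcessLockSketch
{
    private readonly ConcurrentDictionary<string, TaskCompletionSource<bool>> waiters = new();

    public async Task<T> RefreshAsync<T>(string cacheKey, Func<Task<T>> valueFactory)
    {
        var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        waiters[cacheKey] = tcs;
        try
        {
            // Build the new value (and, in the real stack, write it to the cache layers).
            return await valueFactory();
        }
        finally
        {
            // Runs in the same context that performed the refresh, even if
            // valueFactory throws, so waiters can never be left hanging on a
            // message that has to cross the network.
            waiters.TryRemove(cacheKey, out _);
            tcs.TrySetResult(true);
        }
    }
}
```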
Thoughts:
- We could use a timeout on the completion source, so the lock always releases even if no sub ever receives the release message (see the sketch below this list).
- I wonder if a busy-wait approach on LockTakeAsync/LockReleaseAsync would be better, rather than relying on pub/sub?
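One possible shape for the timeout idea (a hypothetical helper, not an existing CacheTower API): race the release signal against a delay and let the caller fall back when the delay wins.

```csharp
using System;
using System.Threading.Tasks;

static class TimeoutWaitSketch
{
    // Wait for the pub/sub release signal, but never longer than maxWait.
    // Returns true if the release actually arrived; false means we gave up
    // and the caller should re-check the cache or retry LockTakeAsync itself
    // rather than stay parked forever.
    public static async Task<bool> WaitForReleaseAsync(Task releaseSignal, TimeSpan maxWait)
    {
        var completed = await Task.WhenAny(releaseSignal, Task.Delay(maxWait));
        return completed == releaseSignal;
    }
}
```

The busy-wait alternative would skip the pub/sub subscription entirely and instead poll LockTakeAsync with a short delay until the lock is acquired or the refreshed value shows up in the cache.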
Hey @Turnerj,
Interesting - in this particular API we run 4 instances in a K8s cluster, but we use this cache on a hot path of requests (tens per second), with a relatively expensive cache value to build.
We currently use this happily with Redis, just without this lock extension - which means multiple instances can incur the expense of building the cache simultaneously, but that trade-off is preferable to thread starvation. In fact, we don’t really notice a performance hit when we occasionally build the cache multiple times on different instances.
Of course, it would always be great to know the answer, so it would be interesting to patch this, and I’m happy to spin it into production to see if it fixes it.
I finally had a chance to put it all back into production with the spin lock etc. enabled, but we’re still suffering this issue. The crazy thing is that I’m seeing connections which have been “lost” for minutes/hours.
I still point the finger at the RedisLockExtension, because if I remove it everything is fine under our production load and we’ve never seen a lost connection, but within a few hours of running with this extension enabled, connections get lost. Obviously this is still anecdotal, and at best an educated guess. I need to do some more digging and actually check whether these lock keys are being removed from Redis (the fail-safe should always work if they are?).
Otherwise, there is something else afoot.
I will try to find some time to dig more over the next few days/weeks!
Everything is rock solid without this extension, utilising background refresh and our SimpleInjector cache activator. It isn’t critical for us to have this distributed lock feature (our lock contention is low, and we don’t run huge numbers of pods in our farm), however it is a nice-to-have, so I will continue to investigate to see if I can find a root cause.