distributed.lock.Lock deadlocks on worker failure

Use of `distributed.lock.Lock` can deadlock when a lock-holding worker is lost, or fails to release the lock due to an error. All subsequent attempts to acquire the named lock then block forever, since the lock can no longer be released.
This can be trivially demonstrated via:
```python
from distributed import Lock

lock_a = Lock("x")
lock_a.acquire()

# simulated worker loss
del lock_a

# ...never acquirable
lock_b = Lock("x")
lock_b.acquire()
```
The simplest solution to this issue would be to add an (optional, client-specified) TTL to acquired locks, allowing a subsequent acquire attempt to break the existing lock and acquire the target iff the lock's TTL has passed.
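As a minimal sketch of the proposed semantics (the `_locks` table, `acquire` signature, and holder ids below are illustrative assumptions, not the actual `distributed` internals, which keep this state on the scheduler):

```python
import time

# Hypothetical scheduler-side state: lock name -> (holder id, monotonic expiry).
_locks = {}

def acquire(name, holder, ttl):
    """Try to acquire `name`; break the existing lock iff its TTL has passed."""
    now = time.monotonic()
    entry = _locks.get(name)
    if entry is not None and entry[1] > now:
        return False  # lock is held and has not yet expired
    _locks[name] = (holder, now + ttl)  # fresh lease, replacing any expired one
    return True
```

Under this scheme, a lock abandoned by a lost worker simply ages out: once its TTL passes, the next `acquire` succeeds instead of blocking forever.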
I am happy to open a PR for this feature, but would like to have a second opinion on a few design considerations:
- What is the expected client behavior if the client releases an already-expired lock? The simplest behavior would be to treat the lock as un-acquired and return an equivalent error.
- Should locks support a "renew" operation, extending the TTL of an acquired lock? This could support long-running operations and improve responsiveness, but may require extending the current API with a `renew` operation in addition to `release`.
- Should lock expiration occur "on expire" or "on demand"? Expiration "on demand" would only break the target lock when another acquire attempt occurs. That may be desirable behavior, but it could mask cases where a long-running operation exceeds the specified TTL during smaller-scale development and testing.
Issue Analytics
- Created 5 years ago
- Comments: 10 (6 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@asford I am running into this issue every once in a while (unacquired locks due to workers failing). Did you have time to submit a PR? I can’t find it in distributed. Thanks.
Ah, we just hit this as well; it may be worth making a note in the docs. When building an HA system, it's reasonable to expect that restarting workers after a benign crash (e.g., GPU OOM) is safe, but if the crashing code holds a `Lock`, it will currently deadlock even across restarts. (We hit this as bsql is not reentrant, so we have to gate worker access to it.)
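For the "gate access to a non-reentrant resource" pattern above, a sketch of one possible mitigation until locks have TTLs: acquire with a timeout so a restarted worker fails loudly instead of blocking forever. `threading.Lock` stands in here for `distributed.Lock` (whose `acquire` also accepts a `timeout`); the helper name is hypothetical.

```python
import threading

# Stand-in for distributed.Lock; the timeout keeps a restarted worker from
# blocking forever on a lock a crashed worker never released.
_gate = threading.Lock()

def with_resource(work, timeout=5.0):
    """Run `work()` while holding the gate, failing fast if it can't be had."""
    if not _gate.acquire(timeout=timeout):
        raise TimeoutError("gate still held, possibly by a crashed worker")
    try:
        return work()
    finally:
        _gate.release()
```

Failing fast at least surfaces the stuck lock as an error rather than a silent hang, though it does not recover the lock itself.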