distributed.lock.Lock deadlocks on worker failure

Use of `distributed.lock.Lock` can deadlock when a lock-holding worker is lost, or fails to release the lock due to an error. All subsequent attempts to acquire the named lock then block forever, since the lock can no longer be released.
This can be trivially demonstrated via:
```python
from distributed import Lock

lock_a = Lock("x")
lock_a.acquire()

# simulated worker loss
del lock_a

# ...never acquirable
lock_b = Lock("x")
lock_b.acquire()
```
The simplest solution to this issue would be to add an (optional, client-specified) TTL to acquired locks, allowing a subsequent acquire attempt to break the existing lock and acquire the target iff the lock's TTL has passed.
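As a minimal sketch of the proposed semantics (the `_locks` table, `acquire` signature, and holder ids below are illustrative assumptions, not the actual `distributed` internals, which keep this state on the scheduler):

```python
import time

# Hypothetical scheduler-side state: lock name -> (holder id, monotonic expiry).
_locks = {}

def acquire(name, holder, ttl):
    """Try to acquire `name`; break the existing lock iff its TTL has passed."""
    now = time.monotonic()
    entry = _locks.get(name)
    if entry is not None and entry[1] > now:
        return False  # lock is held and has not yet expired
    _locks[name] = (holder, now + ttl)  # fresh lease, replacing any expired one
    return True
```

Under this scheme, a lock abandoned by a lost worker simply ages out: once its TTL passes, the next `acquire` succeeds instead of blocking forever.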
I am happy to open a PR for this feature, but would like to have a second opinion on a few design considerations:
- What is the expected client behavior if the client releases an already-expired lock? The simplest behavior would be to treat the lock as un-acquired and return an equivalent error.
- Should locks support a "renew" operation, extending the TTL of an acquired lock? This could support long-running operations and improve responsiveness, but may require extending the current API with a `renew` operation in addition to `release`.
- Should lock expiration occur "on expire" or "on demand"? Expiration "on demand" would only break the target lock when another acquire attempt occurs. That may be desirable behavior, but it could mask cases where a long-running operation exceeds the specified TTL during smaller-scale development and testing.
Issue Analytics
- Created 5 years ago
- Comments: 10 (6 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@asford I am running into this issue every once in a while (unacquired locks due to workers failing). Did you have time to submit a PR? I can’t find it in distributed. Thanks.
Ah, we just hit this as well; it may be worth making a note in the docs. When building an HA system, it's reasonable to expect that restarting workers after a benign crash (e.g., GPU OOM) is safe, but if the crashing code holds a `Lock`, it will currently deadlock even across restarts. (We hit this as bsql is not reentrant, so we have to gate worker access to it.)
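For the "gate access to a non-reentrant resource" pattern above, a sketch of one possible mitigation until locks have TTLs: acquire with a timeout so a restarted worker fails loudly instead of blocking forever. `threading.Lock` stands in here for `distributed.Lock` (whose `acquire` also accepts a `timeout`); the helper name is hypothetical.

```python
import threading

# Stand-in for distributed.Lock; the timeout keeps a restarted worker from
# blocking forever on a lock a crashed worker never released.
_gate = threading.Lock()

def with_resource(work, timeout=5.0):
    """Run `work()` while holding the gate, failing fast if it can't be had."""
    if not _gate.acquire(timeout=timeout):
        raise TimeoutError("gate still held, possibly by a crashed worker")
    try:
        return work()
    finally:
        _gate.release()
```

Failing fast at least surfaces the stuck lock as an error rather than a silent hang, though it does not recover the lock itself.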