distributed.lock.Lock deadlocks on worker failure.

See original GitHub issue

Use of distributed.lock.Lock causes deadlocks in cases where a lock-holding worker is lost and/or fails to release the distributed lock due to an error. This blocks all subsequent attempts to acquire the named lock, as the lock can no longer be released.

This can be trivially demonstrated via:


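from distributed import Client, Lock

client = Client()  # Lock needs an active client; this starts a local cluster

lock_a = Lock("x")
lock_a.acquire()

# simulated worker loss: the holder goes away without calling release()
del lock_a

# ...never acquirable: this call blocks forever
lock_b = Lock("x")
lock_b.acquire()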

The simplest solution to this issue would be to add an optional, client-specified TTL to acquired locks. Any subsequent acquire attempt could then break the existing lock and acquire the target iff the lock's TTL has passed.
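As a rough illustration of the proposed TTL semantics, here is a minimal in-memory sketch. It is not distributed's implementation or API: the `TTLLockTable` class, its `try_acquire` method, and the injectable `clock` parameter are all hypothetical names invented for this example, and a real version would live on the scheduler.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLLockTable:
    """Hypothetical table of named locks with optional TTLs (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # name -> (holder id, expiry time, or None for "never expires")
        self._held: Dict[str, Tuple[str, Optional[float]]] = {}

    def try_acquire(self, name: str, holder: str, ttl: Optional[float] = None) -> bool:
        """Return True if `holder` now owns `name`, False if it is validly held."""
        now = self._clock()
        current = self._held.get(name)
        if current is not None:
            _, expires_at = current
            if expires_at is None or now < expires_at:
                return False  # lock is still validly held
            # TTL has passed: break the stale lock and fall through
        self._held[name] = (holder, None if ttl is None else now + ttl)
        return True


# Usage with a fake clock so the expiry is deterministic.
fake_now = [0.0]
table = TTLLockTable(clock=lambda: fake_now[0])
first = table.try_acquire("x", "worker-a", ttl=10.0)    # acquired
blocked = table.try_acquire("x", "worker-b", ttl=10.0)  # refused, TTL not passed
fake_now[0] = 11.0                                      # worker-a is lost, TTL passes
taken_over = table.try_acquire("x", "worker-b", ttl=10.0)
```

Omitting `ttl` in this sketch reproduces today's behavior, where a lock held by a lost worker can never be acquired again.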

I am happy to open a PR for this feature, but would like to have a second opinion on a few design considerations:

  1. What is the expected client behavior if the client releases an already-expired lock? The simplest behavior would be to treat the lock as un-acquired and return an equivalent error.
  2. Should locks support a “renew” operation, extending the TTL of an acquired lock? This could conceivably provide support for long-running operations and/or allow improved responsiveness. This may require extending the current API with a renew operation in addition to release.
  3. Should lock expiration occur “on expire” or “on demand”? Expiration “on demand” would only break the target lock if another acquire attempt occurs, which may be desirable behavior but would potentially mask occurrences where a long-running operation exceeds the specified TTL during smaller-scale development and testing.
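To make the three questions concrete, the sketch below picks one possible answer to each: releasing an expired lock raises an error, a renew operation extends the TTL, and expiry is checked "on demand" when the lock is next touched. Every name here (`LeaseTable`, `LeaseExpired`, `renew`, the `clock` parameter) is hypothetical and chosen only for illustration; none of it is part of distributed.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class LeaseExpired(Exception):
    """Raised when a holder acts on a lock it no longer owns."""


class LeaseTable:
    """Hypothetical lock table with TTL, renew, and on-demand expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._leases: Dict[str, Tuple[str, float]] = {}  # name -> (holder, expiry)

    def _live_holder(self, name: str) -> Optional[str]:
        lease = self._leases.get(name)
        if lease is None:
            return None
        holder, expires_at = lease
        if self._clock() >= expires_at:
            del self._leases[name]  # "on demand": expire only when inspected
            return None
        return holder

    def acquire(self, name: str, holder: str, ttl: float) -> bool:
        if self._live_holder(name) is not None:
            return False
        self._leases[name] = (holder, self._clock() + ttl)
        return True

    def renew(self, name: str, holder: str, ttl: float) -> None:
        if self._live_holder(name) != holder:
            raise LeaseExpired(f"{holder} does not hold {name}")
        self._leases[name] = (holder, self._clock() + ttl)

    def release(self, name: str, holder: str) -> None:
        if self._live_holder(name) != holder:
            raise LeaseExpired(f"{holder} does not hold {name}")
        del self._leases[name]
```

With on-demand expiry, a holder that overruns its TTL only finds out at `renew` or `release` time, which is the masking effect described in the third question. An "on expire" variant would need a timer or periodic sweep to break the lock as soon as the TTL passes.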

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

2 reactions
aeantipov commented, Feb 2, 2019

@asford I am running into this issue every once in a while (unacquired locks due to workers failing). Did you have time to submit a PR? I can’t find it in distributed. Thanks.

0 reactions
lmeyerov commented, Aug 18, 2021

Ah, we just hit this as well; it may be worth making a note in the docs. When building an HA system, in the case of a benign crash (e.g., GPU OOM), it’s reasonable to expect restarting workers to be safe, but if the crashed worker was holding a Lock, it will currently deadlock even across restarts. (We hit this as bsql is not reentrant, so we have to gate worker access to it.)

