[Idea] Could workers sometimes know when to release keys on their own?
In https://github.com/dask/distributed/issues/5083#issuecomment-885972668 I wrote up a theory for how high scheduler load can lead to workers running out of memory: the scheduler is slow to send them free-keys messages, so otherwise-releasable data piles up. Is there a way to make the scheduler less central to the critical path for workers releasing memory? (This idea probably overlaps a lot with, or is a subset of, #4982 and #3974. Also bear in mind that this theory is completely unproven; it's just something I made up.)
Could we somehow mark tasks as “safe to release”, so workers know that when they’ve completed all the dependents of a task locally, they can release that task, since no other worker (or client) will need the data?
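A minimal sketch of the worker-side rule, assuming the worker already knows (a) which keys the scheduler has marked "safe to release" and (b) the full set of dependents for each of those keys. `ToyWorker` and its fields are hypothetical and do not correspond to the real dask.distributed worker state machine.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWorker:
    """Hypothetical worker bookkeeping; not the dask.distributed Worker API."""

    data: dict = field(default_factory=dict)        # key -> value held in memory
    releasable: set = field(default_factory=set)    # keys marked "safe to release" by the scheduler
    dependents: dict = field(default_factory=dict)  # key -> set of keys that depend on it
    completed: set = field(default_factory=set)     # keys whose tasks have finished on this worker

    def task_finished(self, key, value, dependencies):
        """Store a result, then drop every releasable dependency whose
        dependents have all completed locally. Returns the dropped keys."""
        self.data[key] = value
        self.completed.add(key)
        released = []
        for dep in dependencies:
            deps_of = self.dependents.get(dep)
            # Only release when we know every dependent and all of them are done here.
            if dep in self.releasable and deps_of is not None and deps_of <= self.completed:
                self.data.pop(dep, None)
                self.releasable.discard(dep)
                released.append(dep)
        return released
```

For example, with `a` needed by `b` and `c`, finishing `b` keeps `a` in memory and finishing `c` drops it. A real implementation would presumably still have to tell the scheduler about the drop.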
We can’t say this at submission time, since we haven’t yet scheduled dependencies. (Though we could probably eagerly mark tasks with only one dependency as releasable.) But maybe when we assign a task to a worker, we could also look through its immediate dependencies: any that are already assigned to that same worker, have no dependents that are either unscheduled or scheduled on other workers, and are not requested by a client could be marked as releasable.
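A rough sketch of that check at assignment time, assuming plain dicts stand in for the scheduler's task graph and placement state. The function name and all parameters are hypothetical; this is not how the dask.distributed scheduler stores its state.

```python
def releasable_dependencies(task, worker, dependencies, dependents, assignment, client_wanted):
    """Return the dependencies of `task` that `worker` could be told are safe
    to release once all their dependents have finished locally.

    dependencies:  key -> set of keys it depends on
    dependents:    key -> set of keys that depend on it
    assignment:    key -> worker name, for tasks assigned so far (unscheduled keys are absent)
    client_wanted: keys that a client still holds a reference to
    """
    # Treat `task` as already assigned, so it counts as a local dependent.
    assignment = {**assignment, task: worker}
    result = set()
    for dep in dependencies.get(task, set()):
        # The dependency must live on this worker and not be wanted by a client.
        if assignment.get(dep) != worker or dep in client_wanted:
            continue
        # Every dependent must already be scheduled, and on this same worker.
        if all(assignment.get(d) == worker for d in dependents.get(dep, set())):
            result.add(dep)
    return result
```

Here `a` feeds `b` and `c`: while `c` is still unscheduled nothing is releasable, and once both `b` and `c` land on the same worker as `a`, assigning the last of them marks `a`.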
This could have a nice balanced-budget property, where in many cases the scheduler couldn’t hand out new tasks to workers without also giving them some tasks to release (in the future).
Top GitHub Comments
Behaviour similar to what you are describing was one of the reasons for the deadlocks of recent months. Doing this consistently is very difficult.
I think AMM (#4982) will already remove most of the problems motivating this, since AMM could remove replicas from most workers while a few are still using them. The delayed deletion of data on those few workers should not destabilize an entire cluster.
FWIW, I believe we could implement something like this on the worker side for the few instances where the worker has complete information (e.g. it has all dependents of a task in memory), but I’m not sure this is a very common case.
I would suggest holding off until AMM is somewhat operational and then estimating whether we still perceive this to be a problem.
Well, some race conditions are unavoidable, but the big question is whether or not we end up in a corrupt state. The worker would only be allowed to forget a key if it also tells the scheduler, so that the key will be rescheduled. Even if this information weren’t sent to the scheduler, it would trigger a “missing-key” event chain and we’d self-heal. Avoiding this kind of rescheduling is only possible if we do not allow the worker to make any decisions (as is the case right now). The question here is which scenario is more common and how big the impact of this “optimistic release” is. Either way, before baking something like this in, we’d need a few good benchmarks. If the numbers are not convincing, I’m inclined not to merge something like this, in favour of reduced complexity, as discussed above.
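A toy model of the two paths described here, assuming the scheduler tracks which workers hold each key and which keys are still needed. `ToyScheduler`, `worker_released` and `fetch_failed` are hypothetical names, not dask.distributed handlers; the point is only that the cooperative path and the missing-key fallback converge on the same state.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Hypothetical scheduler bookkeeping; not the dask.distributed Scheduler API."""

    who_has: dict = field(default_factory=dict)     # key -> set of workers holding a replica
    needed: set = field(default_factory=set)        # keys a pending task or client still needs
    to_recompute: set = field(default_factory=set)  # keys that must be rescheduled

    def _forget_replica(self, key, worker):
        holders = self.who_has.setdefault(key, set())
        holders.discard(worker)
        # No replica left but the data is still needed: reschedule it.
        if not holders and key in self.needed:
            self.to_recompute.add(key)

    def worker_released(self, key, worker):
        """Cooperative path: the worker announces that it dropped `key`."""
        self._forget_replica(key, worker)

    def fetch_failed(self, key, worker):
        """Fallback path: a peer could not fetch `key` from `worker`
        (the missing-key case), so the scheduler finds out late."""
        self._forget_replica(key, worker)
```

Either way the key ends up rescheduled; the fallback path just pays for a failed fetch first, which is part of the cost a benchmark would need to capture.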