
[Idea] Could workers sometimes know when to release keys on their own?


In https://github.com/dask/distributed/issues/5083#issuecomment-885972668 I wrote up a theory for how high scheduler load can lead to workers running out of memory: the scheduler is slow to send them free-keys messages, so otherwise-releasable data piles up. Is there a way to take the scheduler out of the critical path for workers to release memory? (This idea probably overlaps a lot with, or is a subset of, #4982 and #3974. Also bear in mind that this theory is completely unproven and just something I made up.)

Could we somehow mark tasks as “safe to release”, so workers know that when they’ve completed all the dependents of a task locally, they can release that task, since no other worker (or client) will need the data?

We can’t say this at submission time, since we haven’t yet scheduled dependencies. (Though tasks with only 1 dependency we could probably eagerly mark as releasable.) But maybe when we assign a task to a worker, we could also look through its immediate dependencies. Any of those that are already assigned to that worker, have no dependents that are unscheduled or scheduled on other workers, and are not requested by a client could be marked as releasable.
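To make the check above concrete, here is a minimal sketch of it on a toy task graph. The `Task` class and `mark_releasable` function are hypothetical names made up for illustration; they are not part of distributed's scheduler, whose real state tracking is considerably more involved.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Task:
    # Hypothetical, simplified stand-in for a scheduler-side task record.
    key: str
    dependencies: Set[str] = field(default_factory=set)  # keys this task needs
    dependents: Set[str] = field(default_factory=set)    # keys that need this task
    worker: Optional[str] = None                         # None means unscheduled
    wanted_by_client: bool = False


def mark_releasable(tasks: Dict[str, Task], key: str, worker: str) -> Set[str]:
    """Assign `key` to `worker` and return the immediate dependencies that
    the worker could release on its own once their local dependents finish."""
    task = tasks[key]
    task.worker = worker
    releasable = set()
    for dep_key in task.dependencies:
        dep = tasks[dep_key]
        # The dependency must live on the same worker and not be held by a client.
        if dep.worker != worker or dep.wanted_by_client:
            continue
        # Every dependent must already be scheduled on this same worker.
        if all(tasks[d].worker == worker for d in dep.dependents):
            releasable.add(dep_key)
    return releasable


tasks = {
    "a": Task("a", dependents={"b", "c"}, worker="w1"),
    "b": Task("b", dependencies={"a"}),
    "c": Task("c", dependencies={"a"}),
}
first = mark_releasable(tasks, "b", "w1")   # "c" is still unscheduled
second = mark_releasable(tasks, "c", "w1")  # now every dependent of "a" is on w1
```

In this toy graph, "a" only becomes releasable when its last dependent is assigned to the same worker. If "c" had gone to a different worker instead, "a" would never be marked.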

This could have a nice balanced-budget property, where in many cases the scheduler couldn’t hand out new tasks to workers without also giving them some tasks to release (in the future).

cc @fjetter @crusaderky @mrocklin

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

2 reactions
fjetter commented, Jul 26, 2021

Behaviour similar to what you are describing was one of the reasons for the deadlocks in recent months. Doing this consistently is very difficult.

I think AMM #4982 will already remove most of the problems motivating this, since AMM could remove replicas on most workers while a few are still using them. The delay in deleting data on these few workers should not destabilize an entire cluster.

FWIW, I believe we could implement something like this on worker side for the few instances where the worker has the complete information (e.g. it has all dependents of a task in memory) but I’m not sure if this is a very common case.

I would suggest holding off until AMM is somewhat operational and then trying to estimate whether we still perceive this to be a problem.

1 reaction
fjetter commented, Jul 27, 2021

Imagine that a worker is in the state that it “has all dependents of a task in memory” and thinks it can erase the task, but additional tasks are submitted that are dependents of said task to the scheduler. This could cause a potential problem, no?

Well, some race condition is unavoidable, but the big question is whether or not we end up in a corrupt state. The worker would only be allowed to forget a key if it also tells the scheduler, so that the key will be rescheduled. Even if this information wasn’t sent to the scheduler, this would trigger a “missing-key” event chain and we’d self-heal. Avoiding this kind of rescheduling is only possible if we do not allow the worker to make any decision (as is the case right now). The question here is which scenario is more common and how big the impact of this “optimistic release” is. Either way, before baking something like this in, we’d need a few good benchmarks. If the numbers are not convincing, I’m inclined not to merge something like this, in favour of reduced complexity, as discussed above.
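The two recovery paths described above can be sketched with a toy model. `ToyWorker`, `ToyScheduler`, and all of their methods are hypothetical and exist only for this illustration; this is not how distributed's scheduler or worker state machines are implemented.

```python
class ToyWorker:
    def __init__(self, name):
        self.name = name
        self.data = set()  # keys currently held in memory


class ToyScheduler:
    def __init__(self):
        self.who_has = {}      # key -> set of workers believed to hold it
        self.rescheduled = []  # keys that had to be recomputed

    def task_finished(self, key, worker):
        worker.data.add(key)
        self.who_has.setdefault(key, set()).add(worker)

    def optimistic_release(self, key, worker, notify=True):
        # The worker decides on its own to drop the key.
        worker.data.discard(key)
        if notify:
            # Path 1: the worker tells the scheduler it forgot the key.
            self.who_has.get(key, set()).discard(worker)

    def need_key(self, key):
        # A late dependent arrives and needs `key`.
        holders = self.who_has.get(key, set())
        for worker in list(holders):
            if key in worker.data:
                return "in-memory"
            # Path 2: stale record, analogous to a "missing-key" event.
            holders.discard(worker)
        self.rescheduled.append(key)
        return "rescheduled"


scheduler = ToyScheduler()
worker = ToyWorker("w1")
scheduler.task_finished("x", worker)
scheduler.optimistic_release("x", worker, notify=False)
outcome = scheduler.need_key("x")  # stale entry is dropped, "x" is rescheduled
```

Either way the toy scheduler ends in a consistent state. The cost is the recomputation of "x", which is the impact the benchmarks mentioned above would need to measure.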
