Opening Graph dashboard page causes work starvation on workers due to high scheduler CPU load

Original GitHub issue: https://github.com/dask/distributed/issues/4055

What happened:

Opening the Graph page of the Bokeh dashboard while a large task graph (>100k tasks) is executing puts so much load on the scheduler that it can no longer distribute tasks to workers. Closing the dashboard does not resolve this. Once the page has been opened, the scheduler instance stays in a state where it cannot handle large task graphs and must be restarted.

The configuration option distributed.dashboard.graph-max-items does not protect against this scenario. Opening the dashboard page registers a periodic callback that continuously executes TaskGraph.update, which calls TaskLayout.reset_index and loops over all tasks and their dependencies. That repeated full traversal is what slows the scheduler down.
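To make the mechanism concrete, here is a minimal, standard-library-only sketch of a periodic callback that rebuilds an index over every task and dependency on each tick. It is not the code from distributed: ToyLayout, chain_graph and run_periodic are hypothetical names invented for illustration, and only the general shape (a synchronous full traversal running on the event loop) is taken from the description above.

```python
import asyncio
import time


class ToyLayout:
    """Hypothetical stand-in for a layout that is rebuilt on every update."""

    def __init__(self, dependencies):
        # dependencies: mapping of task key -> list of keys it depends on
        self.dependencies = dependencies
        self.index = {}

    def reset_index(self):
        # Walks every task and every dependency edge: O(tasks + edges) per call.
        self.index = {}
        visited_edges = 0
        for position, (key, deps) in enumerate(self.dependencies.items()):
            self.index[key] = position
            visited_edges += len(deps)
        return visited_edges


def chain_graph(n_tasks):
    # Task i depends on task i - 1; task 0 has no dependencies.
    return {i: ([i - 1] if i > 0 else []) for i in range(n_tasks)}


async def run_periodic(callback, interval, ticks):
    """Call `callback` synchronously on the event loop `ticks` times."""
    blocked = 0.0
    for _ in range(ticks):
        start = time.perf_counter()
        callback()  # nothing else on this event loop can run during this call
        blocked += time.perf_counter() - start
        await asyncio.sleep(interval)
    return blocked


if __name__ == "__main__":
    layout = ToyLayout(chain_graph(100_000))
    blocked = asyncio.run(run_periodic(layout.reset_index, interval=0.01, ticks=5))
    print(f"event loop blocked for {blocked:.3f}s across 5 ticks")
```

Because the callback is synchronous, the time it spends in reset_index is time during which no other coroutine on the same loop can make progress. The cost grows with the size of the graph rather than with what is displayed, which is consistent with a display-side limit not helping.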

Below is the profile-server view of an affected cluster:

[Image: image2020-8-18_14-50-55]

What you expected to happen:

Only marginal impact of bokeh dashboard on compute performance.

Minimal Complete Verifiable Example:

  • Start a cluster
from dask.distributed import Client
client = Client("...")
client.gather(client.map(lambda x: x, range(100000)))
  • Open the Bokeh dashboard, go to the Graph page, and monitor the CPU load

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, Aug 20, 2020

That sounds like a great idea to me. If only we had a Bokeh expert around …

On Thu, Aug 20, 2020 at 7:49 AM, Julia Signell wrote:

I think it makes sense to disable it and show some explanatory message. I have been alarmed by a blank graph page before.


1 reaction
mrocklin commented, Aug 20, 2020

The plugin is used only for this dashboard plot. Removing or pausing it in this situation sounds good to me.
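The idea discussed in these comments (skip the graph plot when the graph is too large and show an explanatory message instead of a blank page) might look roughly like the sketch below. This is an assumption-laden illustration and not the change made in distributed: GuardedGraphView, max_items and layout_fn are hypothetical names.

```python
class GuardedGraphView:
    """Hypothetical view that refuses to lay out graphs above a size limit."""

    def __init__(self, max_items, layout_fn):
        self.max_items = max_items
        self.layout_fn = layout_fn  # the expensive per-task layout pass
        self.layout = None
        self.message = ""

    def update(self, tasks):
        # The size check is cheap; the expensive layout pass is skipped entirely
        # when the graph is too large, so the periodic tick stays fast.
        n_tasks = len(tasks)
        if n_tasks > self.max_items:
            self.layout = None
            self.message = (
                f"Graph has {n_tasks} tasks, above the limit of "
                f"{self.max_items}; the graph layout is disabled."
            )
            return False
        self.layout = self.layout_fn(tasks)
        self.message = ""
        return True


if __name__ == "__main__":
    view = GuardedGraphView(
        max_items=5000,
        layout_fn=lambda tasks: {key: i for i, key in enumerate(tasks)},
    )
    print(view.update(range(100_000)), view.message)
```

The important design point is that the guard runs before any per-task work, so the cost of a tick no longer depends on the graph size once the limit is exceeded, and the user sees a reason rather than an empty page.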
