Opening Graph dashboard page causes work starvation on workers due to high scheduler CPU load

What happened:

Opening the Graph page of the Bokeh dashboard while executing a large task graph (>100k tasks) puts so much load on the scheduler that it can no longer distribute tasks to workers. Closing the dashboard does not resolve the issue. Once the page has been opened, the scheduler instance can no longer handle large task graphs and must be restarted.

The configuration option distributed.dashboard.graph-max-items does not protect against this scenario. Opening the dashboard page registers a periodic callback that continuously executes TaskGraph.update. That method calls TaskLayout.reset_index and loops over all tasks and their dependencies, which slows down the scheduler.
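To illustrate why a cap on displayed items alone does not help, here is a minimal sketch in plain Python. It does not use distributed's real classes: `ToyLayout`, `ToyGraphView`, and the `max_items` argument are hypothetical stand-ins for `TaskLayout`, `TaskGraph`, and `distributed.dashboard.graph-max-items`. The assumption, taken from the description above, is that each update walks every task and dependency before any cap on rendered items is considered.

```python
class ToyLayout:
    """Hypothetical stand-in for a layout that indexes every task."""

    def __init__(self, dependencies):
        # dependencies: dict mapping task key -> list of keys it depends on
        self.dependencies = dependencies
        self.visited = 0  # tasks and dependency edges touched so far

    def reset_index(self):
        # Walk every task and every dependency edge: O(tasks + edges).
        index = {}
        for i, (key, deps) in enumerate(self.dependencies.items()):
            index[key] = i
            self.visited += 1
            for _ in deps:
                self.visited += 1
        return index


class ToyGraphView:
    """Hypothetical stand-in for the component driven by a periodic callback."""

    def __init__(self, layout, max_items):
        self.layout = layout
        self.max_items = max_items

    def update(self):
        # The full pass happens first (assumption based on the report) ...
        index = self.layout.reset_index()
        # ... and the cap only decides whether anything is drawn.
        if len(index) > self.max_items:
            return 0
        return len(index)


# A linear chain of 100,000 tasks: task i depends on task i - 1.
deps = {i: ([i - 1] if i else []) for i in range(100_000)}
layout = ToyLayout(deps)
view = ToyGraphView(layout, max_items=5_000)

drawn = view.update()
print(drawn, layout.visited)
```

In this toy model nothing is drawn, yet every call to `update` still touches all tasks and edges. Repeated on a timer inside the scheduler's event loop, that cost would compete with task scheduling.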

Below is the profile-server view of an affected cluster:

[Image: image2020-8-18_14-50-55]

What you expected to happen:

The Bokeh dashboard should have only a marginal impact on compute performance.

Minimal Complete Verifiable Example:

Start a cluster

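from dask.distributed import Client

client = Client("...")  # address of the running scheduler
# Submit 100,000 trivial tasks and wait for all results
client.gather(client.map(lambda x: x, range(100000)))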

Open the Bokeh dashboard, go to the Graph page, and monitor the scheduler's CPU load.

Issue Analytics

State:
Created 3 years ago
Comments: 6 (6 by maintainers)

Top GitHub Comments

mrocklin commented, Aug 20, 2020

That sounds like a great idea to me. If only we had a Bokeh expert around …

On Thu, Aug 20, 2020 at 7:49 AM Julia Signell <notifications@github.com> wrote:

I think it makes sense to disable it and show some explanatory message. I have been alarmed by a blank graph page before.


mrocklin commented, Aug 20, 2020

The plugin is used only for this dashboard plot. Removing or pausing it in this situation sounds good to me.
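One way to picture the suggestion in this thread (skip the plot above a size limit, show an explanatory message, and pause the expensive work) is the sketch below. It is not distributed's API: `PausableGraphView`, `task_count`, and `expensive_update` are hypothetical names, and the threshold of 5,000 is an arbitrary illustration.

```python
class PausableGraphView:
    """Hypothetical dashboard component that refuses to index huge graphs."""

    def __init__(self, task_count, expensive_update, max_items=5_000):
        self.task_count = task_count              # cheap callable: current number of tasks
        self.expensive_update = expensive_update  # callable doing the full layout pass
        self.max_items = max_items
        self.message = ""

    def update(self):
        n = self.task_count()
        if n > self.max_items:
            # Skip the full pass entirely and tell the user why the plot is blank.
            self.message = (
                f"Graph hidden: {n} tasks exceeds the limit of {self.max_items}."
            )
            return False
        self.message = ""
        self.expensive_update()
        return True


calls = []
view = PausableGraphView(lambda: 100_000, lambda: calls.append("layout pass"))
updated = view.update()
print(updated, view.message, calls)
```

The key design point in this sketch is that the size check happens before the expensive pass and relies only on a cheap count, so an oversized graph costs almost nothing per callback.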
