Opening Graph dashboard page causes work starvation on workers due to high scheduler CPU load
What happened:
Opening the Graph page of the bokeh dashboard while executing a large task graph (>100k tasks) puts so much load on the scheduler that it can no longer distribute tasks to workers. Closing the dashboard again does not resolve the issue. Once the page has been opened, the scheduler instance is left in a state where it can never handle large task graphs again and must be restarted.
The configuration option distributed.dashboard.graph-max-items does not protect against this scenario. Opening the dashboard page registers a periodic callback that continuously executes TaskGraph.update, which calls TaskLayout.reset_index and loops over all tasks and their dependencies, slowing down the scheduler.
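To illustrate why a periodic full walk over the graph is expensive, here is a minimal, self-contained sketch. It does not use the real distributed classes: ToyLayout, update, build_chain, and the max_items guard are hypothetical names invented for this example, and the guard only shows the kind of protection one might expect from a size limit, not how the actual option behaves.

```python
import time


class ToyLayout:
    """Hypothetical stand-in for a layout that indexes every task."""

    def __init__(self, dependencies):
        # dependencies maps a task key to the list of keys it depends on
        self.dependencies = dependencies
        self.index = {}
        self.visited = 0  # number of task and dependency entries touched so far

    def reset_index(self):
        # Walks every task and every dependency edge: O(tasks + edges) per call.
        self.index = {}
        for i, (key, deps) in enumerate(self.dependencies.items()):
            self.index[key] = i
            self.visited += 1 + len(deps)


def update(layout, max_items=None):
    """Toy periodic-callback body; max_items is an assumed guard."""
    if max_items is not None and len(layout.dependencies) > max_items:
        return False  # skip the expensive walk for oversized graphs
    layout.reset_index()
    return True


def build_chain(n):
    # A linear chain: task i depends on task i - 1.
    return {i: ([i - 1] if i else []) for i in range(n)}


if __name__ == "__main__":
    layout = ToyLayout(build_chain(100_000))
    start = time.perf_counter()
    for _ in range(10):  # ten ticks of the periodic callback
        update(layout)
    elapsed = time.perf_counter() - start
    print(f"unguarded: {elapsed:.3f}s, entries visited: {layout.visited}")

    guarded = ToyLayout(build_chain(100_000))
    for _ in range(10):
        update(guarded, max_items=5_000)
    print(f"guarded: entries visited: {guarded.visited}")
```

In this toy model the unguarded callback touches every task on every tick, so its cost grows with the graph size regardless of whether anything changed, while the guarded variant does no work once the graph exceeds the limit.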
Below is the profile-server view of an affected cluster
What you expected to happen:
Only marginal impact of bokeh dashboard on compute performance.
Minimal Complete Verifiable Example:
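- Start a cluster
- Connect a client and run a large number of trivial tasks:
    from dask.distributed import Client
    client = Client("...")
    client.gather(client.map(lambda x: x, range(100000)))
- Open the bokeh dashboard, open the Graph page, and monitor the scheduler CPU load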
Top GitHub Comments
That sounds like a great idea to me. If only we had a Bokeh expert around …
On Thu, Aug 20, 2020 at 7:49 AM Julia Signell notifications@github.com wrote:
The plugin is used only for this dashboard plot. Removing or pausing it in this situation sounds good to me.
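
As an illustration of the "remove or pause" idea discussed above, the following is a minimal sketch under stated assumptions. It is a toy model, not the distributed API: ToyScheduler, ToyGraphPlugin, and ToyGraphPage are hypothetical classes, and the open/close hooks stand in for whatever session lifecycle events the dashboard actually exposes.

```python
class ToyScheduler:
    """Hypothetical scheduler that notifies plugins on every task transition."""

    def __init__(self):
        self.plugins = []

    def add_plugin(self, plugin):
        self.plugins.append(plugin)

    def remove_plugin(self, plugin):
        self.plugins.remove(plugin)

    def transition(self, key):
        # Every registered plugin adds work to each transition.
        for plugin in self.plugins:
            plugin.transition(key)


class ToyGraphPlugin:
    """Hypothetical plugin that tracks tasks for the graph plot."""

    def __init__(self):
        self.seen = 0

    def transition(self, key):
        self.seen += 1


class ToyGraphPage:
    """Installs the plugin for the first viewer and removes it after the last."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.plugin = None
        self.viewers = 0

    def open(self):
        self.viewers += 1
        if self.plugin is None:
            self.plugin = ToyGraphPlugin()
            self.scheduler.add_plugin(self.plugin)

    def close(self):
        if self.viewers == 0:
            return  # nothing to close
        self.viewers -= 1
        if self.viewers == 0 and self.plugin is not None:
            self.scheduler.remove_plugin(self.plugin)
            self.plugin = None
```

With this arrangement the scheduler only pays the plugin's per-transition cost while at least one viewer has the page open, which is the behaviour the report expected after closing the dashboard.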