
Profiling Scheduler Performance

When running Dask with TLS security turned on, the scheduler can easily become bottlenecked by SSL communication. See the performance report below, in particular the “Scheduler Profile” tab.

https://gistcdn.githack.com/mrocklin/1d24fbac2c66364d8717952ee8827c64/raw/9c87343cb358b54eb26dbfe8b0714120c0f5ad81/report.html

I ran this on my laptop with the following code:

import dask
from dask.distributed import Client, performance_report, wait

# Local cluster with TLS: security=True creates temporary self-signed credentials
client = Client(security=True)

# Turn off low-level task fusion
dask.config.set({"optimization.fuse.active": False})

df = dask.datasets.timeseries(
    start="2020-01-01", end="2020-01-31", partition_freq="1h", freq="60s"
).persist()

# Capture an HTML report (including the "Scheduler Profile" tab) for the shuffle
with performance_report("report.html"):
    df2 = df.set_index("x").persist()
    wait(df2)

When this is run on a larger cluster with many workers, the problem becomes significantly worse. What are some ways that we can reduce the cost of secure communication?

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 83 (76 by maintainers)

Top GitHub Comments

3 reactions
maartenbreddels commented, Jan 21, 2021

A small fraction of the calls have to wait for the GIL before returning, and therefore take more than 10 ms.

@pitrou Just curious why you think 10 ms; I’d guess 5 ms, from the default of sys.getswitchinterval.
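
For reference, a quick check of that default (a standalone sketch, not from the thread):

import sys

# The switch interval is how long a thread may keep the GIL before the
# interpreter forces it to yield; CPython's default is 0.005 s (5 ms).
print(sys.getswitchinterval())   # 0.005 on a stock CPython build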

System calls like socket.send release the GIL, so they tend to attract GIL switches, which is why they can look so costly 😃

I’ve been thinking a bit about the best way to describe why it is (or looks) so costly. Would you agree that it’s not per se the GIL switch that is costly (the thread context switch itself is relatively cheap), but rather that a thread which releases the GIL often (well within 5 ms), while the other threads don’t (e.g. threads running pure Python code), will too often end up waiting for the GIL?

“what is taking time and making things slow?”

I think it is the attempt to return from the (Python) send function, which has to reacquire the GIL. The (Python) send function releases the GIL, does the send syscall, which returns in about 1 µs, but to get back into Python land it first needs to reacquire the GIL. If the other threads are not as eager to release the GIL (pure Python code will only do so after 5 ms, and a long-running C function that does not release the GIL won’t do so at all), send will effectively take 5 ms.

This is how I understand the situation described in https://bugs.python.org/issue7946, which I went over a bit in https://github.com/maartenbreddels/fastblog/issues/3#issuecomment-760891430 (the comment section of https://www.maartenbreddels.com/perf/jupyter/python/tracing/gil/2021/01/14/Tracing-the-Python-GIL.html ).
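
To make that concrete, here is a minimal standalone sketch (not from the thread, and independent of Dask) of the effect being described: a GIL-releasing send() is timed while another thread runs pure Python and therefore only yields the GIL once per switch interval.

import socket
import threading
import time

def busy(stop):
    # Pure-Python loop: it holds the GIL and only gives it up when the
    # interpreter forces a switch, i.e. once per switch interval (~5 ms).
    x = 0
    while not stop.is_set():
        x += 1

a, b = socket.socketpair()
stop = threading.Event()
threading.Thread(target=busy, args=(stop,), daemon=True).start()

worst = 0.0
for _ in range(100):
    t0 = time.perf_counter()
    a.send(b"x")     # releases the GIL around the syscall ...
    worst = max(worst, time.perf_counter() - t0)   # ... but must reacquire it to return
    b.recv(16)

stop.set()
print("worst send():", round(worst * 1e3, 1), "ms")   # tends to approach the 5 ms switch interval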

I think giltracer might tell you whether the picture painted here is correct. If it is, I don’t think there are good solutions. Workarounds/bandaids I can think of:

  • call a GIL-releasing function (e.g. time.sleep(0)) from the other threads, so that they give up the GIL sooner
  • call/change sys.setswitchinterval (see the sketch after this list)
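
A sketch of that second workaround (again not from the thread); it trades the wait described above for more frequent thread switches overall:

import sys

# A smaller switch interval forces a pure-Python thread to hand over the GIL
# sooner, so a GIL-releasing call like socket.send spends less time waiting to
# reacquire it on return. The cost is more context switching overall.
sys.setswitchinterval(0.001)     # 1 ms instead of the default 5 ms
print(sys.getswitchinterval())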

You may possibly be able to identify a C function that does not release the GIL (unlikely), or you could make a strong case for writing a C extension to work around this. This is actually the primary reason I built this tool and wrote the article: I want to make sure, before building a C version of a ThreadPoolExecutor, that it’s worth it (and I’m still not sure about it).

2 reactions
mrocklin commented, Feb 12, 2021

For performance at scale I’d personally be happy with favoring Linux 😃

On Fri, Feb 12, 2021 at 3:45 PM jakirkham notifications@github.com wrote:

Sorry I think I misunderstood sendmsg. That’s probably sufficient.

Do you know if this would work on Windows or is it limited to Unix (or just Linux)?

View it on GitHub: https://github.com/dask/distributed/issues/4443#issuecomment-778471846
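
For context on the sendmsg question above: socket.sendmsg does a gather-write, sending several buffers in one system call, and in CPython it is only exposed on Unix-like platforms (so not on Windows). A small standalone sketch, independent of the distributed code under discussion:

import socket

a, b = socket.socketpair()
if hasattr(socket.socket, "sendmsg"):
    # Gather-write: both buffers go out in a single system call, without
    # first being concatenated into one bytes object.
    a.sendmsg([b"header", b"payload"])
    print(b.recv(64))    # b'headerpayload'
else:
    print("socket.sendmsg is not available on this platform (e.g. Windows)")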
