Profiling Scheduler Performance
When running Dask with TLS security turned on, the scheduler can easily become bottlenecked by SSL communication. See the performance report below, in particular the “Scheduler Profile” tab.
I ran this on my laptop with the following code:

import dask
from dask.distributed import Client, performance_report, wait

# Local cluster with temporary TLS credentials; disable task fusion to keep the graph fine-grained.
client = Client(security=True)
dask.config.set({"optimization.fuse.active": False})

df = dask.datasets.timeseries(
    start="2020-01-01", end="2020-01-31", partition_freq="1h", freq="60s"
).persist()

with performance_report("report.html"):
    df2 = df.set_index("x").persist()
    wait(df2)
When this is run on a larger cluster with many workers this problem becomes significantly worse. What are some ways that we can reduce the cost of secure communication?
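One way to dig into where the scheduler actually spends its time (my own sketch, not a recipe from the issue) is to attach a sampling profiler such as py-spy to the scheduler process; reusing the client from the snippet above, Client.run_on_scheduler can report the PID to attach to:

import os

# Run os.getpid on the scheduler itself to learn its process ID, then point an
# external sampling profiler (e.g. py-spy) at that PID from a shell.
scheduler_pid = client.run_on_scheduler(os.getpid)
print(f"py-spy dump --pid {scheduler_pid}")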
Issue Analytics
- Created: 3 years ago
- Comments: 83 (76 by maintainers)
Top Results From Across the Web
- Profiling the Airflow Scheduler With Flame Graphs (Astronomer): Find out how to benchmark and profile it using py-spy and Flame Graphs. ... 2.0 is to “Improve Scheduler performance and reliability”.
- Linux Scheduler profiling (Skillsire): Designed various benchmarks to profile overheads, latencies, run-time and behavior of different schedulers in a Linux machine.
- Linux scheduler profiling - 1 (Oliver Yang): Our scheduling perf profiling goal is to understand how the scheduler behaves from these 3 perspectives, under a certain workload or benchmark.
- perf sched for Linux CPU scheduler analysis (Brendan Gregg): The size and rate will be relative to your workload and number of CPUs (this example is an 8 CPU server running a...
- Scheduler profiling is disabled (Datadog Docs): Profiling allows for the identification of specific performance bottlenecks. It generates a significant amount of program data that could potentially be ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@pitrou Just curious why you think 10ms. I’d guess 5ms, based on the default of sys.getswitchinterval.
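For reference, the default interval is easy to check (plain CPython behaviour, nothing Dask-specific):

import sys

print(sys.getswitchinterval())   # 0.005 seconds, i.e. 5ms, by default
# It can be tuned process-wide, e.g. sys.setswitchinterval(0.001)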
I’ve been thinking a bit about the best way to describe why it is/looks so costly. Would you agree that it’s not per se the GIL switch that is costly (e.g. the thread context switch is relatively cheap), but rather that a thread which frequently releases the GIL, while the other threads don’t release it within 5ms (like threads running pure Python code), will too often end up waiting for the GIL?
I think it is the attempt to return from the (Python) send function, which tries to obtain the GIL. The (Python) send function releases the GIL, does the send syscall, which returns in about 1µs, but in order to return to Python land it first needs to re-acquire the GIL. If other threads are not as eager to release the GIL (such as pure Python code, which will do that only after 5ms, or a long-running C function that does not release the GIL at all), send will effectively take 5ms.

This is how I understand the situation described in https://bugs.python.org/issue7946, which I’ve gone over a bit in https://github.com/maartenbreddels/fastblog/issues/3#issuecomment-760891430 (the comment section of https://www.maartenbreddels.com/perf/jupyter/python/tracing/gil/2021/01/14/Tracing-the-Python-GIL.html).
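A minimal sketch of the effect being described (my own illustration, not code from the issue): a tiny socket send releases the GIL around the syscall, but re-acquiring it while a pure-Python thread holds the GIL takes roughly the 5ms switch interval:

import socket
import threading
import time

def gil_hog(stop):
    # Pure-Python busy loop: it only gives up the GIL every sys.getswitchinterval() seconds.
    x = 0
    while not stop.is_set():
        x += 1

a, b = socket.socketpair()
stop = threading.Event()
threading.Thread(target=gil_hog, args=(stop,), daemon=True).start()

worst = 0.0
for _ in range(100):
    t0 = time.perf_counter()
    a.send(b"x")     # the syscall itself is ~1us; the GIL is released around it
    worst = max(worst, time.perf_counter() - t0)
stop.set()

# With the hog thread running, the worst case tends towards the ~5ms switch interval.
print(f"worst send(): {worst * 1000:.1f} ms")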
I think giltracer might tell you if the picture painted here is correct. If that is the case, there are no good solutions I think. Workarounds/bandaids I can think of:

- adding calls to time.sleep(0) to … (a rough sketch of this idea follows below)

Possibly you may be able to identify a C function that does not release the GIL (unlikely), or you can make a strong case for writing a C extension to work around this. This is actually the primary reason I built this tool and wrote the article: I want to make sure, before building a C version of a ThreadPoolExecutor, that it’s worth it (and I’m still not sure about it).
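To illustrate the time.sleep(0) bandaid from the list above (again a sketch of my own, with arbitrary numbers): sprinkling an occasional time.sleep(0) into a GIL-hogging pure-Python loop lets a waiting thread grab the GIL well before the 5ms switch interval expires, at the cost of some throughput in the hog thread:

import time

def friendlier_hog(stop):
    # Same busy loop as gil_hog above, but it voluntarily releases the GIL
    # every 1000 iterations via time.sleep(0).
    x = 0
    while not stop.is_set():
        x += 1
        if x % 1000 == 0:
            time.sleep(0)    # briefly releases (and re-acquires) the GIL

# Re-running the send() measurement above with friendlier_hog in place of gil_hog
# should bring the worst-case send() time well below the 5ms switch interval.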
For performance at scale I’d personally be happy with favoring Linux 😃