
Consider defaulting k8s_api_threadpool_workers to c.JupyterHub.concurrent_spawn_limit

See original GitHub issue

c.KubeSpawner.k8s_api_threadpool_workers defaults to 5 * ncpu [1], which is also what a ThreadPoolExecutor in Python defaults to [2]. The description of that option says:

Increase this if you are dealing with a very large number of users.

In our setup, the core node where the hub pod runs is a 4-CPU node, since the hub itself doesn’t use more than 1 CPU. This means that by default k8s_api_threadpool_workers only gets 20 workers.
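For concreteness, here is how that default works out on a node like ours, as a small sketch assuming the 5 * ncpu formula from [1]:

import multiprocessing

# KubeSpawner's default per [1]: 5 worker threads per CPU for its blocking
# Kubernetes API calls. On a 4-CPU core node this comes out to 20.
default_workers = 5 * multiprocessing.cpu_count()
print(default_workers)  # -> 20 on a 4-CPU node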

The c.JupyterHub.concurrent_spawn_limit option defaults to 100 [3] but in zero-to-jupyterhub-k8s is set to 64 [4].

It seems that if you have a lot of users logging in and spawning notebook pods at the same time, like at the beginning of a large user event, you would want k8s_api_threadpool_workers aligned with concurrent_spawn_limit; otherwise those spawn requests could end up waiting on the thread pool.

We could default k8s_api_threadpool_workers to concurrent_spawn_limit, or at least mention the relationship between the two options in their config help docs. An operator-side workaround is sketched below.
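In the meantime, an operator can align the two options explicitly. This is just a sketch, assuming both settings live in the same jupyterhub_config.py (or a z2jh hub.extraConfig block) and that you are on a kubespawner version that still has the thread pool; the 64 mirrors the z2jh default [4]:

# jupyterhub_config.py -- keep the spawner's thread pool at least as large
# as the number of spawns the hub will run concurrently.
c.JupyterHub.concurrent_spawn_limit = 64
c.KubeSpawner.k8s_api_threadpool_workers = 64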

[1] https://github.com/jupyterhub/kubespawner/blob/5521d573c272/kubespawner/spawner.py#L199
[2] https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor
[3] https://jupyterhub.readthedocs.io/en/stable/api/app.html#jupyterhub.app.JupyterHub.concurrent_spawn_limit
[4] https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/e4b9ce7eab5c17325e93975de1d6b4a200d47cd8/jupyterhub/values.yaml#L16

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 4
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

2 reactions
betatim commented, Jul 22, 2020

To decide whether the thread pool size needs to be bigger, I think we need a measurement of how many requests that use the thread pool end up having to wait. My intuition is similar to Erik’s in that I think each spawn should only use a slot in the thread pool for a second or so while it is sending the POST request. Maybe that isn’t true though, which is where some measurements of how long these requests take and how often a request ends up getting queued before being executed would help.

1 reaction
mriedem commented, Jul 22, 2020

@consideRatio thanks for digging into this in detail.

I did change c.KubeSpawner.k8s_api_threadpool_workers to match c.JupyterHub.concurrent_spawn_limit in our z2jh extraConfig value like this:

c.KubeSpawner.k8s_api_threadpool_workers = c.JupyterHub.concurrent_spawn_limit

I’m assuming that worked since the hub started up fine, but I’m not sure if the value was actually assigned correctly, since I don’t know how to dump the hub’s settings at runtime [1].

Assuming it was correctly configured, I ran a load testing script to create 400 users (POST /users), start the user notebook servers (pods) in batches of 10 (using a ThreadPoolExecutor, since the POST /users/{name}/server API can take a while, roughly 7-10 seconds in our environment), and then wait for them to report ready: True.
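For context, the load test followed roughly this shape. This is a simplified sketch rather than the actual script: the hub URL, API token, and load-user-N names are placeholders, it uses the standard JupyterHub REST API endpoints for creating users and starting servers, and the exact user-model fields (servers / ready) can vary between JupyterHub versions:

import time
from concurrent.futures import ThreadPoolExecutor

import requests

HUB_API = "https://hub.example.com/hub/api"    # placeholder hub URL
HEADERS = {"Authorization": "token REDACTED"}  # placeholder admin API token
USERS = [f"load-user-{i}" for i in range(400)]

def create_user(name):
    # POST /users/{name} creates the user if it doesn't exist yet.
    requests.post(f"{HUB_API}/users/{name}", headers=HEADERS)

def start_server(name):
    # POST /users/{name}/server asks the hub to spawn the user's notebook pod;
    # in our environment this call alone took ~7-10 seconds.
    requests.post(f"{HUB_API}/users/{name}/server", headers=HEADERS)

def wait_until_ready(name, timeout=600):
    # Poll the user model until the default server reports ready.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        user = requests.get(f"{HUB_API}/users/{name}", headers=HEADERS).json()
        server = user.get("servers", {}).get("", {})
        if server.get("ready"):
            return True
        time.sleep(5)
    return False

for name in USERS:
    create_user(name)

# Start servers in batches of 10 so the slow POST /users/{name}/server calls
# overlap, then wait for each batch to become ready before moving on.
with ThreadPoolExecutor(max_workers=10) as pool:
    for i in range(0, len(USERS), 10):
        batch = USERS[i:i + 10]
        list(pool.map(start_server, batch))
        list(pool.map(wait_until_ready, batch))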

Comparing times between having c.KubeSpawner.k8s_api_threadpool_workers at the default (20 for us on a 4-CPU core node) and then set to c.JupyterHub.concurrent_spawn_limit (64 per z2jh), it was slightly faster, but only by about 3%, which is probably within the margin of error; I’m guessing if I ran both scenarios more times and averaged them out the gain wouldn’t be very noticeable. This likely reinforces the idea that the thread pool size is not an issue.

As for how this could be measured, I’m not really sure how to measure the time a Future spends waiting in the pool before it is executed. It might be possible to track the overall time spent on a Future by using add_done_callback with a partial function that holds a start time and computes the elapsed time when the callback fires, but that wouldn’t really tell us how long the Future sat in the pool. It could still be a reasonable warning flag if you set some threshold and log a warning when a request takes more than x seconds to complete. I don’t see an easy way to track wait time in the thread pool from the standard library, and sub-classing ThreadPoolExecutor to time things doesn’t seem like much fun either (I guess it depends on your idea of fun). Other ideas?
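One low-effort option that avoids sub-classing is to wrap the callable at submit time so it records when it actually starts running and compares that to when it was submitted. A rough sketch only: the helper name and the 1-second threshold are made up, and in kubespawner this would have to be applied wherever work is handed to the pool:

import time
from concurrent.futures import ThreadPoolExecutor

def submit_with_wait_timing(pool, fn, *args, **kwargs):
    """Submit fn to pool and log how long it sat queued before starting."""
    submitted = time.monotonic()

    def wrapper():
        started = time.monotonic()
        wait = started - submitted
        if wait > 1.0:  # arbitrary threshold for a warning
            print(f"{fn.__name__} waited {wait:.2f}s in the thread pool")
        return fn(*args, **kwargs)

    return pool.submit(wrapper)

# Example usage with a deliberately tiny pool so some tasks queue up.
pool = ThreadPoolExecutor(max_workers=2)
futures = [submit_with_wait_timing(pool, time.sleep, 0.5) for _ in range(6)]
for f in futures:
    f.result()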

[1] https://discourse.jupyter.org/t/is-there-a-way-to-dump-hub-app-settings-config/5305
