Multiple ThreadPoolExecutors

(I think that I’ve raised this before, but I couldn’t find it. I suspect that it was part of commentary on an issue rather than a standalone issue itself)

Today we run all tasks in a ThreadPoolExecutor living at Worker.executor. We default the size of this executor to the number of logical CPU cores on a machine. This works great most of the time, but there are some cases where we would like something different.

  1. I/O related tasks we could consider running on the event loop itself, or with a separate Tornado based AsyncExecutor
  2. For GPU related tasks we would prefer to have a separate executor with a single thread (or in the near future a few threads)
  3. For noxious tasks that leak memory folks have asked for a separate ProcessPoolExecutor
  4. Some folks have asked for a special executor for restricted resource tasks
  5. Actors run today on their own executor

In practice, the GPU pool is probably the most common case today.

So perhaps we should encode multiple executors into the Worker, and have tasks split between them based on annotations/resources/gpu flags.

executor = self.executors[task.executor or "cpu"]
self.submit_on_executor(executor, task, *args, **kwargs)
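
As a minimal, runnable sketch of the dispatch idea above (not dask's actual Worker implementation): the `Task` record, the `MiniWorker` class and the "gpu" pool name are hypothetical names invented for illustration. Only the `executors[task.executor or "cpu"]` lookup comes from the snippet above.

```python
import os
import threading
from concurrent.futures import Executor, Future, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Task:
    # Hypothetical task record; `executor` stands in for an annotation or resource flag.
    func: Callable
    executor: Optional[str] = None


class MiniWorker:
    """Toy stand-in for a worker holding several named executors (not dask's Worker)."""

    def __init__(self) -> None:
        self.executors: Dict[str, Executor] = {
            # Default pool sized to the number of logical CPU cores.
            "cpu": ThreadPoolExecutor(
                max_workers=os.cpu_count() or 1, thread_name_prefix="cpu"
            ),
            # Single-threaded pool so GPU-style tasks never run concurrently.
            "gpu": ThreadPoolExecutor(max_workers=1, thread_name_prefix="gpu"),
        }

    def submit(self, task: Task, *args, **kwargs) -> Future:
        # Tasks without an explicit executor fall back to the "cpu" pool.
        executor = self.executors[task.executor or "cpu"]
        return executor.submit(task.func, *args, **kwargs)

    def close(self) -> None:
        for executor in self.executors.values():
            executor.shutdown(wait=True)


def current_thread_name() -> str:
    return threading.current_thread().name


worker = MiniWorker()
cpu_thread = worker.submit(Task(current_thread_name)).result()
gpu_thread = worker.submit(Task(current_thread_name, executor="gpu")).result()
worker.close()
```

Keeping the pools in a plain dict means a ProcessPoolExecutor or a restricted-resource pool could be registered under another key without touching the dispatch line.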

cc @dask/gpu

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 14 (13 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Jun 4, 2021

@jakirkham : that’s already the case, and you could set the fsspec backend’s loop to be the one it needs to be; but zarr will still do its part decoding synchronously. You’d have to pass the filters down to the storage layer and replicate the work there - but then it would no longer be pure IO.

I think a rewrite in which we can fetch multiple blocks of bytes in a single task and pass to a separate dataframe-making task (without concat!) would work well for CSV. Parquet and just about anything else where we don’t pass bytes around is more complicated. Fastparquet, for example, isn’t interested in running in multiple threads like arrow can because “dask can solve that case” (not that it does a good job of releasing the GIL).
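
A rough sketch of what that split could look like for CSV, using only the standard library. Everything here is an assumption for illustration: `FAKE_STORE`, `fetch_blocks` and `parse_blocks` are hypothetical names, the in-memory dict stands in for remote storage, and a list of tuples stands in for a dataframe.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# In-memory stand-in for remote storage; real code would go through a filesystem layer.
FAKE_STORE: Dict[str, bytes] = {
    "part-0.csv": b"x,y\n1,2\n3,4\n",
    "part-1.csv": b"x,y\n5,6\n",
}


def fetch_blocks(paths: List[str]) -> List[bytes]:
    """I/O-only task: fetch several blocks of bytes in a single task."""
    return [FAKE_STORE[path] for path in paths]


def parse_blocks(blocks: List[bytes]) -> List[Tuple[int, int]]:
    """CPU-only task: decode each block in turn, never concatenating the raw bytes."""
    rows: List[Tuple[int, int]] = []
    for block in blocks:
        reader = csv.DictReader(io.StringIO(block.decode("utf-8")))
        rows.extend((int(row["x"]), int(row["y"])) for row in reader)
    return rows


# Separate pools so that waiting on storage does not occupy the CPU threads.
with ThreadPoolExecutor(max_workers=4, thread_name_prefix="io") as io_pool, \
        ThreadPoolExecutor(max_workers=2, thread_name_prefix="cpu") as cpu_pool:
    blocks = io_pool.submit(fetch_blocks, list(FAKE_STORE)).result()
    rows = cpu_pool.submit(parse_blocks, blocks).result()
```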

Note that the PR I linked above for fastparquet improved dataset open time by 10x for a dataset on s3 without _metadata (one of the test datasets with many files).

1 reaction
mrocklin commented, Jun 4, 2021

> > I suspect that if we had layers that were strictly IO

> This is almost never the case

Yeah, to be clear, I’m saying that if we were to change how dask collections handle IO, by moving read_bytes calls into fully separable tasks, then we could take advantage of this. You had mentioned this in the past I think.

It wouldn’t work for Zarr, you’re right, because that abstraction hides I/O from us, but it could work for Parquet, CSV, and others if we wanted to make that explicit split. I’m not suggesting that we do this today, or any time in the moderate future.
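
To make the explicit split concrete, here is a small asyncio-only sketch, not how dask collections work today: pure-I/O tasks wait on the event loop while decoding goes to a thread pool. The `read_bytes` coroutine below is a hypothetical stand-in (a sleep replaces the network round trip), and `decode` is an invented CPU-side step.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import List


async def read_bytes(path: str) -> bytes:
    """Hypothetical pure-I/O task; the sleep stands in for a network round trip."""
    await asyncio.sleep(0.01)
    return f"{path}:payload".encode()


def decode(raw: bytes) -> str:
    """CPU-side task, kept separate from the I/O step."""
    return raw.decode().upper()


async def main(paths: List[str]) -> List[str]:
    loop = asyncio.get_running_loop()
    # All reads wait concurrently on the event loop and occupy no worker threads.
    raw_blocks = await asyncio.gather(*(read_bytes(path) for path in paths))
    with ThreadPoolExecutor(max_workers=2) as cpu_pool:
        # Decoding is handed to the thread pool, one task per block.
        return await asyncio.gather(
            *(loop.run_in_executor(cpu_pool, decode, raw) for raw in raw_blocks)
        )


results = asyncio.run(main(["a", "b", "c"]))
```

This only pays off when the read step is genuinely pure I/O, which is the condition discussed above.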
