Add `batch_size` parameter to `Task.map`
Use Case

When you have hundreds of thousands of tasks to map over, the DaskExecutor (and the underlying Dask scheduler) may become overloaded or take a very long time to start up.
Solution
results = my_task.map(*iterables, batch_size=10)

The default for batch_size would be None, but when set, each batch would be completely processed before moving on to the next batch. I emphasize this because distributed just added a batch_size parameter that adds tasks to the scheduler in batches rather than fully processing items in batches.
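To make the requested semantics concrete, here is a minimal plain-Python sketch (not real Prefect or distributed API; the function name is hypothetical) of "fully process each batch before starting the next":

```python
def map_fully_in_batches(func, items, batch_size):
    """Apply func to items, running each batch to completion before
    the next batch starts, so at most batch_size work items are
    outstanding at any time."""
    results = []
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        # Stand-in for "execute this batch to completion":
        results.extend(func(x) for x in batch)
    return results
```

This contrasts with merely *submitting* tasks in batches, where all downstream work is still handed to the scheduler up front.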
Alternatives
Some context on why the "fully process this batch vs. add this batch" distinction is important: we use a library to read images that utilizes dask under the hood. If we want to process thousands of images, each image in the map call may result in thousands more tasks. So even if the data is added to the scheduler in batches, those batches create thousands of tasks themselves as they are added, which overloads the scheduler anyway.
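Absent a built-in batch_size, one workaround (a hypothetical sketch, not the Prefect API; process_batch and inputs are assumed names) is to chunk the inputs yourself and map over the chunks, so each mapped task fully processes one batch of items:

```python
def chunked(seq, batch_size):
    """Yield successive lists of at most batch_size items from seq."""
    for i in range(0, len(seq), batch_size):
        yield seq[i:i + batch_size]

# Hypothetical usage with a user-defined task that loops over its batch:
#   batches = list(chunked(inputs, 10))
#   results = process_batch.map(batches)
```

This keeps the number of top-level mapped tasks down, though it does not by itself limit the nested tasks each item may spawn.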
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 12
- Comments: 10
I’m looking for something similar to what was asked here (batched mapped tasks or nested mapping). Are there any updates on this, @jcrist? I think it would be a very interesting feature for Prefect 🤓
What I’m after is basically to write something along those lines, or something like what was proposed above, so that the list is automatically processed in batches.
Full disclosure, I already asked it in Slack but haven’t got a reply yet. Additionally this has come up earlier:
+1, this would be pretty valuable for when I want to map over an iterable of 300k+ items.