Add `batch_size` parameter to `Task.map`
Use Case

When you have hundreds of thousands of tasks to map over, the DaskExecutor (and the underlying Dask scheduler) may become overloaded or take a very long time to start up.
Solution
results = my_task.map(*iterables, batch_size=10)

The default for batch_size would be None, but when set, each batch would be completely processed before moving on to the next batch. I emphasize this because distributed just added a batch_size parameter that adds tasks to the scheduler in batches rather than fully processing items in batches.
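To make the requested semantics concrete, here is a minimal plain-Python sketch (not real Prefect or distributed API; the function name is hypothetical) of "fully process each batch before starting the next":

```python
def map_fully_in_batches(func, items, batch_size):
    """Apply func to items, running each batch to completion before
    the next batch starts, so at most batch_size work items are
    outstanding at any time."""
    results = []
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        # Stand-in for "execute this batch to completion":
        results.extend(func(x) for x in batch)
    return results
```

This contrasts with merely *submitting* tasks in batches, where all downstream work is still handed to the scheduler up front.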
Alternatives
Some context on why the "fully process this batch vs. add this batch" distinction is important: we use a library to read images that utilizes dask under the hood. If we want to process thousands of images, each image in the map call may result in thousands more tasks. So even if the data is added to the scheduler in batches, those batches create thousands of tasks themselves as they are added, which overloads the scheduler anyway.
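Absent a built-in batch_size, one workaround (a hypothetical sketch, not the Prefect API; process_batch and inputs are assumed names) is to chunk the inputs yourself and map over the chunks, so each mapped task fully processes one batch of items:

```python
def chunked(seq, batch_size):
    """Yield successive lists of at most batch_size items from seq."""
    for i in range(0, len(seq), batch_size):
        yield seq[i:i + batch_size]

# Hypothetical usage with a user-defined task that loops over its batch:
#   batches = list(chunked(inputs, 10))
#   results = process_batch.map(batches)
```

This keeps the number of top-level mapped tasks down, though it does not by itself limit the nested tasks each item may spawn.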
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 12
- Comments: 10
I’m looking for something similar to what was asked here (batched mapped tasks or nested mapping). Are there any updates on this, @jcrist? I think it would be a very interesting feature for Prefect 🤓
What I’m after is basically to write something along those lines, or something like what was proposed above, so that the list is automatically processed in batches.
Full disclosure, I already asked it in Slack but haven’t got a reply yet. Additionally this has come up earlier:
+1, this would be pretty valuable for when I want to map over an iterable of 300k+ items.