
Add `batch_size` parameter to `Task.map`

See original GitHub issue

Use Case

Please provide a use case to help us understand your request in context

When you have hundreds of thousands of tasks to map over, the DaskExecutor (and the underlying Dask scheduler) may be overloaded, or may take a very long time to start up.

Solution

Please describe your ideal solution

results = my_task.map(*iterables, batch_size=10)

batch_size defaults to None, but can be set so that each batch of data is completely processed before moving on to the next batch.

I emphasize this because distributed just added a batch_size parameter that adds tasks to the scheduler in batches, rather than fully processing items in batches.
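To illustrate the requested semantics, here is a minimal plain-Python sketch of "fully process each batch before starting the next" (this is not Prefect or distributed API; `batched_map` is a hypothetical name used only for illustration):

```python
from itertools import islice

def batched_map(fn, iterable, batch_size):
    """Apply fn to items batch by batch; each batch is fully
    processed before the next batch is even pulled from the input."""
    it = iter(iterable)
    results = []
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        # In a real executor the items within a batch would run in
        # parallel; the key point is the next batch is not submitted
        # until this one has finished.
        results.extend(fn(x) for x in batch)
    return results

squares = batched_map(lambda x: x * x, range(7), batch_size=3)
# processes [0, 1, 2], then [3, 4, 5], then [6]
```

Contrast this with distributed's batch_size, which only staggers how quickly tasks are handed to the scheduler; the full graph still ends up submitted.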

Alternatives

Please describe any alternatives you’ve considered, even if you’ve dismissed them

Some context on why the distinction between “fully process this batch” and “add this batch” is important: we use a library to read images that utilizes dask under the hood. If we want to process thousands of images, each image in the map call may result in thousands more tasks. So even if the data is added to the scheduler in batches, those batches create thousands of tasks themselves as they are added, which overloads the scheduler anyway.
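The arithmetic behind that concern can be made concrete (the numbers below are illustrative assumptions, not figures from the issue):

```python
# Illustrative arithmetic: submitting the outer map in batches does not
# cap scheduler load when each mapped item fans out into its own dask tasks.
images = 1_000              # items passed to .map
subtasks_per_image = 1_000  # dask tasks each image read creates under the hood
outer_batch = 10            # "add to the scheduler in batches of 10"

# distributed-style batch_size: batches are *added* without waiting, so the
# scheduler still ends up holding the full graph.
tasks_added_in_batches = images * subtasks_per_image   # 1,000,000 tasks

# Requested semantics: each outer batch is *fully processed* before the next
# begins, so in-flight work is bounded by one batch's fan-out.
tasks_fully_processed = outer_batch * subtasks_per_image  # 10,000 tasks

print(tasks_added_in_batches, tasks_fully_processed)
```

Under these assumptions, only the "fully process each batch" behavior keeps the number of in-flight tasks bounded; batched submission alone still delivers the entire million-task graph to the scheduler.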

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 12
  • Comments: 10

Top GitHub Comments

5 reactions
joelluijmes commented, Feb 19, 2021

I’m looking for something similar to what’s asked here (batch mapped tasks or nested mapping). I’m wondering if there are any updates on this, @jcrist? I think it would be a very interesting feature for Prefect 🤓

What I’m after is:

  1. Retrieve some dynamic list (i.e. query from database)
  2. Batch the result set in 20ish items to process in parallel
  3. For each item in the batch, do a branch of chained tasks
  4. Wait for batch to complete, repeat for next batch until exhausted

Basically I want to write something like:

# Imports added for context; apply_map must be called inside a flow context
# in Prefect 0.x. dynamic_list_of_tasks, fixed_window, task_1, and task_2
# stand for user-defined tasks.
from prefect import Flow, apply_map

with Flow("batched-processing") as flow:
    tasks = dynamic_list_of_tasks()                      # e.g. query a database
    windowed_tasks = fixed_window(tasks, window_size=5)  # split into batches

    def process_item(item):
        x = task_1(item)
        task_2(x)

    def process_window(window):
        apply_map(process_item, window)

    apply_map(process_window, windowed_tasks)

Or, as proposed above, have map automatically process the list in batches.

Full disclosure: I already asked this in Slack but haven’t gotten a reply yet. Additionally, this has come up earlier.

2 reactions
gregjohnso commented, Jun 29, 2020

+1, this would be pretty valuable for when I want to map over an iterable of 300k+ items.
