Map is slow and processes batches one after another

Describe the bug

I have a bug that I can't pin down. The code works as expected on a small subset of my dataset (2,000 samples) on my local machine, but when I run the same code on a larger dataset (1.4 million samples), the problem occurs. That's why I can't give exact steps to reproduce, sorry.

I process a large dataset in two steps. First, I call map on a dataset that I load from disk and create a new dataset from it. This works as expected, and map uses all the workers I started it with. Then I process the dataset created by the first step, again with map. This second step is really slow and starts only one or two processes at a time. The number of processes is the same for both steps.

Pseudocode:

import datasets

ds = datasets.load_from_disk("path")
new_dataset = ds.map(work, batched=True, ...)  # fast: uses all processes
final_dataset = new_dataset.map(work2, batched=True, ...)  # slow: starts one process after another
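For readers who want something closer to runnable, here is a minimal sketch of the two-stage pattern described above. The bodies of `work` and `work2`, the `"text"` column, and the `run_two_stage` helper are assumptions made for illustration; the issue does not show the real functions or the arguments elided by `...`. The only library calls used are `datasets.load_from_disk` and `Dataset.map` with `batched=True` and `num_proc`.

```python
from typing import Dict, List


def work(batch: Dict[str, List[str]]) -> Dict[str, List[int]]:
    """Hypothetical first stage: compute the length of each text in the batch."""
    # With batched=True, map passes a dict of column name -> list of values.
    return {"length": [len(text) for text in batch["text"]]}


def work2(batch: Dict[str, List[int]]) -> Dict[str, List[bool]]:
    """Hypothetical second stage: flag long samples using the first stage's column."""
    return {"is_long": [length > 10 for length in batch["length"]]}


def run_two_stage(path: str, num_proc: int = 4):
    """Run both stages with the same number of worker processes (assumed helper)."""
    import datasets  # imported lazily so the pure functions above can be tested alone

    ds = datasets.load_from_disk(path)
    # Both calls request the same number of workers, as in the report.
    new_dataset = ds.map(work, batched=True, num_proc=num_proc)
    final_dataset = new_dataset.map(work2, batched=True, num_proc=num_proc)
    return final_dataset
```

Calling `run_two_stage("path", num_proc=8)` would mirror the reported setup. Keeping the per-batch functions free of library imports makes it easy to check them on a small in-memory batch before running on 1.4 million samples.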

Expected results

Second stage should be as fast as the first stage.

Versions

Paste the output of the following code:

Datasets: 1.5.0
Python: 3.8.8 (default, Feb 24 2021, 21:46:12)
Platform: Linux-5.4.0-60-generic-x86_64-with-glibc2.10
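The snippet that produced the output above is not included in this copy of the issue. A sketch that gathers the same three fields might look like the following; the `environment_report` helper is a hypothetical name, and reading the version through `importlib.metadata` is an assumption rather than what the issue template actually used.

```python
import platform
import sys
from importlib import metadata


def environment_report() -> dict:
    """Collect the three fields shown in the Versions section."""
    try:
        # Looks up the installed distribution without importing the library.
        datasets_version = metadata.version("datasets")
    except metadata.PackageNotFoundError:
        datasets_version = "not installed"
    return {
        "Datasets": datasets_version,
        "Python": sys.version,
        "Platform": platform.platform(),
    }


for name, value in environment_report().items():
    print(f"{name}: {value}")
```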

Do you guys have any idea? Thanks a lot!

Issue Analytics

Created 2 years ago
Comments: 5 (3 by maintainers)

Top GitHub Comments

villmow commented, Apr 30, 2021

Upgrading to 1.6.1 solved my problem somehow. I did not change any of my code, but now it starts all processes at around the same time.
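If you hit the same symptom, a quick first check is whether your installed version is at least 1.6.1, the version that resolved it here. The sketch below compares version strings numerically; `version_tuple` and `is_at_least` are hypothetical helper names, and the thread does not identify which change between 1.5.0 and 1.6.1 made the difference.

```python
def version_tuple(version: str) -> tuple:
    """Turn 'major.minor.patch' into a tuple of ints, ignoring suffixes like 'dev0'."""
    numbers = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        numbers.append(int(digits or 0))
    return tuple(numbers)


def is_at_least(installed: str, minimum: str = "1.6.1") -> bool:
    """Return True if the installed version is at or above the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


print(is_at_least("1.5.0"))  # the version from the report
print(is_at_least("1.6.1"))  # the version that worked
```

Comparing integer tuples instead of raw strings avoids ordering mistakes such as treating "1.10.0" as older than "1.6.1".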

lhoestq commented, May 3, 2021

Nice! I’m glad this works now. Closing for now, but feel free to re-open if you experience this issue again.
