Map is slow and processes batches one after another

Describe the bug

I have a bug that I can’t pin down. The code works as expected on a small subset of my dataset (2,000 samples) on my local machine, but the problem occurs when I run the same code on the larger dataset (1.4 million samples). That’s why I can’t give exact steps to reproduce, sorry.

I process a large dataset in two steps. First I call map on a dataset loaded from disk and create a new dataset from it. This works as expected, and map uses all the workers I started it with. Then I process the dataset created by the first step, again with map. This second step is really slow and starts only one or two processes at a time. The number of processes is the same for both steps.

pseudo code:

ds = datasets.load_from_disk("path")
new_dataset = ds.map(work, batched=True, ...)  # fast, uses all processes
final_dataset = new_dataset.map(work2, batched=True, ...)  # slow, starts one process after another
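
One way to check whether the second stage really runs its batches one after another is to log which process handles each batch and when. The following is a minimal sketch, not something from the original issue: `log_worker` is a hypothetical helper, and the `work2` shown here is only a stand-in that assumes a "text" column.

```python
import os
import sys
import time
from functools import wraps


def log_worker(fn):
    """Wrap a batched map function so each call reports its process and timing."""
    @wraps(fn)
    def wrapper(batch, *args, **kwargs):
        started = time.time()
        result = fn(batch, *args, **kwargs)
        # One line per batch: which process handled it, when it started, how long it took.
        print(
            f"pid={os.getpid()} started={started:.1f} "
            f"took={time.time() - started:.2f}s",
            file=sys.stderr,
            flush=True,
        )
        return result
    return wrapper


# Hypothetical stand-in for the issue's `work2`; it only upper-cases a "text" column.
def work2(batch):
    return {"text": [t.upper() for t in batch["text"]]}


logged_work2 = log_worker(work2)
out = logged_work2({"text": ["a", "b"]})
```

The wrapped function could then be passed to the second stage in place of `work2`, for example `new_dataset.map(log_worker(work2), batched=True, ...)`. If the log shows only one or two distinct PIDs active at any moment, the workers are being started sequentially rather than `work2` itself being slow.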

Expected results

Second stage should be as fast as the first stage.

Versions

  • Datasets: 1.5.0
  • Python: 3.8.8 (default, Feb 24 2021, 21:46:12)
  • Platform: Linux-5.4.0-60-generic-x86_64-with-glibc2.10

Do you guys have any idea? Thanks a lot!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

villmow commented, Apr 30, 2021 (1 reaction)

Upgrading to 1.6.1 solved my problem somehow. I did not change any of my code, but now all processes start at around the same time.
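
Since the upgrade is the only change reported to fix the problem, a pipeline could guard against running on an older release. This is a minimal sketch that assumes plain numeric dotted version strings (no suffixes such as "dev0"); the helper names are hypothetical, and the 1.6.1 threshold is simply the version mentioned in the comment above, not a confirmed fix boundary.

```python
# Version reported in this thread as no longer showing the slowdown (an assumption,
# not a documented fix boundary).
MIN_REPORTED_OK = (1, 6, 1)


def parse_version(text):
    """Turn a plain dotted version such as '1.6.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_reported_ok(version_text):
    """Return True if the version is at least the one reported to work."""
    return parse_version(version_text) >= MIN_REPORTED_OK
```

In practice the string to check would come from `datasets.__version__`, for example `is_reported_ok(datasets.__version__)`. Comparing tuples of integers avoids the string-comparison trap where "1.10.0" sorts before "1.6.1".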

lhoestq commented, May 3, 2021 (0 reactions)

Nice! I’m glad this works now. Closing for now, but feel free to re-open if you experience this issue again.
