Map is slow and processes batches one after another
Describe the bug
I have a bug whose cause I can’t figure out. The code works as expected on a small subset of my dataset (2000 samples) on my local machine, but when I execute the same code on a larger dataset (1.4 million samples), the problem occurs. That’s why I can’t give exact steps to reproduce, I’m sorry.
I process a large dataset in two steps. First I call map on a dataset I load from disk and create a new dataset from it. This works as expected, and map uses all the workers I started it with. Then I process the dataset created by the first step, again with map, which is really slow and starts only one or two processes at a time. The number of processes is the same for both steps.
Pseudocode:
import datasets
ds = datasets.load_from_disk("path")
new_dataset = ds.map(work, batched=True, ...)  # fast, uses all processes
final_dataset = new_dataset.map(work2, batched=True, ...)  # slow, starts one process after another
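One way to check whether the second stage really handles batches one process at a time is to log which process ran each batch and when. The sketch below is not part of the original report: `log_batches`, `summarize`, and the log file name are hypothetical helpers that use only the standard library. It assumes the wrapped function can still be serialized and sent to the worker processes, which has not been verified here.

```python
import functools
import os
import time


def log_batches(fn, log_path):
    """Wrap a batched function so every call appends 'pid<TAB>start<TAB>seconds' to log_path."""

    @functools.wraps(fn)
    def wrapper(batch, *args, **kwargs):
        start = time.time()
        result = fn(batch, *args, **kwargs)
        elapsed = time.time() - start
        # Open in append mode on every call so separate worker processes
        # can write to the same file without sharing a file handle.
        with open(log_path, "a") as f:
            f.write(f"{os.getpid()}\t{start:.3f}\t{elapsed:.3f}\n")
        return result

    return wrapper


def summarize(log_path):
    """Return {pid: number_of_batches} from a log written by log_batches."""
    counts = {}
    with open(log_path) as f:
        for line in f:
            pid = int(line.split("\t")[0])
            counts[pid] = counts.get(pid, 0) + 1
    return counts
```

With this in place, the second stage could be called as new_dataset.map(log_batches(work2, "stage2.log"), batched=True, ...). If the start times in the log barely overlap across process IDs, the batches are effectively being processed serially.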
Expected results
Second stage should be as fast as the first stage.
Versions
- Datasets: 1.5.0
- Python: 3.8.8 (default, Feb 24 2021, 21:46:12)
- Platform: Linux-5.4.0-60-generic-x86_64-with-glibc2.10
Do you guys have any idea? Thanks a lot!
Issue Analytics
- Created 2 years ago
- Comments: 5 (3 by maintainers)
Upgrading to 1.6.1 solved my problem somehow. I did not change any of my code, but now it starts all processes at around the same time.
Nice! I’m glad this works now. Closing for now, but feel free to re-open if you experience this issue again.
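For anyone hitting the same symptom, a quick first check is whether the installed version is older than the one that worked for the reporter. The sketch below assumes 1.6.1 as the threshold purely because of the comment above; the thread does not identify which change fixed the behaviour. The helper names are hypothetical, and only the standard library (Python 3.8+) is used.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '1.6.1' (or '1.6.1.dev0') into a tuple of its first three integer parts."""
    return tuple(int(p) for p in re.findall(r"\d+", text)[:3])


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_older_than(current, threshold):
    """True if version string `current` sorts before `threshold`."""
    return parse_version(current) < parse_version(threshold)


if __name__ == "__main__":
    version = installed_version("datasets")
    if version is None:
        print("datasets is not installed")
    elif is_older_than(version, "1.6.1"):
        print(f"datasets {version} is older than 1.6.1; upgrading may help")
    else:
        print(f"datasets {version} is at least 1.6.1")
```

This comparison only looks at the numeric release parts, so pre-release or development suffixes are ignored.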