map and filter not working properly in multiprocessing with the new release 2.6.0
Describe the bug
When map is used on a dataset with more than one process, a subsequent filter behaves strangely: it looks as if only the samples from one worker are retrieved, and one needs to specify the same num_proc in filter for it to work properly. This doesn't happen with datasets version 2.5.2.
In the code below, the data is filtered differently when we increase the num_proc used in map, although the datasets before and after mapping have identical elements.
Steps to reproduce the bug
import datasets
from datasets import load_dataset


def preprocess(example):
    # Identity function: mapping should leave every row unchanged
    return example


ds = load_dataset("codeparrot/codeparrot-clean-valid", split="train").select([i for i in range(10)])
ds1 = ds.map(preprocess, num_proc=2)  # mapped with two processes
ds2 = ds.map(preprocess)              # mapped in a single process

# the datasets elements are the same
for i in range(len(ds1)):
    assert ds1[i] == ds2[i]

print(f'Target column before filtering {ds1["autogenerated"]}')
print(f'Target column before filtering {ds2["autogenerated"]}')
print(f"datasets version {datasets.__version__}")

ds_filtered_1 = ds1.filter(lambda x: not x["autogenerated"])
ds_filtered_2 = ds2.filter(lambda x: not x["autogenerated"])

# all elements in Target column are false so they should all be kept,
# but for ds1 (mapped with num_proc=2) only the first 5 = num_samples/num_proc are kept
print(ds_filtered_1)
print(ds_filtered_2)
Target column before filtering [False, False, False, False, False, False, False, False, False, False]
Target column before filtering [False, False, False, False, False, False, False, False, False, False]
Dataset({
    features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
    num_rows: 5
})
Dataset({
    features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
    num_rows: 10
})
Expected results
Increasing num_proc in map shouldn't alter the result of a later filter. With the previous version, 2.5.2, this doesn't happen.
Actual results
Filtering drops rows it should keep when num_proc is increased in map but not passed to filter.
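A quick way to tell whether a given filter result is affected is to compare its row count with a count computed directly from the column values. The sketch below uses plain Python only and plugs in the numbers printed in this report; the helper names `expected_filter_count` and `missing_rows` are made up for illustration and are not part of the datasets library.

```python
def expected_filter_count(column_values, predicate):
    """Count the rows a correct filter would keep, using plain Python only."""
    return sum(1 for value in column_values if predicate(value))


def missing_rows(actual_count, column_values, predicate):
    """Return how many rows the observed filter result dropped unexpectedly."""
    return expected_filter_count(column_values, predicate) - actual_count


# Column values as printed in the report: ten rows, all False.
autogenerated = [False] * 10
keep = lambda flag: not flag  # same condition as the filter in the report

# Row counts taken from the two Dataset printouts in the report.
print(missing_rows(5, autogenerated, keep))   # 5
print(missing_rows(10, autogenerated, keep))  # 0
```

With a real dataset, the inputs would be ds1["autogenerated"] and len(ds_filtered_1); a non-zero result means the filter lost rows.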
Environment info
- datasets version: 2.6.0
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
- Python version: 3.9.13
- PyArrow version: 8.0.0
- Pandas version: 1.4.2
Issue Analytics
- State:
- Created a year ago
- Reactions:1
- Comments:14 (10 by maintainers)
Top GitHub Comments
Doing a patch release asap 😃
It works now, thanks!