map and filter not working properly in multiprocessing with the new release 2.6.0

Describe the bug

When map is run on a dataset with more than one process, a subsequent filter behaves strangely: it looks as if only the samples from one worker are retrieved. One needs to specify the same num_proc in filter for it to work properly. This doesn’t happen with datasets version 2.5.2.

In the code below, the data is filtered differently when we increase the num_proc used in map, although the datasets mapped with and without multiprocessing have identical elements.
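
The workaround mentioned above (passing the same num_proc to filter as to map) could be wrapped as follows. This is a sketch of the workaround, not a fix: map_then_filter is a hypothetical helper, and ds is assumed to be a datasets.Dataset, or any object whose map and filter methods accept a num_proc keyword.

```python
def map_then_filter(ds, map_fn, keep_fn, num_proc=2):
    """Map and then filter a dataset using the same worker count for both.

    Hypothetical helper: `ds` is assumed to behave like a datasets.Dataset,
    i.e. to expose `map` and `filter` methods that take `num_proc`.
    """
    # Run the mapping step across `num_proc` worker processes.
    mapped = ds.map(map_fn, num_proc=num_proc)
    # Reuse the same `num_proc` for filter, as the report describes.
    return mapped.filter(keep_fn, num_proc=num_proc)
```

With the reproduction script below, this would be called as map_then_filter(ds, preprocess, lambda x: not x["autogenerated"]). According to the report, matching num_proc in both calls keeps filtering consistent on 2.6.0.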

Steps to reproduce the bug

import datasets
from datasets import load_dataset

def preprocess(example):
    return example

ds = load_dataset("codeparrot/codeparrot-clean-valid", split="train").select(range(10))
ds1 = ds.map(preprocess, num_proc=2)
ds2 = ds.map(preprocess)

# the datasets elements are the same
for i in range(len(ds1)):
    assert ds1[i]==ds2[i]

print(f'Target column before filtering {ds1["autogenerated"]}')
print(f'Target column before filtering {ds2["autogenerated"]}')
print(f"datasets version {datasets.__version__}")

ds_filtered_1 = ds1.filter(lambda x: not x["autogenerated"])
ds_filtered_2 = ds2.filter(lambda x: not x["autogenerated"])

# all elements in the target column are False, so they should all be kept,
# but for ds1 (mapped with num_proc=2) only the first 5 = num_samples / num_proc are kept
print(ds_filtered_1)
print(ds_filtered_2)

Target column before filtering [False, False, False, False, False, False, False, False, False, False]
Target column before filtering [False, False, False, False, False, False, False, False, False, False]

Dataset({
    features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
    num_rows: 5
})
Dataset({
    features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
    num_rows: 10
})

Expected results

Increasing num_proc in map shouldn’t alter the result of a later filter. With the previous version, 2.5.2, this doesn’t happen.

Actual results

Filtering drops rows when num_proc is increased in map but the same num_proc is not also passed to filter.

Environment info

datasets version: 2.6.0
Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
Python version: 3.9.13
PyArrow version: 8.0.0
Pandas version: 1.4.2
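
Since the report ties the behavior to datasets 2.6.0 specifically, a quick check of the installed version can show whether an environment matches the one listed above. A minimal sketch using only the standard library: the names is_affected and check_installed_datasets are hypothetical, and the check flags only the exact release named in this report rather than assuming which later release carries the fix.

```python
from importlib.metadata import PackageNotFoundError, version

# The only release this report names as affected.
AFFECTED_VERSION = "2.6.0"


def is_affected(installed: str) -> bool:
    """Return True if the version string is exactly the release named in the report."""
    return installed.strip() == AFFECTED_VERSION


def check_installed_datasets() -> str:
    """Describe whether the locally installed datasets package matches the report."""
    try:
        installed = version("datasets")
    except PackageNotFoundError:
        return "datasets is not installed"
    if is_affected(installed):
        return f"datasets {installed}: matches the release named in this report"
    return f"datasets {installed}: not the release named in this report"


if __name__ == "__main__":
    print(check_installed_datasets())
```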

Top GitHub Comments

lhoestq commented, Oct 14, 2022 (3 reactions)

Doing a patch release asap 😃

loubnabnl commented, Oct 17, 2022 (1 reaction)

It works now, thanks!
