map and filter not working properly in multiprocessing with the new release 2.6.0

Describe the bug

When map is run on a dataset with more than one process, a subsequent filter behaves strangely: it appears that only the samples from one worker are retrieved. One needs to pass the same num_proc to filter for it to work properly. This doesn't happen with datasets version 2.5.2.

In the code below, the data is filtered differently when we increase the num_proc used in map, although the datasets before and after mapping have identical elements.

Steps to reproduce the bug

import datasets
from datasets import load_dataset

def preprocess(example):
    return example

ds = load_dataset("codeparrot/codeparrot-clean-valid", split="train").select([i for i in range(10)])
ds1 = ds.map(preprocess, num_proc=2)
ds2 = ds.map(preprocess)

# the datasets elements are the same
for i in range(len(ds1)):
    assert ds1[i]==ds2[i]

print(f'Target column before filtering {ds1["autogenerated"]}')
print(f'Target column before filtering {ds2["autogenerated"]}')
print(f"datasets version {datasets.__version__}")

ds_filtered_1 = ds1.filter(lambda x: not x["autogenerated"])
ds_filtered_2 = ds2.filter(lambda x: not x["autogenerated"])

# all elements in Target column are False so they should all be kept, but for ds1 (mapped with num_proc=2) only the first 5 = num_samples/num_proc are kept
print(ds_filtered_1)
print(ds_filtered_2)

Target column before filtering [False, False, False, False, False, False, False, False, False, False]
Target column before filtering [False, False, False, False, False, False, False, False, False, False]

Dataset({
    features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
    num_rows: 5
})
Dataset({
    features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
    num_rows: 10
})

Expected results

Increasing num_proc in map shouldn't alter the result of filter. With the previous version, 2.5.2, this doesn't happen.
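
The expectation above can be turned into a small check. This is only a sketch: `filter_row_counts` is a hypothetical helper name, not part of datasets, and it assumes `ds` exposes `map(function, num_proc=...)`, `filter(function)` and `len()` the way a `datasets.Dataset` does.

```python
from typing import Any, Callable, Tuple


def filter_row_counts(
    ds: Any,
    transform: Callable[[dict], dict],
    predicate: Callable[[dict], bool],
    num_proc: int = 2,
) -> Tuple[int, int]:
    """Return (rows kept after a single-process map, rows kept after a multi-process map)."""
    # filter is called without num_proc in both cases, as in the report.
    single = ds.map(transform).filter(predicate)
    multi = ds.map(transform, num_proc=num_proc).filter(predicate)
    return len(single), len(multi)
```

With the reproduction script, `filter_row_counts(ds, preprocess, lambda x: not x["autogenerated"])` should return two equal numbers. Going by the output in this report, 2.6.0 would give 10 and 5 instead.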

Actual results

When num_proc is increased in map but not passed to filter, filter keeps only part of the rows (5 of 10 in the example above).
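
Until a fixed release is installed, the workaround mentioned in the description (passing the same num_proc to filter) could be wrapped as below. A minimal sketch: `map_then_filter` is a hypothetical helper, and it assumes `ds` is a `datasets.Dataset` or any object whose `map` and `filter` accept a `num_proc` keyword.

```python
from typing import Any, Callable, Optional


def map_then_filter(
    ds: Any,
    transform: Callable[[dict], dict],
    predicate: Callable[[dict], bool],
    num_proc: Optional[int] = None,
) -> Any:
    """Run map and filter with the same num_proc value."""
    mapped = ds.map(transform, num_proc=num_proc)
    # Per the report, omitting num_proc here on 2.6.0 keeps only the rows
    # handled by one worker, so the same value is forwarded to filter.
    return mapped.filter(predicate, num_proc=num_proc)
```

For the reproduction script this would be called as `map_then_filter(ds, preprocess, lambda x: not x["autogenerated"], num_proc=2)`.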

Environment info

  • datasets version: 2.6.0
  • Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
  • Python version: 3.9.13
  • PyArrow version: 8.0.0
  • Pandas version: 1.4.2

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 14 (10 by maintainers)

Top GitHub Comments

3 reactions
lhoestq commented, Oct 14, 2022

Doing a patch release asap 😃

1 reaction
loubnabnl commented, Oct 17, 2022

It works now, thanks!
