multiprocessing in dataset map "can only test a child process"
Using a dataset with a single 'text' field and a fast tokenizer in a Jupyter notebook:
def tokenizer_fn(example):
    return tokenizer.batch_encode_plus(example['text'])

ds_tokenized = text_dataset.map(tokenizer_fn, batched=True, num_proc=6, remove_columns=['text'])
---------------------------------------------------------------------------
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/multiprocess/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 156, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/datasets/fingerprint.py", line 163, in wrapper
out = func(self, *args, **kwargs)
File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1510, in _map_single
for i in pbar:
File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/notebook.py", line 228, in __iter__
for obj in super(tqdm_notebook, self).__iter__(*args, **kwargs):
File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/std.py", line 1186, in __iter__
self.close()
File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/notebook.py", line 251, in close
super(tqdm_notebook, self).close(*args, **kwargs)
File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/std.py", line 1291, in close
fp_write('')
File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/std.py", line 1288, in fp_write
self.fp.write(_unicode(s))
File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/lib/redirect.py", line 91, in new_write
cb(name, data)
File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/wandb_run.py", line 598, in _console_callback
self._backend.interface.publish_output(name, data)
File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 146, in publish_output
self._publish_output(o)
File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 151, in _publish_output
self._publish(rec)
File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 431, in _publish
if self._process and not self._process.is_alive():
File "/usr/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
"""
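The final frames of the traceback show `Process.is_alive()` being called from a worker process on a `Process` object that the worker did not create. The following is a minimal sketch of that failure mode using only the standard library; it does not involve datasets, tqdm, or wandb. The function names (`sleeper`, `probe`, `reproduce`) are made up for illustration, and the `fork` start method is an assumption (POSIX only), chosen so that the child inherits the parent's `Process` object the way a forked worker would.

```python
import multiprocessing as mp
import time


def sleeper():
    # Stand-in for a long-lived background process owned by the parent.
    time.sleep(0.5)


def probe(proc, queue):
    # Runs in a sibling process: `proc` was created by the parent, not by us,
    # so is_alive() is expected to trip the parent-pid assertion.
    try:
        proc.is_alive()
        queue.put("no error")
    except AssertionError as exc:
        queue.put(str(exc))


def reproduce():
    ctx = mp.get_context("fork")  # assumption: a platform that supports fork
    queue = ctx.Queue()
    target = ctx.Process(target=sleeper)
    target.start()
    # With fork, `target` is inherited by the prober rather than pickled.
    prober = ctx.Process(target=probe, args=(target, queue))
    prober.start()
    message = queue.get(timeout=10)
    prober.join()
    target.join()
    return message


if __name__ == "__main__":
    print(reproduce())
```

If this matches what happens in the report, any object that holds a `Process` handle and is inherited by a forked worker can raise the same error as soon as the worker calls `is_alive()` on it.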
Issue Analytics
- State:
- Created 3 years ago
- Reactions:1
- Comments:9 (6 by maintainers)
It looks like an issue with wandb/tqdm here. We're using the multiprocess library instead of the multiprocessing builtin Python package to support various types of mapping functions. Maybe there's some sort of incompatibility.

Could you make a minimal script to reproduce, or a Google Colab?
I'm having a similar issue, but when I try to do multiprocessing with the DataLoader.
Code to reproduce:
As a workaround, I have commented out lines 456 and 457 in
/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py
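A less invasive alternative to commenting out library code might be to guard the liveness check so it only runs in the process that created the handle. The sketch below illustrates the idea in isolation; `is_alive_if_owner` and `owner_pid` are hypothetical names, and this is not wandb's actual code or an official fix.

```python
import multiprocessing as mp
import os


def is_alive_if_owner(proc, owner_pid):
    """Call proc.is_alive() only from the process that created `proc`.

    Returns None when called from any other process, where is_alive()
    would raise "can only test a child process".
    """
    if os.getpid() != owner_pid:
        return None
    return proc.is_alive()


if __name__ == "__main__":
    owner_pid = os.getpid()         # recorded where the handle is created
    proc = mp.Process(target=print)  # never started in this sketch
    print(is_alive_if_owner(proc, owner_pid))
```

The trade-off is that a worker can no longer tell whether the background process died; it simply skips the check.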