multiprocessing in dataset map "can only test a child process"

I'm using a dataset with a single 'text' field and a fast tokenizer in a Jupyter notebook. Calling map with num_proc=6 fails with the traceback below.

def tokenizer_fn(example):
    return tokenizer.batch_encode_plus(example['text'])

ds_tokenized = text_dataset.map(tokenizer_fn, batched=True, num_proc=6, remove_columns=['text'])
---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/multiprocess/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 156, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/datasets/fingerprint.py", line 163, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1510, in _map_single
    for i in pbar:
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/notebook.py", line 228, in __iter__
    for obj in super(tqdm_notebook, self).__iter__(*args, **kwargs):
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/std.py", line 1186, in __iter__
    self.close()
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/notebook.py", line 251, in close
    super(tqdm_notebook, self).close(*args, **kwargs)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/std.py", line 1291, in close
    fp_write('')
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/std.py", line 1288, in fp_write
    self.fp.write(_unicode(s))
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/lib/redirect.py", line 91, in new_write
    cb(name, data)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/wandb_run.py", line 598, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 146, in publish_output
    self._publish_output(o)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 151, in _publish_output
    self._publish(rec)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 431, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
"""
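
Read bottom-up, the traceback suggests the mechanism: the worker's tqdm output goes through wandb's redirected write, which ends up calling is_alive() on a process handle that was created in the parent. The standard library only allows a process to poll its own children, hence the assertion. A minimal sketch that triggers the same assertion without datasets or wandb, assuming a POSIX platform where the "fork" start method is available (probe and reproduce are hypothetical names used only for this illustration):

```python
import multiprocessing as mp
import time


def probe(handle, results):
    """Runs in a forked child and polls a Process handle it does not own."""
    try:
        handle.is_alive()
        results.put("ok")
    except AssertionError as exc:
        results.put(str(exc))


def reproduce():
    # Assumption: "fork" is available, so the child inherits parent objects as-is.
    ctx = mp.get_context("fork")
    results = ctx.Queue()

    # Stand-in for the background process whose handle wandb keeps in the parent.
    helper = ctx.Process(target=time.sleep, args=(5,))
    helper.start()

    # Stand-in for a map()/DataLoader worker that inherits that handle.
    worker = ctx.Process(target=probe, args=(helper, results))
    worker.start()
    message = results.get(timeout=10)
    worker.join()

    helper.terminate()
    helper.join()
    return message


if __name__ == "__main__":
    print(reproduce())  # prints: can only test a child process
```

The same call succeeds in the parent, because the parent's PID matches the PID recorded on the handle when it was created.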

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

4 reactions
lhoestq commented, Nov 13, 2020

It looks like an issue with wandb/tqdm here. We use the multiprocess library rather than Python's built-in multiprocessing package so that we can support more kinds of mapping functions. There may be an incompatibility between the two.

Could you share a minimal script or a Google Colab notebook that reproduces it?

1 reaction
gaceladri commented, Feb 1, 2021

I'm having a similar issue, but it occurs when I use multiprocessing with the DataLoader.

Code to reproduce:

from datasets import load_dataset

book_corpus = load_dataset('bookcorpus', 'plain_text', cache_dir='/home/ad/Desktop/bookcorpus', split='train[:1%]')
book_corpus = book_corpus.map(encode, batched=True, num_proc=20, load_from_cache_file=True, batch_size=5000)
book_corpus.set_format(type='torch', columns=['text', "input_ids", "attention_mask", "token_type_ids"])

from transformers import DataCollatorForWholeWordMask
from transformers import Trainer, TrainingArguments

data_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./mobile_linear_att_8L_128_128_03layerdrop_shared",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=50,
    save_total_limit=2,
    logging_first_step=True,
    warmup_steps=100,
    logging_steps=50,
    gradient_accumulation_steps=1,
    fp16=True,
    dataloader_num_workers=10,  # DataLoader worker processes are where the assertion fires
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=book_corpus,
    tokenizer=tokenizer)

trainer.train()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<timed eval> in <module>

~/anaconda3/envs/tfm/lib/python3.6/site-packages/transformers/trainer.py in train(self, model_path, trial)
    869             self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
    870 
--> 871             for step, inputs in enumerate(epoch_iterator):
    872 
    873                 # Skip past any already trained steps if resuming training

~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    433         if self._sampler_iter is None:
    434             self._reset()
--> 435         data = self._next_data()
    436         self._num_yielded += 1
    437         if self._dataset_kind == _DatasetKind.Iterable and \

~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _next_data(self)
   1083             else:
   1084                 del self._task_info[idx]
-> 1085                 return self._process_data(data)
   1086 
   1087     def _try_put_index(self):

~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_data(self, data)
   1109         self._try_put_index()
   1110         if isinstance(data, ExceptionWrapper):
-> 1111             data.reraise()
   1112         return data
   1113 

~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/_utils.py in reraise(self)
    426             # have message field
    427             raise self.exc_type(message=msg)
--> 428         raise self.exc_type(msg)
    429 
    430 

AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1087, in __getitem__
    format_kwargs=self._format_kwargs,
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1074, in _getitem
    format_kwargs=format_kwargs,
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 890, in _convert_outputs
    v = map_nested(command, v, **map_nested_kwargs)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/utils/py_utils.py", line 225, in map_nested
    return function(data_struct)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 851, in command
    return torch.tensor(x, **format_kwargs)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/warnings.py", line 101, in _showwarnmsg
    _showwarnmsg_impl(msg)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/warnings.py", line 30, in _showwarnmsg_impl
    file.write(text)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
    cb(name, data)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/wandb_run.py", line 723, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 153, in publish_output
    self._publish_output(o)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 158, in _publish_output
    self._publish(rec)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 456, in _publish
    if self._process and not self._process.is_alive():
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

As a workaround, I commented out lines 456 and 457 in /home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py
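
A less invasive option than editing the installed package may be to stop wandb from capturing console output at all, since both tracebacks reach the assertion through wandb's redirected write. A minimal sketch, assuming wandb's documented WANDB_CONSOLE environment variable; the thread does not confirm that this avoids the error on the versions shown, and disable_wandb_console_capture is a hypothetical helper name:

```python
import os


def disable_wandb_console_capture():
    # Assumption: WANDB_CONSOLE=off asks wandb not to capture stdout/stderr.
    # It has to be set before wandb is initialised in this process.
    os.environ["WANDB_CONSOLE"] = "off"
    return os.environ["WANDB_CONSOLE"]


disable_wandb_console_capture()
```

Run this at the top of the notebook, before creating the Trainer or calling map. If wandb logging is not needed at all, the Trainer integration also honours WANDB_DISABLED=true.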
