RuntimeError: received 0 items of ancdata
A "stochastic" issue that appears at some point during training: training starts okay and runs for a number of epochs, then this error often happens with PyTorch Lightning (my setup is still quite close to the "Build a segmentation workflow (with PyTorch Lightning)" tutorial), and it is probably propagating from PyTorch code itself (e.g. https://github.com/fastai/fastai/issues/23):
reduction.py", line 161, in recvfds
len(ancdata))
RuntimeError: received 0 items of ancdata
TypeError: 'NoneType' object is not iterable
I first thought this was caused by the CacheDataset, since it is quite RAM-intensive:
train_ds = CacheDataset(data=datalist_train, transform=train_trans, cache_rate=1, num_workers=4)
val_ds = CacheDataset(data=datalist_val, transform=val_trans, cache_rate=1, num_workers=4)
but the same behavior also occurs with the vanilla Dataset
train_ds = Dataset(data=datalist_train, transform=train_trans)
val_ds = Dataset(data=datalist_val, transform=val_trans)
with the same transform pipeline (train_trans / val_trans) in both cases.
I guess this depends on the environment the code is run in, but do you have any ideas how to get rid of it?
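For reference, this is roughly how the datasets end up in DataLoaders; the batch_size and num_workers values below are placeholders rather than my exact configuration, and setting num_workers=0 is the quickest way to check whether the worker processes are to blame, since the ancdata error comes from passing file descriptors between them:
from torch.utils.data import DataLoader

# Placeholder loader setup -- batch_size / num_workers are illustrative only.
# num_workers=0 keeps loading in the main process, which sidesteps the
# file-descriptor passing that triggers "received 0 items of ancdata".
train_loader = DataLoader(train_ds, batch_size=2, shuffle=True, num_workers=0)
val_loader = DataLoader(val_ds, batch_size=1, shuffle=False, num_workers=0)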
Environment and full trace:
MONAI version: 0.2.0
Python version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) [GCC 7.3.0]
Numpy version: 1.18.5
Pytorch version: 1.5.0
Optional dependencies:
Pytorch Ignite version: 0.3.0
Nibabel version: 3.1.0
scikit-image version: 0.17.2
Pillow version: 7.1.2
Tensorboard version: 2.2.2
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
-------------------------------------------
0 | _model | UNet | 4 M
1 | loss_function | DiceLoss | 0
Validation sanity check: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.89s/it]
Validation sanity check: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.89s/it]
current epoch: 0 current mean loss: 0.6968 best mean loss: 0.6968 (best dice at that loss 0.0061) at epoch 0
Epoch 1: 100%|████████████████████████████████████████████████████████████████████████| 222/222 [08:21<00:00, 2.26s/it, loss=0.604, v_num=0]
...
Epoch 41: 83%|████████████████████████████████████████████████████████████            | 184/222 [06:40<01:22, 2.18s/it, loss=0.281, v_num=0]
Traceback (most recent call last):
trainer.fit(net)
site-packages/pytorch_lightning/trainer/trainer.py", line 918, in fit
self.single_gpu_train(model)
site-packages/pytorch_lightning/trainer/distrib_parts.py", line 176, in single_gpu_train
self.run_pretrain_routine(model)
site-packages/pytorch_lightning/trainer/trainer.py", line 1093, in run_pretrain_routine
self.train()
site-packages/pytorch_lightning/trainer/training_loop.py", line 375, in train
self.run_training_epoch()
site-packages/pytorch_lightning/trainer/training_loop.py", line 445, in run_training_epoch
enumerate(_with_is_last(train_dataloader)), "get_train_batch"
site-packages/pytorch_lightning/profiler/profilers.py", line 64, in profile_iterable
value = next(iterator)
site-packages/pytorch_lightning/trainer/training_loop.py", line 844, in _with_is_last
for val in it:
site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
idx, data = self._get_data()
site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
success, data = self._try_get_data()
site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/petteri/anaconda3/lib/python3.6/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
site-packages/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
fd = df.detach()
File "/home/petteri/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/home/petteri/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
return recvfds(s, 1)[0]
File "/home/petteri/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 161, in recvfds
len(ancdata))
RuntimeError: received 0 items of ancdata
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
site-packages/tqdm/std.py", line 1086, in __del__
site-packages/tqdm/std.py", line 1293, in close
site-packages/tqdm/std.py", line 1471, in display
site-packages/tqdm/std.py", line 1089, in __repr__
site-packages/tqdm/std.py", line 1433, in format_dict
TypeError: 'NoneType' object is not iterable
Top Results From Across the Web
RuntimeError: received 0 items of ancdata - PyTorch Forums
How to solve it? Training stops due to "Caught RuntimeError in DataLoader worker process 0" with a large dataset of files.
How to resolve the error: RuntimeError: received 0 items of ...
I have a torch.utils.data.DataLoader, created with the following code: transform_train = transforms.Compose([transforms.RandomCrop( ...
RuntimeError: received 0 items of ancdata - Part 1 (2019)
I have a trained learner which I'm trying to use to make predictions on a validation set of 100,000 samples, via ...
[Pytorch] RuntimeError: received 0 items of ancdata - how to fix (translated from Korean)
Check the limits with ulimit -a: core file size (blocks, -c) 0, data seg size (kbytes, ...
RuntimeError: received 0 items of ancdata - CSDN blog (translated from Chinese)
Fix 1: pool = torch.multiprocessing.Pool(torch.multiprocessing.cpu_count(), maxtasksperchild=1). Fix 2: switch the multiprocessing tensor sharing strategy to file_system (the default is ...
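The file_system sharing strategy mentioned in the last result above is a real torch.multiprocessing option for exactly this failure mode; a minimal sketch of enabling it, to be run once before any DataLoader workers are spawned:
import torch.multiprocessing

# Share tensors via the file system instead of passing file descriptors
# between processes; this avoids exhausting the open-file limit that
# produces "received 0 items of ancdata" in DataLoader workers.
torch.multiprocessing.set_sharing_strategy('file_system')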
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks, it solved my problem.
@Nic-Ma: I ran into this error again when trying a custom loss outside MONAI and adding one more volume to the dataloader, which gives the CPUs more to compute; I could not get past the first epoch. Apparently it is the multiprocessing that causes the headaches, which makes the problem somewhat local to my machine and hard to reproduce.
I tried some of the fixes from https://github.com/pytorch/pytorch/issues/973 and got at least past the first epoch; I will see how robust these workarounds are (one of them is sketched after the links below):
https://github.com/pytorch/pytorch/issues/973#issuecomment-604473515
https://github.com/pytorch/pytorch/issues/973#issuecomment-310397437
https://github.com/pytorch/pytorch/issues/973#issuecomment-346405667
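The exact code behind those comment links is not reproduced here; as an illustration of the other kind of workaround discussed in that thread, this is a minimal sketch of raising the per-process open-file soft limit from Python (the value 4096 is an arbitrary example and must not exceed the hard limit reported by getrlimit):
import resource

# Raise the soft limit on open file descriptors; shared-memory handles used
# by DataLoader workers count against this limit. 4096 is an illustrative
# value and must stay at or below the hard limit.
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (4096, hard_limit))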
@sampathweb had this suggestion for debugging (https://github.com/pytorch/pytorch/issues/973#issuecomment-345089046): if the core devs want to see the error, just reduce the ulimit to 1024, run the code of @kamo-naoyuki above, and you might see the same problem.