
CacheDataset crashes with runtime_cache=True, num_workers=0 in DDP

This happens because we try to convert the shared cache list (a ListProxy) to a regular list, but in DDP we must keep it as a ListProxy so that the shared items can be accessed from different processes.

The exact error is the traceback below: the ListProxy cannot be converted. But even if the conversion succeeded, we should not do it in DDP, since a ListProxy is still needed.
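
To make the ListProxy point concrete, here is a minimal illustration (an assumption for clarity, not MONAI code) of why converting the shared cache requires a live manager connection:

from multiprocessing import Manager

if __name__ == "__main__":
    # With a runtime cache, the cache is a shared ListProxy rather than a
    # plain Python list (placeholder contents here).
    manager = Manager()
    shared_cache = manager.list([None] * 4)

    # list(shared_cache) round-trips through the manager process that owns
    # the proxy; if that manager is not reachable from the calling process,
    # the call fails with a connection error like the FileNotFoundError in
    # the traceback below.
    plain_copy = list(shared_cache)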

EDIT: expected behavior below:

  • it should not crash
  • we should not call disable_share_memory_cache in DataLoader when running in DDP with runtime_cache, even for num_workers==0, because the ListProxy is still needed (different DDP processes must read/write the same cache indices). Converting ListProxy -> list in that case gives every process its own copy of the cache (potentially going OOM). A sketch of such a guard follows the traceback below.
Traceback (most recent call last):                                                                                                                              
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap                                                               
    fn(i, *args)                                                                                                                                                
  File "/scripts/segmenter.py", line 1032, in run_segmenter_worker                                       
    best_metric = segmenter.run()                                                                                                                               
  File "//scripts/segmenter.py", line 696, in run
    self.train()
  File "/scripts/segmenter.py", line 732, in train
    train_loader = DataLoader(train_ds, batch_size=config["batch_size"], shuffle=(train_sampler is None),  num_workers=config["num_workers"], sampler=train_sampler, pin_memory=True)
  File "/mnt/amproj/Code/MONAI/monai/data/dataloader.py", line 87, in __init__
    dataset.disable_share_memory_cache() 
  File "/MONAI/monai/data/dataset.py", line 855, in disable_share_memory_cache
    self._cache = list(self._cache)
  File "<string>", line 2, in __len__
  File "/usr/lib/python3.8/multiprocessing/managers.py", line 831, in _callmethod
    self._connect()
  File "/usr/lib/python3.8/multiprocessing/managers.py", line 818, in _connect
    conn = self._Client(self._token.address, authkey=self._authkey)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 502, in Client
    c = SocketClient(address)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 630, in SocketClient
    s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory
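
The expected behavior above could be expressed as a guard around the disable_share_memory_cache call. This is only a sketch under stated assumptions (the helper and attribute names are hypothetical, not MONAI's actual API); it shows the condition, not the actual fix:

import torch.distributed as dist

def _can_disable_shared_cache(dataset, num_workers: int) -> bool:
    # Keep the shared ListProxy when running under DDP with a runtime cache,
    # even for num_workers == 0, so every rank reads/writes the same cache
    # entries instead of materialising a private copy (which may go OOM).
    in_ddp = dist.is_available() and dist.is_initialized()
    uses_runtime_cache = bool(getattr(dataset, "runtime_cache", False))  # hypothetical attribute name
    return num_workers == 0 and not (in_ddp and uses_runtime_cache)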

Issue Analytics

  • State: closed
  • Created 10 months ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
Nic-Ma commented, Nov 30, 2022

Hi @myron, @wyli,

Sorry for the late response, I was busy with other projects. I personally feel the sleep logic in @wyli's PR is not very robust; the "0.5s" wait is a bit of a hack:

if torch.distributed.get_rank() == 0:
    time.sleep(0.5)  # rank 0 should exit after all the other ranks, as the cache was broadcast from rank 0

If we don't have any better solution, I would vote for @myron's PR to drop the gpu-cache + runtime-cache + DDP feature: we lose the feature but make the overall logic more robust. What do you think?

Thanks.
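
For context only (this is not something proposed in the thread, just a hedged sketch of the kind of synchronization the fixed sleep approximates): an explicit barrier before teardown ensures no other rank is still using the shared cache when rank 0 shuts down.

import torch.distributed as dist

def finish_rank():
    if dist.is_available() and dist.is_initialized():
        # All ranks wait here after their last access to the shared cache,
        # so rank 0 does not tear it down while another rank may still read it.
        dist.barrier()
        dist.destroy_process_group()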

0 reactions
myron commented, Nov 28, 2022

@Nic-Ma you decide please. I only need the shared cache working with num_workers==0 in DDP.
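
To make the requested configuration concrete, here is a minimal per-rank sketch with a shared runtime cache and num_workers=0 (the transforms, data dicts, and launcher environment variables are placeholder assumptions):

import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler
from monai.data import CacheDataset, DataLoader
from monai.transforms import Compose, LoadImaged, EnsureChannelFirstd

def build_train_loader(data_dicts, batch_size=2):
    # Assumes the DDP launcher has already set MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE.
    if not dist.is_initialized():
        dist.init_process_group("nccl")
    transforms = Compose([LoadImaged(keys="image"), EnsureChannelFirstd(keys="image")])
    # runtime_cache=True keeps the cache in a shared ListProxy so all DDP ranks
    # fill and read the same cache entries at runtime.
    train_ds = CacheDataset(data=data_dicts, transform=transforms, runtime_cache=True)
    sampler = DistributedSampler(train_ds)
    # num_workers=0 is the case reported above: constructing the DataLoader is
    # where the ListProxy -> list conversion currently crashes.
    return DataLoader(train_ds, batch_size=batch_size, sampler=sampler,
                      num_workers=0, pin_memory=True)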


Top Results From Across the Web

  • possible deadlock in dataloader · Issue #1355 - GitHub
    Everything works fine without DDP. However, when I create a dataloader for validation only on rank==0, this dataloader freezes if num_workers>0 ...
  • Speed Up Model Training - PyTorch Lightning - Read the Docs
    The problem is that PyTorch has issues with num_workers>0 when using .spawn(). For this reason, we recommend you use strategy="ddp" so you...
  • Data — MONAI 1.1.0 Documentation
    CacheDataset executes non-random transforms and prepares cache content in the main process before the first epoch, then all the subprocesses of DataLoader will ...
  • DataLoader crashes when shuffling - Stack Overflow
    RuntimeError: DataLoader worker (pid(s) 3978) exited unexpectedly. This error is because, In data.DataLoader(dataset, batch_size=32, ...
  • DataLoaders Explained: Building a Multi-Process Data Loader ...
    while True: ... This data loader will spawn num_workers workers upon its ... else: # item isn't the one we want, cache for...
