
CacheDataset crashes with runtime_cache=True, num_workers=0 in DDP

This happens because we try to convert the shared cache list (a ListProxy) to a regular list, but in DDP we must keep it as a ListProxy so that the shared items can be accessed from different processes.

The exact error is the traceback below: the ListProxy cannot be converted. But even if the conversion succeeded, we should not do it in DDP, since a ListProxy is still needed.
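
To make the ListProxy point concrete, here is a minimal illustration (an assumption for clarity, not MONAI code) of why converting the shared cache requires a live manager connection:

from multiprocessing import Manager

if __name__ == "__main__":
    # With a runtime cache, the cache is a shared ListProxy rather than a
    # plain Python list (placeholder contents here).
    manager = Manager()
    shared_cache = manager.list([None] * 4)

    # list(shared_cache) round-trips through the manager process that owns
    # the proxy; if that manager is not reachable from the calling process,
    # the call fails with a connection error like the FileNotFoundError in
    # the traceback below.
    plain_copy = list(shared_cache)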

EDIT: expected behavior below:

  • it should not crash
  • we should not call disable_share_memory_cache in DataLoader when running in DDP with runtime_cache, even for num_workers==0, because the ListProxy is still needed (different DDP processes must read/write the same cache indices). Converting ListProxy -> list in that case gives every process its own copy of the cache (potentially going OOM). A sketch of such a guard follows the traceback below.
Traceback (most recent call last):                                                                                                                              
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap                                                               
    fn(i, *args)                                                                                                                                                
  File "/scripts/segmenter.py", line 1032, in run_segmenter_worker                                       
    best_metric = segmenter.run()                                                                                                                               
  File "//scripts/segmenter.py", line 696, in run
    self.train()
  File "/scripts/segmenter.py", line 732, in train
    train_loader = DataLoader(train_ds, batch_size=config["batch_size"], shuffle=(train_sampler is None),  num_workers=config["num_workers"], sampler=train_sampler, pin_memory=True)
  File "/mnt/amproj/Code/MONAI/monai/data/dataloader.py", line 87, in __init__
    dataset.disable_share_memory_cache() 
  File "/MONAI/monai/data/dataset.py", line 855, in disable_share_memory_cache
    self._cache = list(self._cache)
  File "<string>", line 2, in __len__
  File "/usr/lib/python3.8/multiprocessing/managers.py", line 831, in _callmethod
    self._connect()
  File "/usr/lib/python3.8/multiprocessing/managers.py", line 818, in _connect
    conn = self._Client(self._token.address, authkey=self._authkey)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 502, in Client
    c = SocketClient(address)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 630, in SocketClient
    s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory
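
The expected behavior above could be expressed as a guard around the disable_share_memory_cache call. This is only a sketch under stated assumptions (the helper and attribute names are hypothetical, not MONAI's actual API); it shows the condition, not the actual fix:

import torch.distributed as dist

def _can_disable_shared_cache(dataset, num_workers: int) -> bool:
    # Keep the shared ListProxy when running under DDP with a runtime cache,
    # even for num_workers == 0, so every rank reads/writes the same cache
    # entries instead of materialising a private copy (which may go OOM).
    in_ddp = dist.is_available() and dist.is_initialized()
    uses_runtime_cache = bool(getattr(dataset, "runtime_cache", False))  # hypothetical attribute name
    return num_workers == 0 and not (in_ddp and uses_runtime_cache)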

Issue Analytics

  • State: closed
  • Created 10 months ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
Nic-Ma commented, Nov 30, 2022

Hi @myron, @wyli,

Sorry for the late response, I was busy with other projects. I personally feel the sleep logic in @wyli's PR is not very robust; the "0.5s" wait is a bit of a hack:

if torch.distributed.get_rank() == 0:
    time.sleep(0.5)  # rank 0 should exit after all the other ranks, as the cache was broadcast from rank 0

If we don't have any better solution, I would vote for @myron's PR to drop the gpu-cache + runtime-cache + DDP feature: we lose the feature but make the overall logic more robust. What do you think?

Thanks.
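
For context only (this is not something proposed in the thread, just a hedged sketch of the kind of synchronization the fixed sleep approximates): an explicit barrier before teardown ensures no other rank is still using the shared cache when rank 0 shuts down.

import torch.distributed as dist

def finish_rank():
    if dist.is_available() and dist.is_initialized():
        # All ranks wait here after their last access to the shared cache,
        # so rank 0 does not tear it down while another rank may still read it.
        dist.barrier()
        dist.destroy_process_group()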

0 reactions
myron commented, Nov 28, 2022

@Nic-Ma you decide please. I only need the shared cache working with num_workers==0 in DDP.
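
To make the requested configuration concrete, here is a minimal per-rank sketch with a shared runtime cache and num_workers=0 (the transforms, data dicts, and launcher environment variables are placeholder assumptions):

import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler
from monai.data import CacheDataset, DataLoader
from monai.transforms import Compose, LoadImaged, EnsureChannelFirstd

def build_train_loader(data_dicts, batch_size=2):
    # Assumes the DDP launcher has already set MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE.
    if not dist.is_initialized():
        dist.init_process_group("nccl")
    transforms = Compose([LoadImaged(keys="image"), EnsureChannelFirstd(keys="image")])
    # runtime_cache=True keeps the cache in a shared ListProxy so all DDP ranks
    # fill and read the same cache entries at runtime.
    train_ds = CacheDataset(data=data_dicts, transform=transforms, runtime_cache=True)
    sampler = DistributedSampler(train_ds)
    # num_workers=0 is the case reported above: constructing the DataLoader is
    # where the ListProxy -> list conversion currently crashes.
    return DataLoader(train_ds, batch_size=batch_size, sampler=sampler,
                      num_workers=0, pin_memory=True)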


Top Results From Across the Web

  • possible deadlock in dataloader · Issue #1355 - GitHub
    Everything works fine without DDP. However, when I create a dataloader for validation only on rank==0, this dataloader freezes if num_workers>0 ...
  • Speed Up Model Training - PyTorch Lightning - Read the Docs
    The problem is that PyTorch has issues with num_workers>0 when using .spawn(). For this reason, we recommend you use strategy="ddp" so you...
  • Data — MONAI 1.1.0 Documentation
    CacheDataset executes non-random transforms and prepares cache content in the main process before the first epoch, then all the subprocesses of DataLoader will ...
  • DataLoader crashes when shuffling - Stack Overflow
    RuntimeError: DataLoader worker (pid(s) 3978) exited unexpectedly. This error is because, In data.DataLoader(dataset, batch_size=32, ...
  • DataLoaders Explained: Building a Multi-Process Data Loader ...
    while True: ... This data loader will spawn num_workers workers upon its ... else: # item isn't the one we want, cache for...
