CacheDataset crashes with runtime_cache=True, num_workers=0 in DDP
This happens because MONAI tries to convert the shared cache list (a `ListProxy`) into a regular `list`, but in DDP the cache must remain a `ListProxy` so the shared items stay accessible from the different processes.
The exact error is shown in the traceback below: the conversion itself fails. But even if the conversion succeeded, it should not be done in DDP, because a `ListProxy` is still required there.
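For context, here is a small standalone sketch (plain `multiprocessing`, not MONAI code) of the property the issue depends on: every process mutates the same manager-backed list, whereas `list(proxy)` would hand each process its own private copy.

```python
# Standalone demo of why the cache must stay a ListProxy across processes.
import multiprocessing as mp

def fill(shared_cache, idx):
    # each worker writes its cached item into the shared slot
    shared_cache[idx] = f"item-{idx} cached by {mp.current_process().name}"

if __name__ == "__main__":
    with mp.Manager() as manager:
        cache = manager.list([None] * 4)  # ListProxy shared across processes
        procs = [mp.Process(target=fill, args=(cache, i)) for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(list(cache))  # all four slots filled, visible to the parent
```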
EDIT: expected behavior below:
- it should not crash
- `DataLoader` should not call `disable_share_memory_cache()` when running in DDP with `runtime_cache`, even for `num_workers==0`, because the `ListProxy` is still needed (it is what lets the different DDP processes read/write the same cache indices). Converting the `ListProxy` to a `list` in that case would create a full copy of the cache in each process, potentially going OOM. A sketch of such a guard follows this list.
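A minimal sketch of that guard, assuming access to the dataset inside MONAI's `DataLoader.__init__`; the helper name `_maybe_disable_share_memory_cache` is hypothetical, not the actual patch:

```python
# Hypothetical guard sketch; the function name is illustrative only.
import torch.distributed as dist

def _maybe_disable_share_memory_cache(dataset, num_workers: int) -> None:
    """Convert the ListProxy cache to a plain list only when it is safe.

    Under DDP every rank must keep reading/writing the same ListProxy, so
    the conversion is skipped whenever a process group is initialized,
    even when num_workers == 0.
    """
    in_ddp = dist.is_available() and dist.is_initialized()
    if num_workers == 0 and not in_ddp:
        # single process, no DDP: a plain list avoids manager round-trips
        dataset.disable_share_memory_cache()
```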
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/scripts/segmenter.py", line 1032, in run_segmenter_worker
    best_metric = segmenter.run()
  File "//scripts/segmenter.py", line 696, in run
    self.train()
  File "/scripts/segmenter.py", line 732, in train
    train_loader = DataLoader(train_ds, batch_size=config["batch_size"], shuffle=(train_sampler is None), num_workers=config["num_workers"], sampler=train_sampler, pin_memory=True)
  File "/mnt/amproj/Code/MONAI/monai/data/dataloader.py", line 87, in __init__
    dataset.disable_share_memory_cache()
  File "/MONAI/monai/data/dataset.py", line 855, in disable_share_memory_cache
    self._cache = list(self._cache)
  File "<string>", line 2, in __len__
  File "/usr/lib/python3.8/multiprocessing/managers.py", line 831, in _callmethod
    self._connect()
  File "/usr/lib/python3.8/multiprocessing/managers.py", line 818, in _connect
    conn = self._Client(self._token.address, authkey=self._authkey)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 502, in Client
    c = SocketClient(address)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 630, in SocketClient
    s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory
```
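For reference, a minimal reproduction sketch matching the shape of the traceback (dataset built in the parent, workers launched via `mp.spawn`); the data, transform, port, and `CacheDataset` arguments are illustrative assumptions, not taken from the original scripts:

```python
# Illustrative repro sketch; arguments are assumptions, not from the issue.
import torch.distributed as dist
import torch.multiprocessing as mp
from monai.data import CacheDataset, DataLoader
from monai.transforms import Identity

def worker(rank: int, world_size: int, ds: CacheDataset) -> None:
    dist.init_process_group(
        "gloo", rank=rank, world_size=world_size,
        init_method="tcp://127.0.0.1:29500",
    )
    # With num_workers=0, monai.data.DataLoader calls
    # dataset.disable_share_memory_cache(), which attempts list(ListProxy)
    # in the spawned process and fails as in the traceback above.
    loader = DataLoader(ds, batch_size=2, num_workers=0)
    for _ in loader:
        pass
    dist.destroy_process_group()

if __name__ == "__main__":
    # runtime_cache=True backs the cache with a shared ListProxy
    ds = CacheDataset(data=list(range(8)), transform=Identity(), runtime_cache=True)
    mp.spawn(worker, args=(2, ds), nprocs=2)
```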
Top GitHub Comments
Hi @myron , @wyli ,
Sorry for the late response, I was busy with another project. I personally feel the `sleep` logic in @wyli 's PR is not very robust; the "0.5s" timing is slightly hacky. If we don't have any better solution, I would vote for @myron 's PR to drop the `gpu-cache + runtime-cache + ddp` feature. We drop the feature but make the overall logic more robust? What do you think? Thanks.
@Nic-Ma you decide please. I only need the shared cache working with `num_workers==0` in DDP.