import cv2 + nvidia/pytorch:22.09-py3 + DistributedDataParallel. (FIND was unable to find an engine)
EDIT: the bug is reproducible in the newest nvidia/pytorch:22.09-py3 Docker container, but not in older containers (older PyTorch/cuDNN).
Something in MetaTensor makes DistributedDataParallel fail (this is in addition to this bug: https://github.com/Project-MONAI/MONAI/issues/5283).
For example, this code fails:
import torch.distributed as dist
import torch
from monai.data import MetaTensor
#from monai.config.type_definitions import NdarrayTensor
from torch.cuda.amp import autocast

torch.autograd.set_detect_anomaly(True)


def main():
    ngpus_per_node = torch.cuda.device_count()
    torch.multiprocessing.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node,))


def main_worker(rank, ngpus_per_node):
    print(f"rank {rank}")
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=ngpus_per_node, rank=rank)
    torch.backends.cudnn.benchmark = True
    model = torch.nn.Conv3d(in_channels=1, out_channels=32, kernel_size=3, bias=True).to(rank)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank], output_device=rank, find_unused_parameters=False)
    x = torch.ones(1, 1, 192, 192, 192).to(rank)
    with autocast(enabled=True):
        out = model(x)
    print("Done.", out.shape)


if __name__ == "__main__":
    main()
with error
-- Process 6 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/mnt/amproj/Code/automl/tasks/hecktor22/autoconfig_segresnet/test_monai.py", line 29, in main_worker
    out = model(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1015, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 976, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 613, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 608, in _conv_forward
    return F.conv3d(
RuntimeError: FIND was unable to find an engine to execute this computation
MetaTensor is never actually used or initialized here, but something in it (or its imports) makes the code fail. Since we import MetaTensor everywhere, any code that imports it fails. I've traced it down to this import (inside MetaTensor.py):
from monai.config.type_definitions import NdarrayTensor
Importing just this line also makes the code fail.
Somehow it confuses the conv3d operation, and possibly other operations.
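One way to narrow down which transitive import is responsible is to compare sys.modules before and after the suspect import. The following is only a minimal diagnostic sketch, not something used in the original report: the helper names modules_loaded_by and pulls_in are hypothetical, and it relies only on the standard library.

```python
import importlib
import sys


def modules_loaded_by(module_name):
    """Return the names of modules newly added to sys.modules by importing module_name."""
    before = set(sys.modules)
    importlib.import_module(module_name)
    return sorted(set(sys.modules) - before)


def pulls_in(module_name, suspect):
    """Return True if importing module_name newly loads `suspect` or one of its submodules."""
    return any(
        name == suspect or name.startswith(suspect + ".")
        for name in modules_loaded_by(module_name)
    )
```

In a fresh interpreter inside the affected container, a call such as pulls_in("monai.config.type_definitions", "cv2") would show whether that import drags in cv2. Modules that were already loaded before the call are not reported, so the check has to run before anything else imports them.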
Top GitHub Comments
I see, very good, thank you guys
It's already been addressed by https://github.com/Project-MONAI/MONAI/pull/5293 (by not importing cv2, https://github.com/Project-MONAI/MONAI/blob/dev/monai/__init__.py#L50), with a test case included. What Nic mentions is a possible alternative solution in case cv2 is imported for some other purposes.
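The general idea of not importing an optional dependency at package import time can be illustrated with a deferred import. This is a hedged sketch of that pattern only, not the code from the pull request and not the alternative Nic describes; the helper name load_on_demand is hypothetical.

```python
import importlib

# Cache of modules already requested, so each import is attempted only once.
_loaded = {}


def load_on_demand(name):
    """Import `name` on first use and return (module, True), or (None, False) if unavailable."""
    if name not in _loaded:
        try:
            _loaded[name] = importlib.import_module(name)
        except ImportError:
            _loaded[name] = None
    module = _loaded[name]
    return module, module is not None
```

With this pattern, a function that needs cv2 would call load_on_demand("cv2") inside its body, so merely importing the package never loads cv2 in processes that only run the DistributedDataParallel forward pass.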