
import cv2 + nvidia/pytorch:22.09-py3 + DistributedDataParallel. (FIND was unable to find an engine)

See original GitHub issue

EDIT: the bug is reproducible in the newest nvidia/pytorch:22.09-py3 Docker container, but is not reproducible in older containers (older PyTorch/cuDNN).

Something in MetaTensor makes DistributedDataParallel fail (this is in addition to the bug reported in https://github.com/Project-MONAI/MONAI/issues/5283).

For example, this code fails:

import torch.distributed as dist
import torch

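# MetaTensor is never used below; importing it (and its transitive imports) is what triggers the failure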
from monai.data import MetaTensor
#from monai.config.type_definitions import NdarrayTensor

from torch.cuda.amp import autocast  
torch.autograd.set_detect_anomaly(True)

def main():

    ngpus_per_node = torch.cuda.device_count()
    torch.multiprocessing.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node,))

def main_worker(rank, ngpus_per_node):

    print(f"rank {rank}")

    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=ngpus_per_node, rank=rank)
    torch.backends.cudnn.benchmark = True

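    # A single Conv3d wrapped in DistributedDataParallel is enough to hit the error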
    model = torch.nn.Conv3d(in_channels=1, out_channels=32, kernel_size=3, bias=True).to(rank)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank], output_device=rank, find_unused_parameters=False)

    x = torch.ones(1, 1, 192, 192, 192).to(rank)
    with autocast(enabled=True):
        out = model(x)

    print("Done.", out.shape)

if __name__ == "__main__":
    main()

with the following error:

-- Process 6 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/mnt/amproj/Code/automl/tasks/hecktor22/autoconfig_segresnet/test_monai.py", line 29, in main_worker
    out = model(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1015, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 976, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 613, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 608, in _conv_forward
    return F.conv3d(
RuntimeError: FIND was unable to find an engine to execute this computation

The MetaTensor is actually never used/initialized here, but something in it (or its imports) makes the code fail. Since we import MetaTensor everywhere, any code that touches it fails. I’ve traced it down to this import (inside of MetaTensor.py): from monai.config.type_definitions import NdarrayTensor

Importing just this line also makes the code fail.

Somehow it confuses the conv3d operation, and possibly other operations as well.
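
For context (based on the issue title and the fix referenced in the comments below, both of which point at cv2): the MONAI import can be dropped from the reproduction entirely and replaced with a bare import cv2, which appears to be enough to trigger the same failure on the affected container. A minimal sketch, assuming cv2 is installed and the same multi-GPU setup as above:

import cv2  # importing OpenCV before the DDP forward pass is the suspected trigger

import torch
import torch.distributed as dist
from torch.cuda.amp import autocast


def main():
    ngpus_per_node = torch.cuda.device_count()
    torch.multiprocessing.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node,))


def main_worker(rank, ngpus_per_node):
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456',
                            world_size=ngpus_per_node, rank=rank)
    torch.backends.cudnn.benchmark = True

    # Same toy model as above: a single Conv3d wrapped in DistributedDataParallel.
    model = torch.nn.Conv3d(in_channels=1, out_channels=32, kernel_size=3, bias=True).to(rank)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank], output_device=rank)

    x = torch.ones(1, 1, 192, 192, 192).to(rank)
    with autocast(enabled=True):
        out = model(x)  # expected to raise "FIND was unable to find an engine" on the affected container

    print("Done.", out.shape)


if __name__ == "__main__":
    main()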

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 16 (9 by maintainers)

Top GitHub Comments

1 reaction
myron commented, Oct 11, 2022

I see, very good, thank you guys

1 reaction
wyli commented, Oct 11, 2022

It’s already been addressed by https://github.com/Project-MONAI/MONAI/pull/5293 (by not importing cv2, https://github.com/Project-MONAI/MONAI/blob/dev/monai/__init__.py#L50), with a test case included. What Nic mentions is a possible alternative solution in case cv2 is imported for some other purposes.
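
A quick way to check whether the MONAI version installed in a given container still imports cv2 eagerly (this check is a suggestion, not something from the thread) is to look at sys.modules after importing monai:

import sys

import monai  # noqa: F401

# Before the fix in PR #5293, importing monai pulled in OpenCV at package import
# time; after it, "cv2" should not appear here unless something else imported it.
print("cv2 imported as a side effect of monai:", "cv2" in sys.modules)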

Top Results From Across the Web

  • import cv2 and DistributedDataParallel. (FIND was unable to ...)
    it seems it’s triggered by import cv2, on driver 470.82.01 and nvcr.io/nvidia/pytorch:22.09-py3 (the root cause is not really from ...
  • "DLL load failed" when import cv2 (opencv) - Stack Overflow
    This can happen if you are using the Windows 10 N distribution; the N distributions do not come pre-installed with the Windows Media Feature...
  • DistributedDataParallel constructor hangs when using nccl
    Bug: DistributedDataParallel hangs on the constructor call when ... from torch.nn.parallel import DistributedDataParallel class ToyModel(nn.
  • Distributed Data Parallel — PyTorch 1.13 documentation
    After that, parameters on the local model will be updated, and all models on different processes should be exactly the same. import torch...
  • installed opencv but can't import cv2 - You.com
    Still in Anaconda prompt, run pip install opencv-python; when you use conda list you should see a single OpenCV present. In your...
