
DynUnet get_feature_maps() issue when using distributed training

See original GitHub issue

Describe the bug
If we use DistributedDataParallel to wrap the network, calling the get_feature_maps() function raises the following error:

[Screenshot: error traceback, Feb 8, 2021]
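The screenshot is not reproduced here, so the exact traceback is unknown. One common way this pattern fails (an assumption, not confirmed by the source) is that DistributedDataParallel does not proxy custom methods of the wrapped module, so calling get_feature_maps() on the wrapper raises an AttributeError. A minimal sketch, assuming a MONAI version contemporary with this issue (~0.4, where DynUNet still exposed get_feature_maps()) and illustrative constructor arguments:

```python
# Hedged repro sketch: the original error is only in a screenshot, so the
# AttributeError shown below is an assumed failure mode, not a confirmed one.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from monai.networks.nets import DynUNet

# Single-process "gloo" group, just to make DDP constructible locally.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

net = DynUNet(
    spatial_dims=3,
    in_channels=1,
    out_channels=2,
    kernel_size=[3, 3, 3, 3],
    strides=[1, 2, 2, 2],
    upsample_kernel_size=[2, 2, 2],
    deep_supervision=True,
)
model = DDP(net)
model.train()
out = model(torch.randn(1, 1, 32, 32, 32))

# DistributedDataParallel does not proxy custom methods of the wrapped
# module, so this raises AttributeError:
# model.get_feature_maps()

# Workaround: reach through the wrapper to the underlying module.
feature_maps = model.module.get_feature_maps()
dist.destroy_process_group()
```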

It is common to return multiple variables from a network’s forward function, not only in this case but also in other situations, such as computing a triplet loss in metric-learning tasks. Therefore, I think we should reconsider what the forward function of DynUnet returns.
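A generic illustration of that point (not MONAI code; the class and layer names are invented): returning everything the downstream losses need directly from forward means the tensors survive any wrapper that only sees forward's return value.

```python
# A forward that returns all tensors the losses need, instead of stashing
# them on the module for a separate getter to fetch later.
import torch
import torch.nn as nn
from typing import Tuple

class EmbeddingNet(nn.Module):
    def __init__(self, in_features: int = 32, embed_dim: int = 16, num_classes: int = 4):
        super().__init__()
        self.backbone = nn.Linear(in_features, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        emb = self.backbone(x)         # e.g. fed to a triplet loss
        logits = self.classifier(emb)  # e.g. fed to a classification loss
        return emb, logits

# Both tensors come out of forward, so any wrapper that only sees
# forward's return value (DDP, DataParallel) still delivers them.
emb, logits = EmbeddingNet()(torch.randn(8, 32))
```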

Let me think about it and submit a PR, then we can discuss it later. @Nic-Ma @wyli @rijobro

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

1 reaction
yiheng-wang-nv commented, Feb 19, 2021

Hi @rijobro , thanks for the advice, but returning this kind of dict doesn’t fix the TorchScript issue that @Nic-Ma mentioned. I suggest we still return a list and add the corresponding docstrings. To help users use this network, I will also add/update the tutorials. Let me submit a PR first for you to review.
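For context on the TorchScript issue referenced here, a sketch of the general constraint (not the exact error from the thread): torch.jit.script requires a single static return type, so a forward that returns a Tensor in one branch and a List[Tensor] in another will not compile. One common workaround is to keep the return type fixed, for example by stacking the main output and the supervision heads into one tensor along a new dimension.

```python
# Sketch of the TorchScript constraint: scripted code needs one static
# return type, so a train/eval type split does not compile.
import torch
import torch.nn as nn

class SplitReturnNet(nn.Module):
    def forward(self, x: torch.Tensor):
        if self.training:
            return [x, x * 2]  # List[Tensor]
        return x               # Tensor

# Raises a RuntimeError because the two return types do not unify:
# torch.jit.script(SplitReturnNet())
```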

1 reaction
Nic-Ma commented, Feb 9, 2021

Hi @rijobro ,

I think most of the previous problems are due to the list output during validation or inference. So I suggest returning a list of outputs during training, and returning only the output itself instead of [out] during validation or inference.

Thanks.
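A minimal sketch of the pattern suggested above (the module and layer names are invented for illustration): return the list of outputs only while training, and the plain tensor during validation or inference. Note that this is exactly the train/eval type split the TorchScript discussion above concerns, which is part of why the thread kept iterating on the design.

```python
import torch
import torch.nn as nn

class DeepSupervisionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Conv2d(1, 8, 3, padding=1)
        self.main_head = nn.Conv2d(8, 2, 1)
        self.aux_head = nn.Conv2d(8, 2, 1)

    def forward(self, x: torch.Tensor):
        feat = self.body(x)
        out = self.main_head(feat)
        if self.training:
            # Training: main output plus auxiliary head, for the loss.
            return [out, self.aux_head(feat)]
        # Validation/inference: the output itself, not [out].
        return out

net = DeepSupervisionNet()
net.train()
outs = net(torch.randn(1, 1, 16, 16))   # list of two tensors
net.eval()
pred = net(torch.randn(1, 1, 16, 16))   # single tensor
```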


Top Results From Across the Web

  • Source code for monai.networks.nets.dynunet
    “This reimplementation of a dynamic UNet (DynUNet) is based on: Automated Design of Deep Learning Methods for Biomedical Image Segmentation ...”
  • Demystifying Developers’ Issues in Distributed Training of ...
    “Specifically, we collect a dataset of 1,054 distributed-training-related developers’ issues that occur during the use of these frameworks from ...”
  • Distributed and Parallel Training Tutorials - PyTorch
    “This tutorial demonstrates how to get started with RPC-based distributed training. Implementing a Parameter Server Using Distributed RPC Framework. This ...”
  • How distributed training works in Pytorch - AI Summer
    “In this tutorial, we will learn how to use nn.parallel.DistributedDataParallel for training our models in multiple GPUs.”
  • Distributed Training - Determined AI Documentation
    “Access to all IP addresses of every node in the Trial (through the ClusterInfo API). Communication primitives such as allgather(), gather(), ...”
