
Investigate if model_without_ddp is needed

See original GitHub issue

🐛 Describe the bug

Investigate if we need model_without_ddp in the training script. https://github.com/pytorch/vision/blob/12fd3a625a044a454cca3dbb2187e78efe1b4596/references/classification/train.py#L201
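
For context, the pattern under investigation looks roughly like this (a simplified sketch of the linked train.py; get_model, distributed and gpu are illustrative stand-ins, not the script’s exact names):

import torch

model = get_model()  # stand-in for the torchvision model factory
model_without_ddp = model
if distributed:
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
    model_without_ddp = model.module  # handle on the unwrapped model

# Checkpoints are saved from the unwrapped model, so their keys carry no "module." prefix:
checkpoint = {"model": model_without_ddp.state_dict()}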

Versions

N/A

cc @datumbox

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

2 reactions
datumbox commented, Sep 15, 2021

@fmassa Thanks for providing background on why this was added.

So basically, this workaround makes the weights easier to handle after training is completed (that is, outside of the train.py script).

Two thoughts on eliminating the non-parallelized version:

  1. For the users of our library who just take the pre-trained weights, this has no effect. It’s us, the contributors, who train the models, prepare them (verify, produce hashes, upload to S3, etc.) and make them available. So we could easily adjust our process to do the extra step of removing .module without real issues.
  2. For the users of the references, this could potentially become a source of frustration, as they would have to take the checkpoints, remove .module with the aforementioned method, and then use the weights (see the sketch after this list).
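
If the non-parallelized version were dropped, the workflow in point 2 would look roughly like this (a sketch; the checkpoint filename and the "model" key are assumptions about how the checkpoint dict is laid out):

import torch
import torchvision
from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present

checkpoint = torch.load("checkpoint.pth", map_location="cpu")  # hypothetical path
state_dict = checkpoint["model"]  # keys would look like "module.conv1.weight", ...
consume_prefix_in_state_dict_if_present(state_dict, "module.")  # strips the prefix in place

model = torchvision.models.resnet50()
model.load_state_dict(state_dict)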

I don’t have a very strong opinion on this, but I’m leaning towards keeping it for the time being. Yes, it’s a bit annoying to keep the non-parallelized version around, but it does eliminate potential frustration for new users of the library. Thoughts?

1 reaction
fmassa commented, Sep 14, 2021

@prabhat00155 what you would need is to try loading the serialized checkpoint into a model that hasn’t been wrapped in DDP yet.

Something as simple as

import torch
import torchvision

model = torchvision.models.resnet50()
model.load_state_dict(torch.load(path_to_checkpoint))  # path_to_checkpoint: a checkpoint saved from a DDP-wrapped model

would fail because of the module. prefix that DDP prepends to every key in the state dict, so you would need to use tools like torch.nn.modules.utils.consume_prefix_in_state_dict_if_present, which is pretty new and was added to PyTorch less than six months ago: https://github.com/pytorch/pytorch/pull/53224
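
As a minimal, self-contained illustration of both the failure and the fix (simulating the DDP prefix by hand rather than spinning up an actual process group):

import torch
import torchvision
from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present

model = torchvision.models.resnet50()

# Simulate a state dict saved from a DDP-wrapped model: every key gains "module.".
sd = {"module." + k: v for k, v in model.state_dict().items()}

try:
    model.load_state_dict(sd)
except RuntimeError as e:
    print(e)  # complains about unexpected "module.*" keys and missing plain keys

consume_prefix_in_state_dict_if_present(sd, "module.")
model.load_state_dict(sd)  # succeeds once the prefix is stripped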

I would be OK removing the current model_without_ddp in torchvision if we use a newer and better way provided by PyTorch, but I’m not sure the current torch.nn.modules.utils.consume_prefix_in_state_dict_if_present is enough for that (at the very least, it would need some thought to make sure all cases are handled properly).

