
Investigate if model_without_ddp is needed

See original GitHub issue

🐛 Describe the bug

Investigate if we need model_without_ddp in the training script. https://github.com/pytorch/vision/blob/12fd3a625a044a454cca3dbb2187e78efe1b4596/references/classification/train.py#L201
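
For context, the pattern under investigation looks roughly like this (a simplified sketch of the linked train.py; get_model, distributed and gpu are illustrative stand-ins, not the script’s exact names):

import torch

model = get_model()  # stand-in for the torchvision model factory
model_without_ddp = model
if distributed:
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
    model_without_ddp = model.module  # handle on the unwrapped model

# Checkpoints are saved from the unwrapped model, so their keys carry no "module." prefix:
checkpoint = {"model": model_without_ddp.state_dict()}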

Versions

N/A

cc @datumbox

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

2 reactions
datumbox commented, Sep 15, 2021

@fmassa Thanks for providing background on why this was added.

So basically, this workaround makes the weights easier to handle after training is completed (that is, outside of the train.py script).

Two thoughts on eliminating the non-parallelized version:

  1. For the users of our library who just take the pre-trained weights, this has no effect. It’s us, the contributors, who train the models, prepare them (verify, produce hashes, upload to S3, etc.) and make them available. So we could easily adjust our process to do the extra step of removing .module without real issues.
  2. For the users of the references, this could potentially become a source of frustration, as they would have to take the checkpoints, remove .module with the aforementioned method, and then use the weights (see the sketch after this list).
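
If the non-parallelized version were dropped, the workflow in point 2 would look roughly like this (a sketch; the checkpoint filename and the "model" key are assumptions about how the checkpoint dict is laid out):

import torch
import torchvision
from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present

checkpoint = torch.load("checkpoint.pth", map_location="cpu")  # hypothetical path
state_dict = checkpoint["model"]  # keys would look like "module.conv1.weight", ...
consume_prefix_in_state_dict_if_present(state_dict, "module.")  # strips the prefix in place

model = torchvision.models.resnet50()
model.load_state_dict(state_dict)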

I don’t have a very strong opinion on this, but I’m leaning towards keeping it for the time being. Yes, it’s a bit annoying to keep the non-parallelized version around, but it does eliminate potential frustration for new users of the library. Thoughts?

1 reaction
fmassa commented, Sep 14, 2021

@prabhat00155 what you would need is to try loading the serialized checkpoint into a model that hasn’t been wrapped in DDP yet.

Something as simple as

import torch
import torchvision

model = torchvision.models.resnet50()
model.load_state_dict(torch.load(path_to_checkpoint))  # path_to_checkpoint: a checkpoint saved from a DDP-wrapped model

would fail because of the module. prefix that DDP prepends to every key in the state dict, so you would need to use tools like torch.nn.modules.utils.consume_prefix_in_state_dict_if_present, which is pretty new and was added to PyTorch less than six months ago: https://github.com/pytorch/pytorch/pull/53224
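
As a minimal, self-contained illustration of both the failure and the fix (simulating the DDP prefix by hand rather than spinning up an actual process group):

import torch
import torchvision
from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present

model = torchvision.models.resnet50()

# Simulate a state dict saved from a DDP-wrapped model: every key gains "module.".
sd = {"module." + k: v for k, v in model.state_dict().items()}

try:
    model.load_state_dict(sd)
except RuntimeError as e:
    print(e)  # complains about unexpected "module.*" keys and missing plain keys

consume_prefix_in_state_dict_if_present(sd, "module.")
model.load_state_dict(sd)  # succeeds once the prefix is stripped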

I would be OK removing the current model_without_ddp in torchvision if we use a newer and better way provided by PyTorch, but I’m not sure the current torch.nn.modules.utils.consume_prefix_in_state_dict_if_present is enough for that (at the very least, it would need some thought to make sure all cases are handled properly).

