PyTorch Distributed: next level of tests
The HuggingFace accelerate integrations page provides a good list of all the libraries that need to be tested to claim that dynamo/inductor works fine on a single node and in the distributed setting: https://github.com/huggingface/accelerate#supported-integrations.
So far @wconstab's efforts have focused on getting DDP and FSDP to work, but we should also test more end-to-end libraries that wrap DDP or FSDP.
Here's an example from PyTorch Lightning. Does this already work? Probably, but it's worth testing.
Lightning
trainer = Trainer(devices=4, accelerator="gpu", strategy="ddp")
Specifically for Lightning, I'm curious whether their callbacks have a strange interaction with an optimized nn.Module.
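A minimal sketch of the ordering in question, compiling the network before Lightning's Trainer wraps it in DDP. Only the plain-torch part executes here; the Trainer lines and the `lit_module_that_owns` helper are illustrative assumptions, not Lightning APIs.

```python
import torch
import torch.nn as nn

# Compile the network first; backend="eager" keeps this runnable without
# a GPU compiler stack (it exercises dynamo tracing but skips inductor).
net = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
net = torch.compile(net, backend="eager")

out = net(torch.randn(4, 8))
print(out.shape)  # torch.Size([4, 1])

# With Lightning this would look roughly like (not executed here):
# trainer = Trainer(devices=4, accelerator="gpu", strategy="ddp")
# trainer.fit(lit_module_that_owns(net), datamodule)  # hypothetical wrapper
```

The open question is whether Lightning's callback hooks, which touch the module at various points in the loop, behave the same when handed an `OptimizedModule` instead of the raw `nn.Module`.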
Accelerate
Another one comes from the accelerate repo itself, which has the popular pattern def prepare(nn.Module) -> nn.Module:
model, optimizer, data = accelerator.prepare(model, optimizer, data)
In this case a top-level application of the dynamo decorator might optimize both the model that needs to be prepared and the final model. Should we instead make it clearer in the documentation that these library maintainers need to apply the dynamo decorator inside the prepare function? In that case, what happens if a user wraps a model with a dynamo decorator that is itself wrapped in another dynamo decorator?
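A sketch of the "compile inside prepare" option. The `prepare` function below is a hypothetical stand-in for a library API like `accelerator.prepare`, not the real accelerate implementation; `backend="eager"` keeps it runnable without a GPU.

```python
import torch
import torch.nn as nn

def prepare(model: nn.Module) -> nn.Module:
    # Hypothetical library-side prepare(): a real library would wrap the
    # model in DDP/FSDP here first, then apply the dynamo optimization
    # exactly once, so users never have to think about wrap ordering.
    return torch.compile(model, backend="eager")

model = prepare(nn.Linear(4, 2))
out = model(torch.randn(3, 4))
print(out.shape)  # torch.Size([3, 2])
```

If users then apply the dynamo decorator on top of this, we get the double-wrapping scenario the question above asks about.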
Ray
Ray Train has a similar wrapper that prepares models by wrapping them in DDP:
model = ray.train.torch.prepare_model(nn.Linear(4, 1))
torchrun
Finally, end-to-end workflows might include scripts that set up a cluster, ssh into machines, and set environment variables, so as a good proxy, having an example of torchrun train_script.py working would derisk things quite a bit in more real-world scenarios.
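A sketch of what that proxy invocation could look like on a single node. `train_script.py` is the user's training script; the flags are standard torchrun options, and the multi-node variant with explicit rendezvous settings is shown as comments.

```shell
# Single-node, 4-GPU launch; --standalone starts a local rendezvous.
torchrun --standalone --nproc_per_node=4 train_script.py

# Multi-node sketch (values are placeholders for a real cluster):
# torchrun --nnodes=2 --nproc_per_node=8 \
#   --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
#   train_script.py
```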
Next steps
Once the above workflow is nailed down, it makes sense to look at supporting DeepSpeed, XLA, and Megatron. Specifically for DeepSpeed, should we for now just recommend that users optimize the model before applying the wrapping strategy?
Issue Analytics
- State:
- Created: a year ago
- Reactions: 4
- Comments: 6 (6 by maintainers)
Top GitHub Comments
@msaroufim I think we should start a new issue for DeepSpeed and start working with the msft team (I sent an opener out on Slack, will follow up). XLA enablement also makes sense to track via separate issues (there are multiple XLA distributed features to tackle). I'm not sure what to do for Megatron - do you have a suggestion there?
I'd move to close this issue as complete, pending Megatron. Not sure if there is something quick to triage there or if it's actually a big project.
Ray Train
Applying torchdynamo to the model after ray.train.torch.prepare_model seems to be the way to go, and it works without issue:
Full repro: ray_torch_fashion_mnist_example.py, taken from this Ray example.
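The ordering that worked can be sketched as follows. The Ray lines are shown as comments so the snippet runs without a Ray cluster; the plain `nn.Linear` stands in for the DDP-wrapped model that `prepare_model` would return.

```python
import torch
import torch.nn as nn

# In Ray Train (not executed here):
# import ray.train.torch
# model = ray.train.torch.prepare_model(nn.Linear(4, 1))  # wraps in DDP

model = nn.Linear(4, 1)  # stand-in for the prepared model
model = torch.compile(model, backend="eager")  # then apply dynamo

out = model(torch.randn(8, 4))
print(out.shape)  # torch.Size([8, 1])
```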