
PyTorch Distributed: next level of tests


The HuggingFace accelerate supported-integrations list gives a good overview of all the libraries that need to be tested before we can claim that dynamo/inductor works fine both on a single node and in the distributed setting: https://github.com/huggingface/accelerate#supported-integrations.

So far @wconstab’s efforts have been focused on getting DDP and FSDP to work, but we should also test more end-to-end libraries that wrap DDP or FSDP.

Here’s an example from PyTorch Lightning. Does this already work? Probably, but it might be worth testing anyway.

Lightning

trainer = Trainer(devices=4, accelerator="gpu", strategy="ddp")

Specifically for Lightning, I’m curious whether its callbacks have a strange interaction with an optimized nn.Module.
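
As a rough sketch of what such a test could look like (the toy module and random dataset here are invented for illustration, and this assumes the pytorch_lightning import path and the 4 GPUs from the snippet above), one conservative option is to apply the dynamo decorator to the inner nn.Module and let Lightning’s DDP strategy wrap the LightningModule around it; whether decorating the whole LightningModule also plays nicely with Trainer and its callbacks is exactly what would need testing:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch._dynamo
    import pytorch_lightning as pl
    from torch.utils.data import DataLoader, TensorDataset

    class ToyModule(pl.LightningModule):
        def __init__(self):
            super().__init__()
            # Optimize only the inner nn.Module so the outer object is still a
            # plain LightningModule that Trainer and callbacks can inspect.
            self.net = torch._dynamo.optimize()(nn.Linear(32, 2))

        def training_step(self, batch, batch_idx):
            x, y = batch
            return F.cross_entropy(self.net(x), y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)

    data = DataLoader(
        TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))), batch_size=8
    )
    trainer = pl.Trainer(devices=4, accelerator="gpu", strategy="ddp", max_epochs=1)
    trainer.fit(ToyModule(), data)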

Accelerate

Another one is from the accelerate repo itself, where we have the popular pattern def prepare(nn.Module) -> nn.Module:

model, optimizer, data = accelerator.prepare(model, optimizer, data)

In this case a top-level application of the dynamo decorator might optimize both the model that still needs to be prepared and the final, prepared model. Should we instead make it clearer in the documentation that library maintainers need to apply the dynamo decorator inside their prepare function? And in that case, what happens if a user wraps a model with a dynamo decorator when it is already wrapped with another dynamo decorator?
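
To make the two orderings concrete, here is a hedged sketch (the toy model, optimizer, and data are invented for illustration; accelerator.prepare is the real accelerate API from the snippet above):

    import torch
    import torch.nn as nn
    import torch._dynamo
    from accelerate import Accelerator
    from torch.utils.data import DataLoader, TensorDataset

    accelerator = Accelerator()
    model = nn.Linear(4, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    data = DataLoader(TensorDataset(torch.randn(32, 4), torch.randn(32, 1)), batch_size=8)

    # Ordering 1: optimize the raw model and let prepare() wrap the optimized module
    # (e.g. with DDP). This is roughly what a top-level dynamo decorator would do.
    # model = torch._dynamo.optimize()(model)

    model, optimizer, data = accelerator.prepare(model, optimizer, data)

    # Ordering 2: optimize the prepared (possibly DDP-wrapped) model. If a user does
    # both, the model ends up wrapped by two dynamo decorators, which is the
    # double-wrapping case the question above is asking about.
    model = torch._dynamo.optimize()(model)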

Ray

Ray Train has a similar wrapper that prepares models by wrapping them with DDP:

model = ray.train.torch.prepare_model(nn.Linear(4, 1))

torchrun

Finally, end-to-end workflows might include scripts that set up a cluster, SSH into machines, and set some environment variables. As a good proxy, having an example of torchrun train_script.py working would de-risk things quite a bit in more real-world scenarios.
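
A minimal train_script.py sketch for that proxy might look like the following (the toy model and training loop are invented; it assumes one GPU per process and the standard environment variables that torchrun sets), launched with something like torchrun --nproc_per_node=2 train_script.py:

    import os
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    import torch._dynamo
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
        # so env:// initialization needs no extra arguments.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1)).cuda(local_rank)
        model = DDP(model, device_ids=[local_rank])
        # Apply dynamo after the DDP wrap, mirroring the Ray Train result below.
        model = torch._dynamo.optimize()(model)

        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        for _ in range(5):
            x = torch.randn(8, 16, device=f"cuda:{local_rank}")
            loss = model(x).pow(2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()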

Next steps

Once the above workflows are nailed down, it then makes sense to look at supporting DeepSpeed, XLA, and Megatron. Specifically for DeepSpeed, should we for now just recommend that users optimize the model before applying the wrapping strategy?
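
For what "optimize the model pre-wrapping" could mean in practice, here is a rough, untested sketch (the toy model and DeepSpeed config are invented for illustration, and it assumes the standard deepspeed.initialize entry point):

    import torch
    import torch.nn as nn
    import torch._dynamo
    import deepspeed

    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
    # Apply dynamo to the plain nn.Module before DeepSpeed wraps it in its engine.
    model = torch._dynamo.optimize()(model)

    ds_config = {
        "train_batch_size": 8,
        "optimizer": {"type": "Adam", "params": {"lr": 0.001}},
    }
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )
    # Whether the DeepSpeed engine interacts cleanly with an already-optimized
    # module is exactly what a dedicated DeepSpeed issue would need to track.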


Top GitHub Comments

wconstab commented on Nov 16, 2022

@msaroufim I think we should start a new issue for DeepSpeed and start working with the msft team (I sent an opener out on Slack, will follow up). XLA enablement also makes sense to track via separate issues (there are multiple XLA distributed features to tackle). I’m not sure what to do for Megatron - do you have a suggestion there?

I’d move to close this issue as complete, pending Megatron. Not sure if there is something quick to triage there or it’s actually a big project.

wconstab commented on Nov 16, 2022

Ray Train

Applying torchdynamo to the model after Ray Train’s prepare_model seems to be the way to go, and it works without issue:

    # Excerpt from the training function of the Ray fashion-MNIST example referenced
    # below; NeuralNetwork and train come from that script.
    import logging
    import torch._dynamo

    # Create model and let Ray Train wrap it (with DDP under the hood).
    model = NeuralNetwork()
    model = train.torch.prepare_model(model)

    # Enable dynamo's DDP optimizer and verbose logging, then compile the wrapped model.
    torch._dynamo.config.optimize_ddp = True
    torch._dynamo.config.log_level = logging.INFO
    model = torch._dynamo.optimize()(model)

Example output:

(RayTrainWorker pid=98151) [2022-11-16 00:48:46,998] torch._dynamo.optimizations.distributed: [INFO] DDPOptimizer used bucket cap 26214400 and produced the following buckets:
(RayTrainWorker pid=98151) [2022-11-16 00:48:46,999] torch._dynamo.optimizations.distributed: [INFO]
(RayTrainWorker pid=98151) DDPOptimizer bucket assignments
(RayTrainWorker pid=98151) ┌─────────┬────────────┬─────────────────────────────────┐
(RayTrainWorker pid=98151) │   Index │   Size (b) │ Param Names                     │
(RayTrainWorker pid=98151) ├─────────┼────────────┼─────────────────────────────────┤
(RayTrainWorker pid=98151) │       0 │    1071144 │ self_linear_relu_stack_4_weight │
(RayTrainWorker pid=98151) │         │            │ self_linear_relu_stack_4_bias   │
(RayTrainWorker pid=98151) │         │            │ self_linear_relu_stack_2_weight │
(RayTrainWorker pid=98151) │         │            │ self_linear_relu_stack_2_bias   │
(RayTrainWorker pid=98151) ├─────────┼────────────┼─────────────────────────────────┤
(RayTrainWorker pid=98151) │       1 │    1607680 │ self_linear_relu_stack_0_weight │
(RayTrainWorker pid=98151) │         │            │ self_linear_relu_stack_0_bias   │
(RayTrainWorker pid=98151) └─────────┴────────────┴─────────────────────────────────┘

Full repro: ray_torch_fashion_mnist_example.py, taken from this Ray example.


