
Update examples/documentation for torch.distributed.run

See original GitHub issue

Most examples currently read the local rank from a CLI argument, --local_rank. This is required for them to work with the torch launch utility torch.distributed.launch. However, in recent torch versions launch is deprecated; the recommended way to run distributed code from the CLI is now torch.distributed.run. The difference is that instead of passing the local rank as a CLI argument, the new launcher sets it as the LOCAL_RANK environment variable. The examples therefore need to be updated to read the environment variable instead of the CLI argument.

See https://pytorch.org/docs/stable/elastic/run.html#launcher-api for more.
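
As a rough sketch of the change being requested (the exact edits to each example script will differ), the old and new ways of obtaining the local rank look roughly like this; LOCAL_RANK is the variable documented in the launcher API linked above:

import argparse
import os

# Old: torch.distributed.launch passes the rank on the command line.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()
local_rank = args.local_rank

# New: torch.distributed.run exports it as an environment variable instead.
local_rank = int(os.environ["LOCAL_RANK"])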

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction
BramVanroy commented, Nov 11, 2021

Definitely. I think it’s a good idea to keep bumping this to open, so that others can also find this issue easily, and it can serve as a reminder that as soon as launch is obsolete, the docs need an update. It’s a tiny thing and easy to forget, so an open issue may help to remember as time goes on.

1 reaction
sgugger commented, Oct 6, 2021

I wouldn’t update the documentation, as it won’t work for older versions of PyTorch, and I would prefer to only put one command vs. “if your PyTorch version is this then xxx, else yyy”. Let’s wait a bit more 😃
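
A sketch (not necessarily the approach the maintainers ended up taking) of how a single command could keep working across both launchers, by falling back from the environment variable to the CLI argument:

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

# Prefer LOCAL_RANK (set by torch.distributed.run) and fall back to the
# --local_rank argument passed by the deprecated torch.distributed.launch.
local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))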

Read more comments on GitHub >

