TorchElastic standalone test silently fails
Bug
The test case `plugins/environments/torch_elastic_deadlock.py` silently fails in our CI.
Output
Running plugins/environments/torch_elastic_deadlock.py
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
| Name | Type | Params
---------------------------------
0 | layer | Linear | 66
---------------------------------
66 Trainable params
0 Non-trainable params
66 Total params
0.000 Total estimated model params size (MB)
/__w/2/s/src/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 64 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
/__w/2/s/src/pytorch_lightning/trainer/trainer.py:1555: PossibleUserWarning: The number of training batches (5) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
rank_zero_warn(
/__w/2/s/src/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 64 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
[W reducer.cpp:1251] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1251] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 54410) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 765, in <module>
main()
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Observed on the latest commit on master.
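The exact contents of the test are not reproduced here, but the log above suggests a single `Linear(32, 2)` model (66 parameters) trained for a few batches on two GPUs with DDP and `find_unused_parameters=True` under the elastic launcher, printing a marker that the CI greps for. A minimal sketch along those lines, with hypothetical names `ToyModel` and `RandomDataset` (this is not the actual `plugins/environments/torch_elastic_deadlock.py`):

```python
import torch
from torch.utils.data import DataLoader, Dataset

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy


class RandomDataset(Dataset):
    """Hypothetical stand-in dataset producing random 32-dim samples."""

    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return torch.randn(32)


class ToyModel(pl.LightningModule):
    """Hypothetical stand-in model: a single Linear(32, 2) layer,
    i.e. 66 parameters as in the model summary above."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def validation_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,
        # find_unused_parameters=True matches the DDP warning in the log above
        strategy=DDPStrategy(find_unused_parameters=True),
        limit_train_batches=5,
        limit_val_batches=2,
        max_epochs=1,
        enable_progress_bar=False,
    )
    trainer.fit(
        ToyModel(),
        DataLoader(RandomDataset(), batch_size=8),
        DataLoader(RandomDataset(), batch_size=8),
    )
    # The CI job greps the output for this marker (see "Expected Behavior" below).
    # When a rank dies with exitcode -9 as above, the marker is never printed.
    if trainer.global_rank == 0:
        print("SUCCEEDED")
```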
Expected Behavior
The test should pass. Additionally, if the test fails as it does now, the failure should be reported in the CI. We currently grep the output for SUCCEEDED, but a missing match does not make the CI job fail.
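The exact CI script is not shown in this issue, but a stricter wrapper could look roughly like the hypothetical sketch below, which assumes the test is launched via `torch.distributed.run` in standalone mode:

```python
# Hypothetical CI wrapper: run the standalone test and fail the job explicitly
# when the SUCCEEDED marker is missing or the launcher exits non-zero.
import subprocess
import sys

cmd = [
    sys.executable,
    "-m",
    "torch.distributed.run",
    "--standalone",
    "--nnodes=1",
    "--nproc_per_node=2",
    "plugins/environments/torch_elastic_deadlock.py",
]
result = subprocess.run(cmd, capture_output=True, text=True)
output = result.stdout + result.stderr
print(output)

if result.returncode != 0 or "SUCCEEDED" not in output:
    sys.exit(1)
```

The same check could also treat the presence of "error" in the captured output as a failure, which is the approach proposed in the comments below.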
Note that even though we don’t have a reliable solution to surface these distributed issues, we could still fix this test in particular so that it passes again.
There’s nothing we can do other than fix the test and grep for “error” in its text output, as I did to resolve https://github.com/Lightning-AI/lightning/issues/12474.