How to debug with multi-gpu training
Hi, I am trying to debug multi-GPU training with PyCharm, but the multi-GPU training invokes the module torch.distributed.launch directly, and I could not find out how to debug that in PyCharm. I configured my run configuration this way:

[screenshot of the PyCharm run configuration not preserved]

but it threw the error 'No module named tools/train.py'.

Could you please help? I am trying to understand the code by debugging it.
Issue Analytics
- State:
- Created: 4 years ago
- Reactions: 2
- Comments: 6 (1 by maintainers)
Top Results From Across the Web
Debugging - Hugging Face
For multi-GPU training it requires DDP (torch.distributed.launch). This feature can be used with any nn.Module-based model. If you start ...

PyTorch 101, Part 4: Memory Management and Using Multiple ...
This article covers PyTorch's advanced GPU management features, how to optimise memory usage and best practices for debugging memory errors.

Profiling TensorFlow Multi GPU Multi Node Training Job with ...
This notebook walks you through creating a TensorFlow training job with the SageMaker Debugger profiling feature enabled. It will create a multi GPU...

Testing Multi GPU training on a Single GPU - PyTorch Lightning
I can only submit jobs to a cluster node (4-8 GPUs) and can't use the cluster for debugging. The code runs fine on a...

What's good practice for debugging distributed training? - Reddit
I think you answered your own question in terms of debugging with a single process + single GPU, then adjusting the parameters to...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I suggest using a single GPU for debugging. It is hard to debug in distributed training mode.
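If you still need to exercise the DDP code path while debugging, one workaround (not from this thread; a minimal sketch with placeholder names) is to skip the launcher and initialize a world-size-1 process group by hand, so the wrapped model runs in a single process that PyCharm can step through. The gloo backend is used here so the sketch also runs on a machine without a GPU:

```python
# Minimal sketch: run the DDP code path in ONE process so that an IDE
# debugger (e.g. PyCharm) can set breakpoints and step through normally.
# The linear model below is a toy stand-in for your real training code.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Hand-set the rendezvous variables that torch.distributed.launch
    # would normally export, then create a world-size-1 process group.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

    model = torch.nn.Linear(10, 1)   # placeholder model
    model = DDP(model)               # same wrapper the multi-GPU run uses

    out = model(torch.randn(4, 10))  # breakpoints work anywhere here
    out.sum().backward()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With a single process there are no rank-dependent hangs, so breakpoints behave exactly as in an ordinary script.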
Solution: Module name: /home/user/.local/lib/python3.6/site-packages/torch/distributed/launch.py, or wherever your torch.distributed.launch is…
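For reference (this is an assumption about current PyCharm versions, not something stated in the thread): in the Run/Debug Configuration you can switch the target from "Script path" to "Module name", enter torch.distributed.launch as the module, and put the launcher arguments plus the training script in the Parameters field, e.g. --nproc_per_node=2 tools/train.py followed by your training options. This mirrors running python -m torch.distributed.launch --nproc_per_node=2 tools/train.py <args> from a terminal. The original 'No module named tools/train.py' error is what you typically get when the script path is typed into the module-name field, since Python then tries to import tools/train.py as a module.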