
How to debug with multi-GPU training

See original GitHub issue

Hi, I am trying to debug multi-GPU training with PyCharm. But the multi-GPU training directly calls the module torch.distributed.launch, and I couldn't find out how to debug it in PyCharm. I configured it this way (screenshot of the PyCharm run configuration), but it threw the error 'no module named tools/train.py'.

Could you please help? I am trying to understand the code by debugging it.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 2
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

4 reactions
hellock commented, Jul 15, 2019

I suggest using a single GPU for debugging. It is hard to debug in distributed training mode.
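Building on that suggestion, one common pattern (a hypothetical sketch, not from the thread) is to give defaults to the environment variables that `torch.distributed.launch` would otherwise set, so the same training script can also run as a plain single process under the PyCharm debugger. The `LOCAL_RANK` and `WORLD_SIZE` names assume the launcher is used with `--use_env` (as in newer PyTorch versions):

```python
import os

# Hypothetical sketch (not from the thread): fall back to single-process
# defaults for the env vars that `torch.distributed.launch --use_env` sets,
# so the script runs unchanged under an IDE debugger without the launcher.

def local_rank() -> int:
    # Rank of this process on the local machine; 0 when run without a launcher.
    return int(os.environ.get("LOCAL_RANK", "0"))

def world_size() -> int:
    # Total number of processes; 1 when run without a launcher.
    return int(os.environ.get("WORLD_SIZE", "1"))

if __name__ == "__main__":
    print(f"rank {local_rank()} of {world_size()}")
```

Run directly from PyCharm, this reports rank 0 of 1 and the whole script stays debuggable in one process; launched with `torch.distributed.launch`, the env vars take over.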

2 reactions
Dejan1969 commented, Mar 15, 2020

Solution: set Module name to /home/user/.local/lib/python3.6/site-packages/torch/distributed/launch.py, or wherever your torch.distributed.launch is…
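The PyCharm setting above is equivalent to invoking the launcher as a module on the command line, roughly as follows (a sketch: the script name tools/train.py comes from the original question; the flag values are assumptions for a single-process debug run):

```shell
# Run the launcher as a module (what PyCharm's "Module name" field does),
# passing the training script as its argument instead of as the entry module.
# --nproc_per_node=1 keeps it to one process, which is easier to debug.
python -m torch.distributed.launch --nproc_per_node=1 tools/train.py
```

In other words, the training script must be an *argument* to the launcher module, not the module itself, which is why pointing "Module name" at tools/train.py fails.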


Top Results From Across the Web

Debugging - Hugging Face
For multi-GPU training it requires DDP ( torch.distributed.launch ). This feature can be used with any nn.Module -based model. If you start ...
PyTorch 101, Part 4: Memory Management and Using Multiple ...
This article covers PyTorch's advanced GPU management features, how to optimise memory usage and best practices for debugging memory errors.
Profiling TensorFlow Multi GPU Multi Node Training Job with ...
This notebook walks you through creating a TensorFlow training job with the SageMaker Debugger profiling feature enabled. It will create a multi GPU...
Testing Multi GPU training on a Single GPU - PyTorch Lightning
I can only submit jobs to a cluster node (4-8GPUs) and can't use the cluster for debugging. The code runs fine on a...
What's good practice for debugging distributed training? - Reddit
I think you answered your own question in terms of debugging with a single process + single GPU, then adjusting the parameters to...
