Multi-core, multi-GPU training fails
See original GitHub issue

I am doing NeMo ASR training on multiple GPUs and workers, and it hangs at the console on this step:

trainer = pl.Trainer(gpus=4, num_nodes=8, accelerator='ddp', max_epochs=200, amp_level='O1', precision=16,
The same code works fine when using a single GPU and one worker.
Is there a fix for this? My dataset is very large, so training would take very long on a single GPU and worker.
Thanks
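For context, here is a minimal sketch (not taken from the issue itself) of the two setups described above, written against the PyTorch Lightning 1.x-era API that the snippet uses; arguments such as gpus, accelerator='ddp', and amp_level are deprecated or removed in newer Lightning releases. With gpus=4 and num_nodes=8, DDP initialization waits for all 32 processes to rendezvous, so if the script is launched as a single process the Trainer call appears to hang at the console.

```python
import pytorch_lightning as pl

# Single-GPU run that works for the reporter: one process, no rendezvous needed.
trainer_single = pl.Trainer(gpus=1, max_epochs=200, precision=16)

# Multi-node run from the issue: Lightning expects 4 GPUs on each of 8 nodes,
# i.e. 32 DDP processes in total. If the other ranks are never started
# (e.g. the script is run as a plain `python train.py` on one machine),
# distributed initialization blocks indefinitely, which looks like a hang here.
trainer_multi = pl.Trainer(
    gpus=4,
    num_nodes=8,
    accelerator="ddp",
    max_epochs=200,
    amp_level="O1",  # mirrors the issue's snippet; only used with the Apex backend
    precision=16,
)
```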
Issue Analytics
- State:
- Created 2 years ago
- Comments: 10 (3 by maintainers)
Top Results From Across the Web
Efficient Training on Multiple GPUs - Hugging Face
When training on a single GPU is too slow or the model weights don't fit in a single GPU's memory, we use a...

Multi-GPU training error (OOM) on Keras (sufficient memory ...)
I am using Keras to train my model on ImageNet2012. When I use a batch size of 256 on a single GPU, it...

Frequently Asked Questions — PyTorch 1.13 documentation
My model reports "cuda runtime error(2): out of memory". As the error message suggests, you have run out of memory...

Graphics Processing Unit (GPU) - PyTorch Lightning
Lightning supports multiple ways of doing distributed training. ... To train on CPU/GPU/TPU without changing your code, we need to build a few...

Not enough memory error in multi-GPU training - Support
I do multi-GPU parallel training on a machine with 4 Titan X Pascal GPUs. I trained with 3 GPUs successfully. But when I increased...
@ericharper We have it resolved! By submitting the batch job in Slurm, we were able to get multi-GPU training working. We confirmed that the trainer was configured in the YAML file.
Great. Will do.
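For anyone hitting the same hang, here is a rough sketch of what the resolution above could look like in code. It is an assumption-laden illustration rather than the reporter's actual script: the file name config.yaml and its trainer keys are placeholders, the OmegaConf usage mirrors the common NeMo pattern of keeping the pl.Trainer arguments in a YAML config, and SLURM_NNODES is the environment variable Slurm sets inside an sbatch allocation.

```python
import os

import pytorch_lightning as pl
from omegaconf import OmegaConf

# Hypothetical YAML with a `trainer` section, e.g.:
# trainer:
#   gpus: 4
#   num_nodes: 8
#   accelerator: ddp
#   max_epochs: 200
#   precision: 16
cfg = OmegaConf.load("config.yaml")

# Inside an sbatch allocation, SLURM_NNODES reports how many nodes the job
# actually received. Keeping num_nodes in sync with it avoids a DDP
# rendezvous that waits for processes that will never start.
cfg.trainer.num_nodes = int(os.environ.get("SLURM_NNODES", cfg.trainer.num_nodes))

trainer = pl.Trainer(**cfg.trainer)
```

Typically the sbatch script then launches the training script with srun so that one task is created per GPU on every node, which is what allows all of the DDP ranks to rendezvous instead of hanging.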