
multi core, gpu training fails

See original GitHub issue

I am running NeMo ASR training on multiple GPUs with multiple workers, and it hangs at this step:

import pytorch_lightning as pl

trainer = pl.Trainer(gpus=4, num_nodes=8, accelerator='ddp',
                     max_epochs=200, amp_level='O1', precision=16)

At the console, the run simply hangs; a screenshot of the console output was attached to the original issue.

The same code works fine when using a single GPU and 1 core.

Is there any fix for this? My dataset is very large, so training would take very long on a single GPU and core.

Thanks
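
One likely cause of this kind of hang, for anyone hitting the same symptom: with gpus=4 and num_nodes=8, DDP expects 4 × 8 = 32 processes across the 8 nodes to rendezvous when the Trainer starts, and initialization blocks indefinitely if the processes on the other nodes were never launched. A minimal sketch for checking the rendezvous directly with torch.distributed, assuming PyTorch's standard env:// initialization (the launcher must set MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE for every process):

import os
import torch.distributed as dist

# These variables must be set by the launcher (Slurm, torchrun, etc.);
# the defaults below only make the script runnable as a single process.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
print(f"rank {rank}/{world_size}: waiting for all ranks to join")

# init_process_group blocks until all world_size processes have joined,
# so a permanent wait here means some ranks never started: the same
# symptom as the Trainer hang above. Use backend="gloo" on CPU-only nodes.
dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {rank}: rendezvous complete")
dist.destroy_process_group()

If this script hangs too, the problem is in how the job is launched, not in NeMo or Lightning.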

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 10 (3 by maintainers)

Top GitHub Comments

1 reaction
BenikaHall commented, Jul 16, 2021

@ericharper We have it resolved! By submitting the batch job in Slurm, we were able to get multi-GPU training working. We confirmed that the trainer was configured in the YAML file.

0 reactions
BenikaHall commented, Jul 9, 2021

Great. Will do.
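
For anyone reproducing the fix above: submitting through Slurm works because sbatch/srun start one task per GPU on every node and set the SLURM_* variables (SLURM_PROCID, SLURM_LOCALID, SLURM_NODEID, SLURM_NTASKS) that Lightning's Slurm integration reads to assign DDP ranks. A minimal sketch of the matching Trainer, assuming the Lightning 1.x API used in the issue and a typical submission such as sbatch --nodes=8 --ntasks-per-node=4 --gres=gpu:4 with srun python train.py (the script name is illustrative, not from the thread):

import pytorch_lightning as pl

# The Trainer arguments must match the Slurm allocation:
#   gpus=4      <-> --gres=gpu:4 and --ntasks-per-node=4
#   num_nodes=8 <-> --nodes=8
# Lightning detects the SLURM_* environment variables automatically,
# so no extra rank/world-size wiring is needed in the training script.
trainer = pl.Trainer(
    gpus=4,
    num_nodes=8,
    accelerator='ddp',
    max_epochs=200,
    amp_level='O1',
    precision=16,
)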


Top Results From Across the Web

Efficient Training on Multiple GPUs - Hugging Face
When training on a single GPU is too slow or the model weights don't fit in a single GPU's memory, we use a...

Multi-GPU training error (OOM) on keras (sufficient memory ...)
I am using keras to train my model on ImageNet2012. When I use a batch size of 256 on a single GPU, it...

Frequently Asked Questions — PyTorch 1.13 documentation
My model reports “cuda runtime error(2): out of memory”. As the error message suggests, you have run out of memory...

Graphics Processing Unit (GPU) - PyTorch Lightning
Lightning supports multiple ways of doing distributed training. ... To train on CPU/GPU/TPU without changing your code, we need to build a few...

Not enough memory error in multi-GPU training - Support
I do multi-GPU parallel training on a machine with 4 titan X pascal GPUs. I trained with 3 GPUs successfully. But when increased...
