
multi core, gpu training fails

See original GitHub issue

I am running NeMo ASR training on multiple GPUs with multiple workers, and it hangs at this step:

import pytorch_lightning as pl

trainer = pl.Trainer(gpus=4, num_nodes=8, accelerator='ddp',
                     max_epochs=200, amp_level='O1', precision=16)

At the console, the run simply hangs; a screenshot of the console output was attached to the original issue.

The same code works fine when using a single GPU and 1 core.

Is there any fix for this? My dataset is very large, so training would take very long on a single GPU and core.

Thanks
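
One likely cause of this kind of hang, for anyone hitting the same symptom: with gpus=4 and num_nodes=8, DDP expects 4 × 8 = 32 processes across the 8 nodes to rendezvous when the Trainer starts, and initialization blocks indefinitely if the processes on the other nodes were never launched. A minimal sketch for checking the rendezvous directly with torch.distributed, assuming PyTorch's standard env:// initialization (the launcher must set MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE for every process):

import os
import torch.distributed as dist

# These variables must be set by the launcher (Slurm, torchrun, etc.);
# the defaults below only make the script runnable as a single process.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
print(f"rank {rank}/{world_size}: waiting for all ranks to join")

# init_process_group blocks until all world_size processes have joined,
# so a permanent wait here means some ranks never started: the same
# symptom as the Trainer hang above. Use backend="gloo" on CPU-only nodes.
dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {rank}: rendezvous complete")
dist.destroy_process_group()

If this script hangs too, the problem is in how the job is launched, not in NeMo or Lightning.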

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 10 (3 by maintainers)

Top GitHub Comments

1 reaction
BenikaHall commented, Jul 16, 2021

@ericharper We have it resolved! By submitting the batch job in Slurm, we were able to get multi-GPU training working. We confirmed that the trainer was configured in the YAML file.

0 reactions
BenikaHall commented, Jul 9, 2021

Great. Will do.
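
For anyone reproducing the fix above: submitting through Slurm works because sbatch/srun start one task per GPU on every node and set the SLURM_* variables (SLURM_PROCID, SLURM_LOCALID, SLURM_NODEID, SLURM_NTASKS) that Lightning's Slurm integration reads to assign DDP ranks. A minimal sketch of the matching Trainer, assuming the Lightning 1.x API used in the issue and a typical submission such as sbatch --nodes=8 --ntasks-per-node=4 --gres=gpu:4 with srun python train.py (the script name is illustrative, not from the thread):

import pytorch_lightning as pl

# The Trainer arguments must match the Slurm allocation:
#   gpus=4      <-> --gres=gpu:4 and --ntasks-per-node=4
#   num_nodes=8 <-> --nodes=8
# Lightning detects the SLURM_* environment variables automatically,
# so no extra rank/world-size wiring is needed in the training script.
trainer = pl.Trainer(
    gpus=4,
    num_nodes=8,
    accelerator='ddp',
    max_epochs=200,
    amp_level='O1',
    precision=16,
)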


Top Results From Across the Web

Efficient Training on Multiple GPUs - Hugging Face
When training on a single GPU is too slow or the model weights don't fit in a single GPU's memory, we use a...

Multi-GPU training error (OOM) on keras (sufficient memory ...)
I am using keras to train my model on ImageNet2012. When I use a batch size of 256 on a single GPU, it...

Frequently Asked Questions — PyTorch 1.13 documentation
My model reports “cuda runtime error(2): out of memory”. As the error message suggests, you have run out of memory...

Graphics Processing Unit (GPU) - PyTorch Lightning
Lightning supports multiple ways of doing distributed training. ... To train on CPU/GPU/TPU without changing your code, we need to build a few...

Not enough memory error in multi-GPU training - Support
I do multi-GPU parallel training on a machine with 4 titan X pascal GPUs. I trained with 3 GPUs successfully. But when increased...
