
GPU memory overloads mid-training epoch

See original GitHub issue

OK, here’s the thing.

I’ve trained a couple of models similar to CommonLanguage using the train.py file from that recipe. However, I keep running into the same problem, and I’d like to know if any of you can tell me why it happens.

When I start the training, it occupies a good chunk of the GPU’s memory (which is understandable). But as it advances through an epoch, the occupied memory grows a lot; sometimes it jumps by 1 GiB between nvidia-smi checks only a few seconds apart. As far as I can tell (and have been told), the memory usage should remain more or less stable, since audio files are loaded onto the GPU and then discarded. This ends up exhausting the memory, forcing me to train on fewer audio files than I’d like, which is probably why I can’t get a model that distinguishes between languages with decent precision.

Any suggestions as to why this happens?
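
A quick sanity check before digging deeper: nvidia-smi reports the memory PyTorch has reserved through its caching allocator, which almost never shrinks once it has grown, so it can climb in big steps even while the memory backing live tensors stays stable. Logging both numbers from inside the training loop makes it obvious which one is really growing. A minimal sketch, assuming a plain PyTorch loop (the train_loader, model, criterion and optimizer names are placeholders, not the recipe’s actual objects):

```python
import torch

def log_gpu_memory(tag: str) -> None:
    """Compare memory backing live tensors with memory held by the caching allocator."""
    allocated = torch.cuda.memory_allocated() / 2**30   # GiB used by live tensors
    reserved = torch.cuda.memory_reserved() / 2**30     # GiB cached by PyTorch (what nvidia-smi reflects)
    print(f"[{tag}] allocated={allocated:.2f} GiB  reserved={reserved:.2f} GiB")

# Hypothetical usage inside a training loop: if "allocated" stays flat while
# "reserved" keeps stepping up, the growth is the allocator caching larger and
# larger padded batches rather than a leak of tensors.
#
# for step, (wavs, labels) in enumerate(train_loader):
#     loss = criterion(model(wavs.cuda()), labels.cuda())
#     loss.backward()
#     optimizer.step()
#     optimizer.zero_grad()
#     if step % 50 == 0:
#         log_gpu_memory(f"step {step}")
```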

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6

Top GitHub Comments

2 reactions
Mlallena commented, Nov 2, 2021

OK, fixed that part and got it training again, and once more it thinks 90%+ of the test audios are the same language.

0 reactions
TParcollet commented, Oct 26, 2021

Right, I am not ultra familiar with this recipe, but the reason, IMHO, is simply that random batches are being created. Basically, the amount of VRAM in use depends on the size of the largest sentence encountered so far. Hence, the VRAM will jump every time a longer sentence is encountered.
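
That matches how padded batching usually works in speech pipelines: each batch is zero-padded to its longest utterance, so peak activation memory is set by whichever batch happens to contain the longest file seen so far. The toy collate function below (generic PyTorch, not the recipe’s actual code) shows how a single long clip inflates the whole batch tensor; many SpeechBrain recipes expose a sorting option in their hyperparameters for exactly this reason, since sorting or bucketing by duration makes the peak predictable from the first batches.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch):
    """Zero-pad variable-length waveforms to the longest item in the batch."""
    lengths = torch.tensor([wav.shape[0] for wav in batch])
    padded = pad_sequence(batch, batch_first=True)  # shape: (batch, max_len)
    return padded, lengths

# Eight ~1 s clips versus the same batch with one 30 s clip swapped in:
short_batch = [torch.randn(16_000) for _ in range(8)]
mixed_batch = short_batch[:-1] + [torch.randn(30 * 16_000)]

print(pad_collate(short_batch)[0].shape)  # torch.Size([8, 16000])
print(pad_collate(mixed_batch)[0].shape)  # torch.Size([8, 480000]) -> ~30x the data, and ~30x the activations
```

With random batching, the reserved memory reported by nvidia-smi keeps ratcheting up until the longest utterance in the dataset has been hit at least once, then plateaus; if it never plateaus and eventually runs out, something else (such as holding on to graph-attached tensors) is likely also in play.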

Read more comments on GitHub >

Top Results From Across the Web

Why would GPU memory always surge after training ... - GitHub
I use pytorch lightning to train a model but it always strangely fail at end: After validations completed, the trainer will start an...

Increase of GPU memory usage during training - Stack Overflow
The problem is likely because the gradients are being computed and stored in the validation loop. To solve that, perhaps the easiest way...

GPU memory surge after training epochs causing CUDA ...
I use pytorch lightning to train a model but it always strangely fail at end: After validations completed, the trainer will start an...

Pytorch bug and solution: vram usage increases for every epoch
bug: pytorch vram(GPU memory) usage keeps increasing for epochs during training after calling torch.utils.data.random_split().

GPU memory consumption increases while training
Hello, all I am new to Pytorch and I meet a strange GPU memory behavior ... def train(train_loader, model, criterion, optimizer, epoch): ...
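
The Stack Overflow result above points at another frequent culprit worth ruling out: an evaluation loop that runs with autograd enabled, so activations pile up on the GPU for a backward pass that never happens. SpeechBrain’s Brain class should already evaluate without gradients, so this mainly applies if a custom loop is involved; a minimal sketch of the usual fix, with model, valid_loader and criterion as placeholder names:

```python
import torch

def validate(model, valid_loader, criterion, device="cuda"):
    """Evaluation pass that never builds the autograd graph."""
    model.eval()
    total_loss, n_batches = 0.0, 0
    with torch.no_grad():  # activations are freed as soon as each forward pass ends
        for wavs, labels in valid_loader:
            wavs, labels = wavs.to(device), labels.to(device)
            total_loss += criterion(model(wavs), labels).item()  # accumulate a Python float, not a tensor
            n_batches += 1
    model.train()
    return total_loss / max(n_batches, 1)
```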
