
GPU memory overloads mid-training epoch

See original GitHub issue

OK, here’s the thing.

I’ve trained a couple of models similar to CommonLanguage using the train.py file from that recipe. However, I keep running into the same problem, and I’d like to know if any of you can tell me why it happens.

When I start the training, it occupies a good chunk of the GPU’s memory (which is understandable). But as it advances through an epoch, the occupied memory grows a lot; sometimes it jumps by 1 GiB between nvidia-smi checks only a few seconds apart. As far as I can tell (and have been told), the memory usage should remain more or less stable, since audio files are loaded onto the GPU and then discarded. This ends up exhausting the memory, forcing me to train on fewer audio files than I’d like, which is probably why I can’t get a model that distinguishes between languages with decent precision.

Any suggestions as to why this happens?
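
A quick sanity check before digging deeper: nvidia-smi reports the memory PyTorch has reserved through its caching allocator, which almost never shrinks once it has grown, so it can climb in big steps even while the memory backing live tensors stays stable. Logging both numbers from inside the training loop makes it obvious which one is really growing. A minimal sketch, assuming a plain PyTorch loop (the train_loader, model, criterion and optimizer names are placeholders, not the recipe’s actual objects):

```python
import torch

def log_gpu_memory(tag: str) -> None:
    """Compare memory backing live tensors with memory held by the caching allocator."""
    allocated = torch.cuda.memory_allocated() / 2**30   # GiB used by live tensors
    reserved = torch.cuda.memory_reserved() / 2**30     # GiB cached by PyTorch (what nvidia-smi reflects)
    print(f"[{tag}] allocated={allocated:.2f} GiB  reserved={reserved:.2f} GiB")

# Hypothetical usage inside a training loop: if "allocated" stays flat while
# "reserved" keeps stepping up, the growth is the allocator caching larger and
# larger padded batches rather than a leak of tensors.
#
# for step, (wavs, labels) in enumerate(train_loader):
#     loss = criterion(model(wavs.cuda()), labels.cuda())
#     loss.backward()
#     optimizer.step()
#     optimizer.zero_grad()
#     if step % 50 == 0:
#         log_gpu_memory(f"step {step}")
```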

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6

Top GitHub Comments

2 reactions
Mlallena commented, Nov 2, 2021

OK, fixed that part and got it training again, and once more it thinks 90%+ of the test audios are the same language.

0 reactions
TParcollet commented, Oct 26, 2021

Right, I am not ultra familiar with this recipe, but the reason, IMHO, is simply that random batches are being created. Basically, the amount of VRAM in use depends on the size of the largest sentence encountered so far. Hence, the VRAM will jump every time a longer sentence is encountered.
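
That matches how padded batching usually works in speech pipelines: each batch is zero-padded to its longest utterance, so peak activation memory is set by whichever batch happens to contain the longest file seen so far. The toy collate function below (generic PyTorch, not the recipe’s actual code) shows how a single long clip inflates the whole batch tensor; many SpeechBrain recipes expose a sorting option in their hyperparameters for exactly this reason, since sorting or bucketing by duration makes the peak predictable from the first batches.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch):
    """Zero-pad variable-length waveforms to the longest item in the batch."""
    lengths = torch.tensor([wav.shape[0] for wav in batch])
    padded = pad_sequence(batch, batch_first=True)  # shape: (batch, max_len)
    return padded, lengths

# Eight ~1 s clips versus the same batch with one 30 s clip swapped in:
short_batch = [torch.randn(16_000) for _ in range(8)]
mixed_batch = short_batch[:-1] + [torch.randn(30 * 16_000)]

print(pad_collate(short_batch)[0].shape)  # torch.Size([8, 16000])
print(pad_collate(mixed_batch)[0].shape)  # torch.Size([8, 480000]) -> ~30x the data, and ~30x the activations
```

With random batching, the reserved memory reported by nvidia-smi keeps ratcheting up until the longest utterance in the dataset has been hit at least once, then plateaus; if it never plateaus and eventually runs out, something else (such as holding on to graph-attached tensors) is likely also in play.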

Read more comments on GitHub >

Top Results From Across the Web

Why would GPU memory always surge after training ... - GitHub
I use pytorch lightning to train a model but it always strangely fail at end: After validations completed, the trainer will start an...

Increase of GPU memory usage during training - Stack Overflow
The problem is likely because the gradients are being computed and stored in the validation loop. To solve that, perhaps the easiest way...

GPU memory surge after training epochs causing CUDA ...
I use pytorch lightning to train a model but it always strangely fail at end: After validations completed, the trainer will start an...

Pytorch bug and solution: vram usage increases for every epoch
bug: pytorch vram(GPU memory) usage keeps increasing for epochs during training after calling torch.utils.data.random_split().

GPU memory consumption increases while training
Hello, all I am new to Pytorch and I meet a strange GPU memory behavior ... def train(train_loader, model, criterion, optimizer, epoch): ...
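
The Stack Overflow result above points at another frequent culprit worth ruling out: an evaluation loop that runs with autograd enabled, so activations pile up on the GPU for a backward pass that never happens. SpeechBrain’s Brain class should already evaluate without gradients, so this mainly applies if a custom loop is involved; a minimal sketch of the usual fix, with model, valid_loader and criterion as placeholder names:

```python
import torch

def validate(model, valid_loader, criterion, device="cuda"):
    """Evaluation pass that never builds the autograd graph."""
    model.eval()
    total_loss, n_batches = 0.0, 0
    with torch.no_grad():  # activations are freed as soon as each forward pass ends
        for wavs, labels in valid_loader:
            wavs, labels = wavs.to(device), labels.to(device)
            total_loss += criterion(model(wavs), labels).item()  # accumulate a Python float, not a tensor
            n_batches += 1
    model.train()
    return total_loss / max(n_batches, 1)
```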
