GPU memory overloads mid-training epoch
OK, here’s the thing.
I’ve trained a couple of models similar to CommonLanguage using the same train.py file from that recipe. However, I keep running into the same problem, and I’d like to know if any of you can tell me why it happens.
When I start the training, it occupies a good chunk of the GPU’s memory (which is understandable), but as it advances through an epoch, the amount of occupied memory keeps growing (sometimes it even jumps by 1 GiB within a few seconds between nvidia-smi checks), even though, as far as I can tell (and have been told), the occupied memory should stay more or less stable as audio files are loaded onto the GPU and then discarded. This eventually makes the GPU run out of memory, forcing me to train on fewer audios than I’d like, which is probably the reason I can’t get a model that distinguishes between languages with decent accuracy.
Any suggestions as to why this happens?
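In case it helps diagnose this, here is a minimal sketch of how the growth could be logged from inside the training loop itself instead of relying on nvidia-smi timing. The hook point and step names are just placeholders for wherever train.py iterates over batches:

```python
import torch

def log_gpu_memory(step, device=0):
    """Print PyTorch's own view of GPU memory so a jump can be tied
    to a specific batch rather than to when nvidia-smi was polled."""
    allocated = torch.cuda.memory_allocated(device) / 1024**3
    reserved = torch.cuda.memory_reserved(device) / 1024**3
    peak = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"step {step}: allocated={allocated:.2f} GiB "
          f"reserved={reserved:.2f} GiB peak={peak:.2f} GiB")

# Hypothetical usage inside whatever loop train.py runs over batches:
# for step, batch in enumerate(dataloader):
#     loss = compute_loss(batch)  # placeholder for the recipe's forward/backward
#     if step % 50 == 0:
#         log_gpu_memory(step)
```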
Top GitHub Comments
OK, I fixed that part and managed to get it training, and once more it thinks 90%+ of the test audios are the same language.
Right, I am not ultra familiar with this recipe, but the reason for it, imho, is simply that random batches are being created. Basically, the current amount of VRAM depends on the size of the longest sentence encountered so far, so the VRAM will jump every time a longer sentence is encountered.
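To make that concrete: with pad-to-longest collation, the padded batch shape is set by the longest utterance that happens to land in the batch, so with random batching the peak memory only ratchets upward as ever-longer utterances show up. Below is a minimal sketch of the effect and of the usual mitigation (grouping utterances of similar duration before batching); the names and sizes are illustrative, not the recipe's actual code:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Toy "dataset": 1D waveforms of very different lengths (illustrative only).
lengths = [16_000, 24_000, 480_000, 32_000, 20_000, 640_000]
waveforms = [torch.randn(n) for n in lengths]

def collate(batch):
    # Pad every utterance in the batch up to the longest one in that batch.
    return pad_sequence(batch, batch_first=True)

# Random batching: a single long utterance inflates the whole padded tensor.
random_batch = collate([waveforms[0], waveforms[5]])  # padded to 640k samples
short_batch = collate([waveforms[0], waveforms[4]])   # padded to 24k samples
print(random_batch.shape, short_batch.shape)

# Mitigation: batch utterances of similar length together, so the padded
# size (and hence VRAM) stays close to the actual audio length.
order = sorted(range(len(waveforms)), key=lambda i: waveforms[i].numel())
sorted_batches = [collate([waveforms[i] for i in order[j:j + 2]])
                  for j in range(0, len(order), 2)]
for b in sorted_batches:
    print(b.shape)
```

If the recipe exposes a sorting or bucketing option in its hparams (many SpeechBrain recipes do), switching it away from fully random batching, or capping the maximum utterance duration, is usually enough to flatten the memory curve.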