Memory blowup with TPU Trainer in master
Environment info
- `transformers` version: 3.0.2 (master)
- Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.7.0a0+8fb7c50 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: Yes, TPU v2-8
Who can help
@sgugger @sshleifer @patrickvonplaten
Information
Recent changes to the Trainer for TPU have resulted in a memory blowup during training.
On a machine with 208 GB of RAM, this was the memory profile with the master branch from 20th August.
That profile only shows memory increasing during evaluation (which is a separate memory-leak bug, https://github.com/huggingface/transformers/issues/5509); if you throw enough RAM at the problem, it stays under control.
After the recent changes, the memory profile has become this.
Note how quickly the memory blows up, even on this huge machine. I have already implemented some memory-saving optimizations, caching only a single copy of the features in redis-server (see the sketch below), but that is no longer enough. The most interesting part is that the memory now also increases during training, not just during evaluation.
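For context, that caching setup is roughly the following. This is a minimal sketch: the `RedisFeatureDataset` class, the `feature:{idx}` key scheme, and the pickle serialization are illustrative assumptions rather than my exact code.

```python
import pickle

import redis
import torch
from torch.utils.data import Dataset


class RedisFeatureDataset(Dataset):
    """Keeps a single copy of the pre-computed features in redis-server
    instead of one in-memory copy per TPU process (illustrative sketch)."""

    def __init__(self, num_examples, host="localhost", port=6379):
        self.num_examples = num_examples
        self.db = redis.Redis(host=host, port=port)

    def __len__(self):
        return self.num_examples

    def __getitem__(self, idx):
        # Features are written once up front, e.g.:
        #   db.set(f"feature:{idx}", pickle.dumps({"input_ids": [...], ...}))
        features = pickle.loads(self.db.get(f"feature:{idx}"))
        return {name: torch.tensor(value) for name, value in features.items()}
```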
After these changes, the Trainer for TPUs has become unusable for training any practical model, and I would ask you to please look into fixing this.
Model I am using (Bert, XLNet …): T5
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The tasks I am working on are:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
Use the TPU example run_language_modeling.py to reproduce.
Expected behavior
Memory stays constant with the number of training and evaluation iterations.
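One way to check this is to log the resident memory of the trainer process every few hundred steps. The sketch below uses `psutil` (an extra dependency); the helper name and where it gets called from are my own choices, not part of the Trainer API.

```python
import os

import psutil


def log_rss(step, tag="train"):
    """Print this process's resident set size in GB.

    Call it every N steps in the training and evaluation loops; the value
    should stay roughly flat across iterations if there is no leak.
    """
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"[{tag}] step {step}: RSS = {rss_gb:.2f} GB")
```

If memory is leaking, the logged RSS grows steadily with the iteration count instead of staying flat.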
@misrasaurabh1 We just merged a simple fix for something that was obviously leaking memory during training (non-detached tensors) and that came from a recent change, so it might very well be the source of your leaks. Could you confirm whether or not current master still has the leak? If that fix resolved it, using the same fix in the evaluation loop should also take care of the eval memory leak we currently have.
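For anyone following along, the pattern being described looks roughly like the sketch below. This is a toy example illustrating detached vs. non-detached loss accumulation, not the actual Trainer code.

```python
import torch
from torch import nn

# Toy model and data, just to show the accumulation pattern.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [torch.randn(8, 10) for _ in range(100)]

total_loss = torch.tensor(0.0)
for batch in batches:
    loss = model(batch).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Leaky: `total_loss = total_loss + loss` keeps each step's autograd
    # graph (and, on TPU, the XLA tensors behind it) reachable, because the
    # running sum still requires grad.
    # Fixed: detach before accumulating so each step's graph can be freed.
    total_loss = total_loss + loss.detach()

print("mean loss:", (total_loss / len(batches)).item())
```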
Will look at the evaluation leak a bit more. From a first read, it looks like everything is properly detached, so it seems like this leak has another cause.
Thanks a lot for checking!