Memory blowup with TPU Trainer in master
Environment info
- `transformers` version: 3.0.2 (master)
- Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.7.0a0+8fb7c50 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: Yes, TPU v2-8
Who can help
@sgugger @sshleifer @patrickvonplaten
Information
Recent changes to the Trainer for TPU have resulted in a memory blowup during training.
On a machine with 208 GB of RAM, this was the memory profile with the master branch from 20th August.
That profile only shows memory increasing during evaluation (which is a separate memory-leak bug, https://github.com/huggingface/transformers/issues/5509); if you throw enough RAM at the problem, it stays under control.
After the recent changes, the memory profile has become this.
Note how quickly the memory blows up, even on this huge machine. I have already implemented some memory-saving optimizations, caching only a single copy of the features in redis-server (see the sketch below), but that is no longer enough. The most interesting part is that the memory now also increases during training, not just during evaluation.
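For context, that caching setup is roughly the following. This is a minimal sketch: the `RedisFeatureDataset` class, the `feature:{idx}` key scheme, and the pickle serialization are illustrative assumptions rather than my exact code.

```python
import pickle

import redis
import torch
from torch.utils.data import Dataset


class RedisFeatureDataset(Dataset):
    """Keeps a single copy of the pre-computed features in redis-server
    instead of one in-memory copy per TPU process (illustrative sketch)."""

    def __init__(self, num_examples, host="localhost", port=6379):
        self.num_examples = num_examples
        self.db = redis.Redis(host=host, port=port)

    def __len__(self):
        return self.num_examples

    def __getitem__(self, idx):
        # Features are written once up front, e.g.:
        #   db.set(f"feature:{idx}", pickle.dumps({"input_ids": [...], ...}))
        features = pickle.loads(self.db.get(f"feature:{idx}"))
        return {name: torch.tensor(value) for name, value in features.items()}
```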
After these changes, the Trainer for TPUs has become unusable for training any practical model, and I would ask you to please look into fixing this.
Model I am using (Bert, XLNet …): T5
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The tasks I am working on are:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
Use the TPU example run_language_modeling.py to reproduce.
Expected behavior
Memory stays constant with the number of training and evaluation iterations.
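One way to check this is to log the resident memory of the trainer process every few hundred steps. The sketch below uses `psutil` (an extra dependency); the helper name and where it gets called from are my own choices, not part of the Trainer API.

```python
import os

import psutil


def log_rss(step, tag="train"):
    """Print this process's resident set size in GB.

    Call it every N steps in the training and evaluation loops; the value
    should stay roughly flat across iterations if there is no leak.
    """
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"[{tag}] step {step}: RSS = {rss_gb:.2f} GB")
```

If memory is leaking, the logged RSS grows steadily with the iteration count instead of staying flat.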
@misrasaurabh1 We just merged a simple fix for something that was obviously leaking memory during training (non-detached tensors) and that came from a recent change, so it might very well be the source of your leaks. Could you confirm whether or not current master still has the leak? If that fix resolved it, using the same fix in the evaluation loop should also take care of the eval memory leak we currently have.
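For anyone following along, the pattern being described looks roughly like the sketch below. This is a toy example illustrating detached vs. non-detached loss accumulation, not the actual Trainer code.

```python
import torch
from torch import nn

# Toy model and data, just to show the accumulation pattern.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [torch.randn(8, 10) for _ in range(100)]

total_loss = torch.tensor(0.0)
for batch in batches:
    loss = model(batch).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Leaky: `total_loss = total_loss + loss` keeps each step's autograd
    # graph (and, on TPU, the XLA tensors behind it) reachable, because the
    # running sum still requires grad.
    # Fixed: detach before accumulating so each step's graph can be freed.
    total_loss = total_loss + loss.detach()

print("mean loss:", (total_loss / len(batches)).item())
```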
Will look at the evaluation leak a bit more. From a first read, it looks like everything is properly detached, so it seems like this leak has another cause.
Thanks a lot for checking!