--fp16 fine-tuning appears to be taking more memory (4.3.0).
Environment info
- transformers version: 4.3.0.dev0
- Platform: Linux-5.4.0-62-generic-x86_64-with-debian-bullseye-sid
- Python version: 3.7.6
- PyTorch version (GPU?): 1.7.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: 4x A100-SXM4-40GB
- Using distributed or parallel set-up in script?: Yes
Information
Model I am using (Bert, XLNet …): T5
The problem arises when using:
- my own modified scripts (details below): the official trainer script, optionally modified by adding model.parallelize() after loading the model. Results are shown with and without it; a minimal sketch of the modification follows this list.
The task I am working on is:
- an official GLUE/SQuAD task: regular seq2seq on the data.
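For reference, a minimal sketch of that modification (the checkpoint name here is illustrative; in finetune_trainer.py the model actually comes from --model_name_or_path):

```python
from transformers import T5ForConditionalGeneration

# Sketch of the only change: call parallelize() right after loading.
# With no device_map argument, T5 spreads its blocks evenly across all
# visible GPUs.
model = T5ForConditionalGeneration.from_pretrained("t5-large")
model.parallelize()
```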
Run script:
export BS=1; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1,2,3 ./finetune_trainer.py --model_name_or_path t5-large --output_dir output_dir --adam_eps 1e-06 --data_dir xsum --fp16 \
--do_train --learning_rate 3e-5 \
--logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 \
--overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS \
--warmup_steps 5
Brief summary
- When fine-tuning T5, I'm observing memory usage increase when using --fp16, though not by as much as previously reported in #8403.
- (Optional) Possibly related: I'm trying to squeeze T5-11B onto 4x 40GB A100s using model parallelism. I seemed to be able to do it yesterday on 4.1.1 with a sequence length of 128, and I remember observing a fairly moderate dependence of memory usage on sequence length (as expected from the comment at https://github.com/huggingface/transformers/issues/8771#issuecomment-764058315, though I'm not an expert and I'm not sure whether that increase only applies beyond 512 tokens, or whether what I saw yesterday was a fluke/error on my part). Today, on a fresh env/pull, I'm not observing this dependence (though I'm not sure why; it might be my own issue; data is reported at the bottom).
To reproduce
Steps to reproduce the behavior:
- On 4.3.0, use the run script above, with and without the --fp16 option. I tried different model sizes, and with/without model.parallelize() added, since I wasn't sure whether that was the issue.
Data
Below are three cases of memory usage with/without --fp16:
- with model.parallelize()
- without model.parallelize() (but all GPUs still visible; as extra info, I thought it was interesting that it still takes up memory on the other GPUs)
- without model.parallelize() (only 1 GPU visible)
*** WITH MODEL.PARALLELIZE() ***
t5-3b, --max_source_length 128 --max_target_length 128
WITHOUT --fp16
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:05:00.0 Off | 0 |
| N/A 28C P0 77W / 400W | 13598MiB / 40537MiB | 33% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:06:00.0 Off | 0 |
| N/A 29C P0 80W / 400W | 12874MiB / 40537MiB | 25% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 24C P0 81W / 400W | 12874MiB / 40537MiB | 4% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:08:00.0 Off | 0 |
| N/A 25C P0 80W / 400W | 12874MiB / 40537MiB | 23% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
WITH --fp16: (takes more memory)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:05:00.0 Off | 0 |
| N/A 27C P0 108W / 400W | 15138MiB / 40537MiB | 6% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:06:00.0 Off | 0 |
| N/A 28C P0 99W / 400W | 14214MiB / 40537MiB | 9% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 23C P0 85W / 400W | 14214MiB / 40537MiB | 12% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:08:00.0 Off | 0 |
| N/A 25C P0 92W / 400W | 14216MiB / 40537MiB | 11% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
*** WITHOUT MODEL.PARALLELIZE, but all GPUs still visible ( CUDA_VISIBLE_DEVICES=0,1,2,3 ) ***
t5-large, --max_source_length 128 --max_target_length 128
WITHOUT --fp16
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:05:00.0 Off | 0 |
| N/A 28C P0 93W / 400W | 20362MiB / 40537MiB | 1% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:06:00.0 Off | 0 |
| N/A 27C P0 78W / 400W | 6046MiB / 40537MiB | 3% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 22C P0 78W / 400W | 6046MiB / 40537MiB | 3% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:08:00.0 Off | 0 |
| N/A 24C P0 79W / 400W | 6022MiB / 40537MiB | 7% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
t5-large, --max_source_length 128 --max_target_length 128
WITH --fp16
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:05:00.0 Off | 0 |
| N/A 28C P0 91W / 400W | 20318MiB / 40537MiB | 2% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:06:00.0 Off | 0 |
| N/A 27C P0 80W / 400W | 7304MiB / 40537MiB | 4% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 23C P0 78W / 400W | 7304MiB / 40537MiB | 5% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:08:00.0 Off | 0 |
| N/A 24C P0 79W / 400W | 7280MiB / 40537MiB | 5% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
*** WITHOUT MODEL.PARALLELIZE, ONLY 1 GPU VISIBLE ( CUDA_VISIBLE_DEVICES=0 ) ***
t5-large, --max_source_length 128 --max_target_length 128
WITHOUT --fp16
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:05:00.0 Off | 0 |
| N/A 29C P0 101W / 400W | 13790MiB / 40537MiB | 32% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:06:00.0 Off | 0 |
| N/A 26C P0 71W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 21C P0 71W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:08:00.0 Off | 0 |
| N/A 23C P0 71W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
t5-large, --max_source_length 128 --max_target_length 128
WITH --fp16 (more memory)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:05:00.0 Off | 0 |
| N/A 28C P0 101W / 400W | 15012MiB / 40537MiB | 42% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:06:00.0 Off | 0 |
| N/A 26C P0 70W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 21C P0 71W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:08:00.0 Off | 0 |
| N/A 23C P0 71W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
WRT the sequence length vs memory dependence, in a model-parallel setting, I am observing the following today (varying --max_source_length and --max_target_length):
model, seq length, gpu0, gpu1, gpu2, gpu3
t5-large, 32,  5.7GB / 4.7GB / 4.7GB / 4.7GB
t5-large, 64,  5.7GB / 4.7GB / 4.7GB / 4.7GB
t5-large, 128, 5.8GB / 4.8GB / 4.8GB / 4.8GB
t5-large, 512, 6.0GB / 5.2GB / 5.2GB / 5.2GB
t5-3b, 64,  15.2GB / 14.3GB / 14.3GB / 14.3GB
t5-3b, 128, 15.2GB / 14.3GB / 14.3GB / 14.3GB
t5-3b, 256, 15.5GB / 14.7GB / 14.7GB / 14.7GB
t5-3b, 512, 16.2GB / 15.2GB / 15.2GB / 15.2GB
Essentially, there is very minimal change in memory requirements vs sequence length, though perhaps I have misconfigured something here.
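As a sanity check, I could also read PyTorch's own allocator statistics, which exclude the CUDA context and cached-but-free blocks that nvidia-smi counts. A rough sketch (report_gpu_memory is a hypothetical helper, not part of finetune_trainer.py), to be called e.g. right after trainer.train():

```python
import torch

def report_gpu_memory():
    # Hypothetical helper: print the peak memory PyTorch itself allocated and
    # reserved on each visible GPU, as a cross-check of the nvidia-smi totals.
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.max_memory_allocated(i) / 2**30
        reserved = torch.cuda.max_memory_reserved(i) / 2**30
        print(f"GPU {i}: peak allocated {allocated:.1f} GiB, "
              f"peak reserved {reserved:.1f} GiB")
```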
Expected behavior
- Less memory usage with --fp16 (should it be about half, as suggested in https://github.com/huggingface/transformers/issues/8403#issuecomment-725562117 ?)
- (Optional) Nominally, runs with smaller sequence lengths taking up significantly less memory?
Sleeping on it, I would like to amend my first statement. The components taking up GPU memory during training are:
- the model weights,
- the forward activations saved for the backward pass,
- the gradients,
- the optimizer state (for Adam, two extra full-precision tensors per parameter).
If we look at what's happening with FP16 training (mixed precision), we have:
- the model weights in both half and full precision (the full-precision copy is used for the optimizer update),
- the forward activations saved in half precision,
- the gradients computed in half precision and also stored in full precision for the update,
- the optimizer state in full precision.
So the savings only happen for the forward activations saved for the backward computation, and there is a slight overhead because the gradients are stored in both half and full precision. (This is probably over-simplified, but I think it's enough to explain what follows.)
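To make that concrete, here is a back-of-envelope per-parameter count following the accounting above (the byte counts and rounded parameter sizes are my own approximations; activations and the CUDA context are ignored):

```python
# Rough per-parameter memory for Adam training, following the accounting above
# (activations excluded); an approximation, not an exact model of what
# apex/AMP allocates.
BYTES_FP32 = 4 + 4 + 8               # weights + grads + 2 Adam states, all fp32
BYTES_MIXED = (2 + 4) + (2 + 4) + 8  # weights and grads in half+full precision, Adam states in fp32

for name, n_params in [("t5-large", 0.77e9), ("t5-3b", 2.85e9)]:  # approximate sizes
    fp32_gib = n_params * BYTES_FP32 / 2**30
    mixed_gib = n_params * BYTES_MIXED / 2**30
    print(f"{name}: ~{fp32_gib:.0f} GiB fp32 vs ~{mixed_gib:.0f} GiB mixed precision (before activations)")
```

Under those assumptions, mixed precision is heavier per parameter (20 bytes vs 16) before any activations are counted, which is consistent with the small increase observed above at batch size 1.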
Now let's look at a simple text-classification fine-tuning on 2 GPUs.
Since the only savings we get are in the model activations saved for the backward pass, it's logical that the bigger those activations are, the bigger the saving will be. If we try different batch sizes, that is indeed what I get (measured with nvidia-smi, so not completely reliable as said above, but it is a fair comparison).
So there is only a real memory saving if we train at a high batch size (and even then it's not half), and at batch sizes lower than 8 you actually get a bigger memory footprint (because of the overhead mentioned above). The gain of FP16 training is that, in each of those cases, training with the --fp16 flag is twice as fast. That speed-up does require every tensor to have every dimension be a multiple of 8, so if your batch size is not a multiple of 8 you won't get it, and the script finetune_trainer.py does not pad the tensors to a sequence length that is a multiple of 8 (a sketch of such padding is shown below).
TL;DR: FP16 with apex or AMP will only give you some memory savings with a reasonably high batch size.
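For completeness, a hedged sketch (not something finetune_trainer.py does) of one way to pad inputs to a multiple of 8 via the tokenizer's collator so the fp16 speed-up can kick in; a real seq2seq run would also need the labels handled the same way:

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

# Sketch only: pad_to_multiple_of=8 rounds the padded sequence length up to a
# multiple of 8 so fp16 matmuls can use tensor cores.
tokenizer = AutoTokenizer.from_pretrained("t5-large")
collator = DataCollatorWithPadding(tokenizer, padding=True, pad_to_multiple_of=8)

features = [tokenizer(text) for text in
            ["translate English to German: Hello world.",
             "summarize: a slightly longer example sentence to pad."]]
batch = collator(features)
print(batch["input_ids"].shape)  # sequence dimension is a multiple of 8
```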
Done: https://github.com/huggingface/transformers/issues/9824