--fp16 fine-tuning appears to be taking more memory (4.3.0).
Environment info
- transformers version: 4.3.0.dev0
- Platform: Linux-5.4.0-62-generic-x86_64-with-debian-bullseye-sid
- Python version: 3.7.6
- PyTorch version (GPU?): 1.7.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: 4x A100-SXM4-40GB
- Using distributed or parallel set-up in script?: Yes
Information
Model I am using (Bert, XLNet …): T5
The problem arises when using:
- my own modified scripts (details below): the official trainer script, optionally modified by adding model.parallelize() after loading the model. Results are shown with and without it; a minimal sketch of the modification follows this list.
The task I am working on is:
- an official GLUE/SQuAD task: regular seq2seq on the data.
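For reference, a minimal sketch of that modification (the checkpoint name here is illustrative; in finetune_trainer.py the model actually comes from --model_name_or_path):

```python
from transformers import T5ForConditionalGeneration

# Sketch of the only change: call parallelize() right after loading.
# With no device_map argument, T5 spreads its blocks evenly across all
# visible GPUs.
model = T5ForConditionalGeneration.from_pretrained("t5-large")
model.parallelize()
```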
Run script:
export BS=1; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1,2,3 ./finetune_trainer.py --model_name_or_path t5-large --output_dir output_dir --adam_eps 1e-06 --data_dir xsum --fp16 \
--do_train --learning_rate 3e-5 \
--logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 \
--overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS \
--warmup_steps 5
Brief summary
- When fine-tuning T5, I'm observing memory usage increase when using --fp16, though not by as much as previously reported in #8403.
- (Optional) Possibly related: I'm trying to squeeze T5-11B onto 4x 40GB A100s using model parallelism. I seemed to be able to do it yesterday on 4.1.1 with a sequence length of 128, and I remember observing a fairly moderate dependence of memory usage on sequence length (as expected from the comment at https://github.com/huggingface/transformers/issues/8771#issuecomment-764058315, though I'm not an expert and I'm not sure whether that increase only applies beyond 512 tokens, or whether what I saw yesterday was a fluke/error on my part). Today, on a fresh env/pull, I'm not observing this dependence (though I'm not sure why; it might be my own issue; data is reported at the bottom).
To reproduce
Steps to reproduce the behavior:
- On 4.3.0, use the run script above, with and without the --fp16 option. I tried different model sizes, and with/without model.parallelize() added, since I wasn't sure whether that was the issue.
Data
Below are three cases of memory usage with/without --fp16:
- with model.parallelize()
- without model.parallelize() (but all GPUs still visible; as extra info, I thought it was interesting that it still takes up memory on the other GPUs)
- without model.parallelize() (only 1 GPU visible)
*** WITH MODEL.PARALLELIZE() ***
t5-3b, --max_source_length 128 --max_target_length 128
WITHOUT --fp16
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:05:00.0 Off | 0 |
| N/A 28C P0 77W / 400W | 13598MiB / 40537MiB | 33% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:06:00.0 Off | 0 |
| N/A 29C P0 80W / 400W | 12874MiB / 40537MiB | 25% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 24C P0 81W / 400W | 12874MiB / 40537MiB | 4% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:08:00.0 Off | 0 |
| N/A 25C P0 80W / 400W | 12874MiB / 40537MiB | 23% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
WITH --fp16: (takes more memory)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:05:00.0 Off | 0 |
| N/A 27C P0 108W / 400W | 15138MiB / 40537MiB | 6% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:06:00.0 Off | 0 |
| N/A 28C P0 99W / 400W | 14214MiB / 40537MiB | 9% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 23C P0 85W / 400W | 14214MiB / 40537MiB | 12% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:08:00.0 Off | 0 |
| N/A 25C P0 92W / 400W | 14216MiB / 40537MiB | 11% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
*** WITHOUT MODEL.PARALLELIZE, but all GPUs still visible ( CUDA_VISIBLE_DEVICES=0,1,2,3 ) ***
t5-large, --max_source_length 128 --max_target_length 128
WITHOUT --fp16
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:05:00.0 Off | 0 |
| N/A 28C P0 93W / 400W | 20362MiB / 40537MiB | 1% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:06:00.0 Off | 0 |
| N/A 27C P0 78W / 400W | 6046MiB / 40537MiB | 3% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 22C P0 78W / 400W | 6046MiB / 40537MiB | 3% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:08:00.0 Off | 0 |
| N/A 24C P0 79W / 400W | 6022MiB / 40537MiB | 7% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
t5-large, --max_source_length 128 --max_target_length 128
WITH --fp16
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:05:00.0 Off | 0 |
| N/A 28C P0 91W / 400W | 20318MiB / 40537MiB | 2% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:06:00.0 Off | 0 |
| N/A 27C P0 80W / 400W | 7304MiB / 40537MiB | 4% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 23C P0 78W / 400W | 7304MiB / 40537MiB | 5% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:08:00.0 Off | 0 |
| N/A 24C P0 79W / 400W | 7280MiB / 40537MiB | 5% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
*** WITHOUT MODEL.PARALLELIZE, ONLY 1 GPU VISIBLE ( CUDA_VISIBLE_DEVICES=0 ) ***
t5-large, --max_source_length 128 --max_target_length 128
WITHOUT --fp16
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:05:00.0 Off | 0 |
| N/A 29C P0 101W / 400W | 13790MiB / 40537MiB | 32% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:06:00.0 Off | 0 |
| N/A 26C P0 71W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 21C P0 71W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:08:00.0 Off | 0 |
| N/A 23C P0 71W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
t5-large, --max_source_length 128 --max_target_length 128
WITH --fp16 (more memory)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:05:00.0 Off | 0 |
| N/A 28C P0 101W / 400W | 15012MiB / 40537MiB | 42% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:06:00.0 Off | 0 |
| N/A 26C P0 70W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 21C P0 71W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:08:00.0 Off | 0 |
| N/A 23C P0 71W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
WRT the sequence length vs memory dependence, in a model-parallel setting, I am observing the following today (varying --max_source_length and --max_target_length):
model, seq length, gpu0, gpu1, gpu2, gpu3
t5-large, 32,  5.7GB / 4.7GB / 4.7GB / 4.7GB
t5-large, 64,  5.7GB / 4.7GB / 4.7GB / 4.7GB
t5-large, 128, 5.8GB / 4.8GB / 4.8GB / 4.8GB
t5-large, 512, 6.0GB / 5.2GB / 5.2GB / 5.2GB
t5-3b, 64,  15.2GB / 14.3GB / 14.3GB / 14.3GB
t5-3b, 128, 15.2GB / 14.3GB / 14.3GB / 14.3GB
t5-3b, 256, 15.5GB / 14.7GB / 14.7GB / 14.7GB
t5-3b, 512, 16.2GB / 15.2GB / 15.2GB / 15.2GB
Essentially, there is very minimal change in memory requirements vs sequence length, though perhaps I have misconfigured something here.
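As a sanity check, I could also read PyTorch's own allocator statistics, which exclude the CUDA context and cached-but-free blocks that nvidia-smi counts. A rough sketch (report_gpu_memory is a hypothetical helper, not part of finetune_trainer.py), to be called e.g. right after trainer.train():

```python
import torch

def report_gpu_memory():
    # Hypothetical helper: print the peak memory PyTorch itself allocated and
    # reserved on each visible GPU, as a cross-check of the nvidia-smi totals.
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.max_memory_allocated(i) / 2**30
        reserved = torch.cuda.max_memory_reserved(i) / 2**30
        print(f"GPU {i}: peak allocated {allocated:.1f} GiB, "
              f"peak reserved {reserved:.1f} GiB")
```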
Expected behavior
- Less memory usage with --fp16 (should it be about half, as suggested in https://github.com/huggingface/transformers/issues/8403#issuecomment-725562117 ?)
- (Optional) Nominally, runs with smaller sequence lengths taking up significantly less memory?
Sleeping on it, I would like to amend my first statement. The components taking up GPU memory during training are:
- the model weights,
- the forward activations saved for the backward pass,
- the gradients,
- the optimizer state (for Adam, two extra full-precision tensors per parameter).
If we look at what's happening with FP16 training (mixed precision), we have:
- the model weights in both half and full precision (the full-precision copy is used for the optimizer update),
- the forward activations saved in half precision,
- the gradients computed in half precision and also stored in full precision for the update,
- the optimizer state in full precision.
So the savings only happen for the forward activations saved for the backward computation, and there is a slight overhead because the gradients are stored in both half and full precision. (This is probably over-simplified, but I think it's enough to explain what follows.)
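To make that concrete, here is a back-of-envelope per-parameter count following the accounting above (the byte counts and rounded parameter sizes are my own approximations; activations and the CUDA context are ignored):

```python
# Rough per-parameter memory for Adam training, following the accounting above
# (activations excluded); an approximation, not an exact model of what
# apex/AMP allocates.
BYTES_FP32 = 4 + 4 + 8               # weights + grads + 2 Adam states, all fp32
BYTES_MIXED = (2 + 4) + (2 + 4) + 8  # weights and grads in half+full precision, Adam states in fp32

for name, n_params in [("t5-large", 0.77e9), ("t5-3b", 2.85e9)]:  # approximate sizes
    fp32_gib = n_params * BYTES_FP32 / 2**30
    mixed_gib = n_params * BYTES_MIXED / 2**30
    print(f"{name}: ~{fp32_gib:.0f} GiB fp32 vs ~{mixed_gib:.0f} GiB mixed precision (before activations)")
```

Under those assumptions, mixed precision is heavier per parameter (20 bytes vs 16) before any activations are counted, which is consistent with the small increase observed above at batch size 1.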
Now let's look at a simple text-classification fine-tuning on 2 GPUs.
Since the only savings we get are in the model activations saved for the backward pass, it's logical that the bigger those activations are, the bigger the saving will be. If we try different batch sizes, that is indeed what I get (measured with nvidia-smi, so not completely reliable as said above, but it is a fair comparison).
So there is only a real memory saving if we train at a high batch size (and even then it's not half), and at batch sizes lower than 8 you actually get a bigger memory footprint (because of the overhead mentioned above). The gain of FP16 training is that, in each of those cases, training with the --fp16 flag is twice as fast. That speed-up does require every tensor to have every dimension be a multiple of 8, so if your batch size is not a multiple of 8 you won't get it, and the script finetune_trainer.py does not pad the tensors to a sequence length that is a multiple of 8 (a sketch of such padding is shown below).
TL;DR: FP16 with apex or AMP will only give you some memory savings with a reasonably high batch size.
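For completeness, a hedged sketch (not something finetune_trainer.py does) of one way to pad inputs to a multiple of 8 via the tokenizer's collator so the fp16 speed-up can kick in; a real seq2seq run would also need the labels handled the same way:

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

# Sketch only: pad_to_multiple_of=8 rounds the padded sequence length up to a
# multiple of 8 so fp16 matmuls can use tensor cores.
tokenizer = AutoTokenizer.from_pretrained("t5-large")
collator = DataCollatorWithPadding(tokenizer, padding=True, pad_to_multiple_of=8)

features = [tokenizer(text) for text in
            ["translate English to German: Hello world.",
             "summarize: a slightly longer example sentence to pad."]]
batch = collator(features)
print(batch["input_ids"].shape)  # sequence dimension is a multiple of 8
```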
Done: https://github.com/huggingface/transformers/issues/9824