[trainer] figuring out why eval with `--fp16_full_eval` is 25% slower
Recently the HF Trainer was extended to support full fp16 evaluation via `--fp16_full_eval`. I'd have expected it to be either equal to or faster than eval with the fp32 model, but surprisingly I noticed a 25% slowdown when using it.
This may or may not impact DeepSpeed as well, which also runs eval in fp16, but there we can't compare against an fp32 baseline, since DeepSpeed only runs in fp16.
I wonder if someone would like to research where the slowdown comes from.
I'd probably isolate the `model.half()` call, which should be a constant cost, and focus on the rest of the eval. I'm thinking that some component doesn't handle fp16 variables well. For example, label smoothing was problematic and should now be fixed by https://github.com/huggingface/transformers/pull/10815, but I tested with and without label smoothing and it isn't adding to the slowdown.
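To take the Trainer out of the picture, a minimal standalone sketch along these lines (the model choice and input are just illustrative, and it assumes a CUDA device) would time the bare forward pass in fp32 vs fp16; if fp16 already loses here, the slowdown is in the model/kernels rather than in the eval loop:

```python
# Standalone micro-benchmark, not the Trainer code path: compare a t5-small
# forward pass in fp32 vs fp16 on the same inputs.
import time
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = "cuda"
tok = T5Tokenizer.from_pretrained("t5-small")
inputs = tok("translate English to Romanian: hello world", return_tensors="pt").to(device)

def bench(model, steps=100):
    with torch.no_grad():
        for _ in range(10):                        # warmup
            model(**inputs, decoder_input_ids=inputs["input_ids"])
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(steps):
            model(**inputs, decoder_input_ids=inputs["input_ids"])
        torch.cuda.synchronize()                   # measure GPU time, not just launch time
    return (time.time() - t0) / steps

model = T5ForConditionalGeneration.from_pretrained("t5-small").to(device).eval()
print("fp32:", bench(model))
print("fp16:", bench(model.half()))                # .half() is in-place, so run fp32 first
```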
Here are the script and the corresponding metrics.
First, w/o `--fp16_full_eval`:
export BS=16; rm -r output_dir; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0 \
./examples/seq2seq/run_translation.py --model_name_or_path t5-small --output_dir /tmp/zero3 \
--overwrite_output_dir --max_train_samples 10 --max_val_samples 100 --max_source_length 12 \
--max_target_length 128 --val_max_target_length 128 --do_train --num_train_epochs 1 \
--per_device_train_batch_size 2 --learning_rate 3e-3 --warmup_steps 8 --predict_with_generate \
--logging_steps 0 --save_steps 2 --eval_steps 1 --group_by_length --adafactor --dataset_name wmt16 \
--dataset_config ro-en --source_lang en --target_lang ro \
--source_prefix "translate English to Romanian: " --do_eval
***** train metrics *****
epoch = 1.0
init_mem_cpu_alloc_delta = 2MB
init_mem_cpu_peaked_delta = 0MB
init_mem_gpu_alloc_delta = 230MB
init_mem_gpu_peaked_delta = 0MB
train_mem_cpu_alloc_delta = 60MB
train_mem_cpu_peaked_delta = 63MB
train_mem_gpu_alloc_delta = 231MB
train_mem_gpu_peaked_delta = 194MB
train_runtime = 7.7162
train_samples = 10
train_samples_per_second = 0.648
***** eval metrics *****
epoch = 1.0
eval_bleu = 2.4612
eval_gen_len = 18.53
eval_loss = 5.017
eval_mem_cpu_alloc_delta = 0MB
eval_mem_cpu_peaked_delta = 0MB
eval_mem_gpu_alloc_delta = 0MB
eval_mem_gpu_peaked_delta = 244MB
eval_runtime = 4.6481
eval_samples = 100
eval_samples_per_second = 21.514
Now let's add `--fp16_full_eval`:
export BS=16; rm -r output_dir; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0 \
./examples/seq2seq/run_translation.py --model_name_or_path t5-small --output_dir /tmp/zero3 \
--overwrite_output_dir --max_train_samples 10 --max_val_samples 100 --max_source_length 12 \
--max_target_length 128 --val_max_target_length 128 --do_train --num_train_epochs 1 \
--per_device_train_batch_size 2 --learning_rate 3e-3 --warmup_steps 8 --predict_with_generate \
--logging_steps 0 --save_steps 2 --eval_steps 1 --group_by_length --adafactor --dataset_name wmt16 \
--dataset_config ro-en --source_lang en --target_lang ro \
--source_prefix "translate English to Romanian: " --do_eval \
--fp16_full_eval
***** train metrics *****
epoch = 1.0
init_mem_cpu_alloc_delta = 2MB
init_mem_cpu_peaked_delta = 0MB
init_mem_gpu_alloc_delta = 230MB
init_mem_gpu_peaked_delta = 0MB
train_mem_cpu_alloc_delta = 60MB
train_mem_cpu_peaked_delta = 63MB
train_mem_gpu_alloc_delta = 231MB
train_mem_gpu_peaked_delta = 194MB
train_runtime = 7.1477
train_samples = 10
train_samples_per_second = 0.7
***** eval metrics *****
epoch = 1.0
eval_bleu = 2.4612
eval_gen_len = 18.53
eval_loss = 5.0168
eval_mem_cpu_alloc_delta = 0MB
eval_mem_cpu_peaked_delta = 0MB
eval_mem_gpu_alloc_delta = -231MB
eval_mem_gpu_peaked_delta = 262MB
eval_runtime = 6.0125
eval_samples = 100
eval_samples_per_second = 16.632
As you can see, w/o `--fp16_full_eval` we get ~22 samples per second and with it only ~17 - that's a huge difference.
I also tested with a larger sample and the gap remains constant.
The halving happens here:
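(Roughly, the Trainer logic in question looks like the following; this is paraphrased rather than quoted, so the exact code in `trainer.py` may differ.)

```python
# Paraphrase of the relevant part of Trainer's prediction loop (not verbatim):
# when eval runs outside of training and --fp16_full_eval is set, the whole
# model is cast to fp16 up front, before any prediction steps run.
if not self.is_in_train and self.args.fp16_full_eval:
    model = model.half().to(self.args.device)
```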
Thank you!
Top GitHub Comments
Great benchmark of the different data types, thanks for sharing.
I’ve just tested the same script with some of the mbart variants and as expected, fp16 is faster for those.
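For context, the kind of data-type comparison referenced above can be approximated with a raw matmul benchmark like this sketch (not the original benchmark script; shapes and step counts are arbitrary):

```python
# Rough matmul throughput comparison across dtypes on the current GPU,
# to rule out the kernels themselves as the source of the eval slowdown.
import time
import torch

def matmul_bench(dtype, n=4096, steps=50):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):            # warmup
        a @ b
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(steps):
        a @ b
    torch.cuda.synchronize()
    return (time.time() - t0) / steps

for dt in (torch.float32, torch.float16):
    print(dt, f"{matmul_bench(dt) * 1e3:.2f} ms/matmul")
```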
By running everything with `CUDA_LAUNCH_BLOCKING=1` under the line profiler, I found that this and this check for infinite values take up more time than I expected. After removing those checks, this is what I end up with:
The same with `--fp16_full_eval`:

Note that I had to dial up the number of eval examples, since this measurement was quite noisy on the shared system I used. However, fp16 was faster most of the time. If someone could double-check these observations under more reliable circumstances, that would be great.