[trainer] figuring out why eval with `--fp16_full_eval` is 25% slower
Recently the HF Trainer was extended to support full fp16 evaluation via `--fp16_full_eval`. I'd have expected it to be either equal to or faster than eval with the fp32 model, but surprisingly I noticed a 25% slowdown when using it.
This may or may not impact DeepSpeed as well, which also runs eval in fp16, but there we can't compare against an fp32 baseline, since DeepSpeed only runs in fp16.
I wonder if someone would like to research where the slowdown comes from.
I'd probably isolate the `model.half()` call, which should be a constant cost, and focus on the rest of the eval. I'm thinking that some component doesn't handle fp16 variables well. For example, label smoothing was problematic and should now be fixed by https://github.com/huggingface/transformers/pull/10815, but I tested with and without label smoothing and it isn't adding to the slowdown.
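To take the Trainer out of the picture, a minimal standalone sketch along these lines (the model choice and input are just illustrative, and it assumes a CUDA device) would time the bare forward pass in fp32 vs fp16; if fp16 already loses here, the slowdown is in the model/kernels rather than in the eval loop:

```python
# Standalone micro-benchmark, not the Trainer code path: compare a t5-small
# forward pass in fp32 vs fp16 on the same inputs.
import time
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = "cuda"
tok = T5Tokenizer.from_pretrained("t5-small")
inputs = tok("translate English to Romanian: hello world", return_tensors="pt").to(device)

def bench(model, steps=100):
    with torch.no_grad():
        for _ in range(10):                        # warmup
            model(**inputs, decoder_input_ids=inputs["input_ids"])
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(steps):
            model(**inputs, decoder_input_ids=inputs["input_ids"])
        torch.cuda.synchronize()                   # measure GPU time, not just launch time
    return (time.time() - t0) / steps

model = T5ForConditionalGeneration.from_pretrained("t5-small").to(device).eval()
print("fp32:", bench(model))
print("fp16:", bench(model.half()))                # .half() is in-place, so run fp32 first
```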
Here are the script and the corresponding metrics.
First, w/o `--fp16_full_eval`:
export BS=16; rm -r output_dir; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0 \
./examples/seq2seq/run_translation.py --model_name_or_path t5-small --output_dir /tmp/zero3 \
--overwrite_output_dir --max_train_samples 10 --max_val_samples 100 --max_source_length 12 \
--max_target_length 128 --val_max_target_length 128 --do_train --num_train_epochs 1 \
--per_device_train_batch_size 2 --learning_rate 3e-3 --warmup_steps 8 --predict_with_generate \
--logging_steps 0 --save_steps 2 --eval_steps 1 --group_by_length --adafactor --dataset_name wmt16 \
--dataset_config ro-en --source_lang en --target_lang ro \
--source_prefix "translate English to Romanian: " --do_eval
***** train metrics *****
epoch = 1.0
init_mem_cpu_alloc_delta = 2MB
init_mem_cpu_peaked_delta = 0MB
init_mem_gpu_alloc_delta = 230MB
init_mem_gpu_peaked_delta = 0MB
train_mem_cpu_alloc_delta = 60MB
train_mem_cpu_peaked_delta = 63MB
train_mem_gpu_alloc_delta = 231MB
train_mem_gpu_peaked_delta = 194MB
train_runtime = 7.7162
train_samples = 10
train_samples_per_second = 0.648
***** eval metrics *****
epoch = 1.0
eval_bleu = 2.4612
eval_gen_len = 18.53
eval_loss = 5.017
eval_mem_cpu_alloc_delta = 0MB
eval_mem_cpu_peaked_delta = 0MB
eval_mem_gpu_alloc_delta = 0MB
eval_mem_gpu_peaked_delta = 244MB
eval_runtime = 4.6481
eval_samples = 100
eval_samples_per_second = 21.514
Now let's add `--fp16_full_eval`:
export BS=16; rm -r output_dir; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0 \
./examples/seq2seq/run_translation.py --model_name_or_path t5-small --output_dir /tmp/zero3 \
--overwrite_output_dir --max_train_samples 10 --max_val_samples 100 --max_source_length 12 \
--max_target_length 128 --val_max_target_length 128 --do_train --num_train_epochs 1 \
--per_device_train_batch_size 2 --learning_rate 3e-3 --warmup_steps 8 --predict_with_generate \
--logging_steps 0 --save_steps 2 --eval_steps 1 --group_by_length --adafactor --dataset_name wmt16 \
--dataset_config ro-en --source_lang en --target_lang ro \
--source_prefix "translate English to Romanian: " --do_eval \
--fp16_full_eval
***** train metrics *****
epoch = 1.0
init_mem_cpu_alloc_delta = 2MB
init_mem_cpu_peaked_delta = 0MB
init_mem_gpu_alloc_delta = 230MB
init_mem_gpu_peaked_delta = 0MB
train_mem_cpu_alloc_delta = 60MB
train_mem_cpu_peaked_delta = 63MB
train_mem_gpu_alloc_delta = 231MB
train_mem_gpu_peaked_delta = 194MB
train_runtime = 7.1477
train_samples = 10
train_samples_per_second = 0.7
***** eval metrics *****
epoch = 1.0
eval_bleu = 2.4612
eval_gen_len = 18.53
eval_loss = 5.0168
eval_mem_cpu_alloc_delta = 0MB
eval_mem_cpu_peaked_delta = 0MB
eval_mem_gpu_alloc_delta = -231MB
eval_mem_gpu_peaked_delta = 262MB
eval_runtime = 6.0125
eval_samples = 100
eval_samples_per_second = 16.632
As you can see, w/o `--fp16_full_eval` we get ~22 samples per second and with it only ~17 - that's a huge difference.
I also tested with a larger sample and the gap remains constant.
The halving happens here:
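(Roughly, the Trainer logic in question looks like the following; this is paraphrased rather than quoted, so the exact code in `trainer.py` may differ.)

```python
# Paraphrase of the relevant part of Trainer's prediction loop (not verbatim):
# when eval runs outside of training and --fp16_full_eval is set, the whole
# model is cast to fp16 up front, before any prediction steps run.
if not self.is_in_train and self.args.fp16_full_eval:
    model = model.half().to(self.args.device)
```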
Thank you!
Top GitHub Comments
Great benchmark of the different data types, thanks for sharing.
I’ve just tested the same script with some of the mbart variants and as expected, fp16 is faster for those.
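For context, the kind of data-type comparison referenced above can be approximated with a raw matmul benchmark like this sketch (not the original benchmark script; shapes and step counts are arbitrary):

```python
# Rough matmul throughput comparison across dtypes on the current GPU,
# to rule out the kernels themselves as the source of the eval slowdown.
import time
import torch

def matmul_bench(dtype, n=4096, steps=50):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):            # warmup
        a @ b
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(steps):
        a @ b
    torch.cuda.synchronize()
    return (time.time() - t0) / steps

for dt in (torch.float32, torch.float16):
    print(dt, f"{matmul_bench(dt) * 1e3:.2f} ms/matmul")
```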
By running everything with `CUDA_LAUNCH_BLOCKING=1` under the line profiler, I found that this and this check for infinite values take up more time than I expected. After removing those checks, this is what I end up with:
The same with `--fp16_full_eval`:

Note that I had to dial up the number of eval examples, since this measurement was quite noisy on the shared system I used. However, fp16 was faster most of the time. If someone could double-check these observations under more reliable circumstances, that would be great.