
[Benchmark] HF Trainer on RTX-3090


🖥 Benchmarking transformers w/ HF Trainer on RTX-3090

We are going to use a special benchmarking tool that will do all the work for us. https://github.com/huggingface/transformers/pull/14934
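
To give a feel for the invocation pattern before diving into the individual posts, here is a minimal sketch (the flags are the ones used in the full commands later in this thread; the quoted base command and the fp16-vs-bf16 variation are just placeholders for this example):

# sketch only: <the fixed training args> stands in for the full run_translation.py arguments shown below
CUDA_VISIBLE_DEVICES=0 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd 'examples/pytorch/translation/run_translation.py <the fixed training args>' \
--variations '--fp16|--bf16' \
--target-metric-key train_samples_per_second \
--report-metric-keys train_loss \
--repeat-times 1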

This is the index post and specific benchmarks are in their own posts below:

  1. fp16 vs bf16 vs tf32 vs fp32
  2. gradient accumulation steps
  3. gradient checkpointing
  4. batch size
  5. optimizers
  6. combining winning strategies ~2x speed improvement!
  7. RTX-3090 vs A100

See also the same benchmarks for A100

TODO:

  • other suggestions?

Note that each benchmark was run only once; multiple runs with averaging would probably give slightly different results. The purpose here is to get a rough sense of the relative differences, not exact numbers.


Top GitHub Comments

1 reaction
stas00 commented, Jan 13, 2022

combining winning strategies

Now let’s combine the winning strategies from each individual benchmark above and compare with the baseline:

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch --gradient_accumulation_steps 1 --tf32 0 | 93.40 | 0 | 2.20 |
| --optim adamw_apex_fused --gradient_accumulation_steps 8 --tf32 --bf16 | 178.90 | 92 | 2.62 |

Getting an almost 2x improvement in speed!

CUDA_VISIBLE_DEVICES=0 python \
/hf/transformers-trainer-benchmark/scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' \
examples/pytorch/translation/run_translation.py --model_name_or_path t5-base --output_dir output_dir \
--do_train --label_smoothing 0.1 --logging_strategy no --save_strategy no --per_device_train_batch_size 16 \
--max_source_length 512 --max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: "  --warmup_steps 50 \
--max_train_samples 10000 --dataloader_num_workers 2 \
' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'--optim adamw_torch --gradient_accumulation_steps 1 --tf32 0|--optim adamw_apex_fused --gradient_accumulation_steps 8 --tf32 --bf16' \
--report-metric-keys train_loss
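
For reference, here is what the winning combination looks like as a plain training run, assembled from the base command and the fastest variation above (a sketch, not a command copied verbatim from the thread; adamw_apex_fused additionally requires NVIDIA apex to be installed):

# sketch: the fixed base args with the winning variation appended
CUDA_VISIBLE_DEVICES=0 python examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-base --output_dir output_dir --do_train \
--label_smoothing 0.1 --logging_strategy no --save_strategy no \
--per_device_train_batch_size 16 --max_source_length 512 --max_target_length 512 \
--num_train_epochs 1 --overwrite_output_dir --source_lang en --target_lang ro \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 10000 --dataloader_num_workers 2 \
--optim adamw_apex_fused --gradient_accumulation_steps 8 --tf32 --bf16
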
1 reaction
stas00 commented, Jan 5, 2022

gradient accumulation steps

Let’s choose the t5-base model to test with, as it’s pretty large yet doesn’t overflow the way t5-large does.

Let’s measure --gradient_accumulation_steps 1,2,4,8,16 with different precision configurations.

*** Results:

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --gradient_accumulation_steps 1 --tf32 0 | 96.17 | 0 | 2.20 |
| --gradient_accumulation_steps 1 --tf32 1 | 116.57 | 21 | 2.20 |
| --gradient_accumulation_steps 1 --tf32 0 --fp16 | 132.64 | 38 | 2.20 |
| --gradient_accumulation_steps 1 --tf32 0 --bf16 | 136.35 | 42 | 2.21 |
| --gradient_accumulation_steps 2 --tf32 0 | 103.83 | 8 | 2.28 |
| --gradient_accumulation_steps 2 --tf32 1 | 130.11 | 35 | 2.28 |
| --gradient_accumulation_steps 2 --tf32 0 --fp16 | 153.09 | 59 | 2.28 |
| --gradient_accumulation_steps 2 --tf32 0 --bf16 | 156.70 | 63 | 2.29 |
| --gradient_accumulation_steps 4 --tf32 0 | 108.48 | 13 | 2.39 |
| --gradient_accumulation_steps 4 --tf32 1 | 137.75 | 43 | 2.40 |
| --gradient_accumulation_steps 4 --tf32 0 --fp16 | 164.48 | 71 | 2.40 |
| --gradient_accumulation_steps 4 --tf32 0 --bf16 | 170.01 | 77 | 2.42 |
| --gradient_accumulation_steps 8 --tf32 0 | 111.14 | 16 | 2.57 |
| --gradient_accumulation_steps 8 --tf32 1 | 141.59 | 47 | 2.57 |
| --gradient_accumulation_steps 8 --tf32 0 --fp16 | 170.77 | 78 | 2.57 |
| --gradient_accumulation_steps 8 --tf32 0 --bf16 | 177.59 | 85 | 2.62 |
| --gradient_accumulation_steps 16 --tf32 0 | 112.65 | 17 | 2.81 |
| --gradient_accumulation_steps 16 --tf32 1 | 143.89 | 50 | 2.81 |
| --gradient_accumulation_steps 16 --tf32 0 --fp16 | 173.69 | 81 | 2.81 |
| --gradient_accumulation_steps 16 --tf32 0 --bf16 | 181.04 | 88 | 2.86 |

Let’s filter out just one subset so that it’s easier to compare the gradient accumulation differences on their own, re-running with only bf16 enabled (--tf32 0 --bf16); a reconstruction of that command follows the table:

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --gradient_accumulation_steps 1 | 135.85 | 0 | 2.21 |
| --gradient_accumulation_steps 2 | 156.95 | 16 | 2.29 |
| --gradient_accumulation_steps 4 | 167.65 | 23 | 2.42 |
| --gradient_accumulation_steps 8 | 175.02 | 29 | 2.62 |
| --gradient_accumulation_steps 16 | 179.15 | 32 | 2.86 |
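
The filtered numbers above came from a re-run along these lines (a reconstruction rather than the exact command from the thread: the same base command with --tf32 0 --bf16 folded in, varying only the accumulation steps):

# reconstruction: fixed args as in the full command below, plus --tf32 0 --bf16
CUDA_VISIBLE_DEVICES=0 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' examples/pytorch/translation/run_translation.py --model_name_or_path t5-base \
--output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \
--save_strategy no --per_device_train_batch_size 16 --max_source_length 512 \
--max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 10000 --dataloader_num_workers 2 --tf32 0 --bf16 ' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'--gradient_accumulation_steps 1|--gradient_accumulation_steps 2|--gradient_accumulation_steps 4|--gradient_accumulation_steps 8|--gradient_accumulation_steps 16' \
--report-metric-keys train_loss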

Conclusions:

  • that’s a significant speed-up even at just 4 accumulation steps
  • notice that the loss gets noticeably worse at higher accumulation steps. This benchmark is very short, and with larger effective batches there are fewer optimizer steps to take, so the model simply doesn’t get a chance to step down far enough; the same effect can be observed with plain batch-size changes. The non-zero learning-rate warm-up also plays a role, since it’s such a short run (see the quick calculation below).
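
To put numbers on the warm-up point, using the settings from the command below: --max_train_samples 10000 with --per_device_train_batch_size 16 gives 10000 / 16 = 625 micro-batches per epoch, so --gradient_accumulation_steps 16 leaves only about 625 / 16 ≈ 39 optimizer steps, fewer than the 50 --warmup_steps, whereas --gradient_accumulation_steps 1 gets the full 625 steps to drive the loss down.
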
*** Setup:


Datetime    : 2022-01-03 14:53:02

Software:
transformers: 4.16.0.dev0
torch       : 1.10.1
cuda        : 11.3
python      : 3.8.11

Hardware:
1 GPUs      : NVIDIA GeForce RTX 3090, 23.70GB


*** The benchmark command line was:

CUDA_VISIBLE_DEVICES=0 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' examples/pytorch/translation/run_translation.py --model_name_or_path t5-base \
--output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \
--save_strategy no --per_device_train_batch_size 16 --max_source_length 512 \
--max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 10000 --dataloader_num_workers 2 ' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'--gradient_accumulation_steps 1|--gradient_accumulation_steps 2|--gradient_accumulation_steps 4|--gradient_accumulation_steps 8|--gradient_accumulation_steps 16' \
'--tf32 0|--tf32 1|--tf32 0 --fp16|--tf32 0 --bf16' --report-metric-keys \
train_loss
