
[Benchmark] HF Trainer on RTX-3090


🖥 Benchmarking transformers w/ HF Trainer on RTX-3090

We are going to use a special benchmarking tool that will do all the work for us. https://github.com/huggingface/transformers/pull/14934
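
To give a feel for the invocation pattern before diving into the individual posts, here is a minimal sketch (the flags are the ones used in the full commands later in this thread; the quoted base command and the fp16-vs-bf16 variation are just placeholders for this example):

# sketch only: <the fixed training args> stands in for the full run_translation.py arguments shown below
CUDA_VISIBLE_DEVICES=0 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd 'examples/pytorch/translation/run_translation.py <the fixed training args>' \
--variations '--fp16|--bf16' \
--target-metric-key train_samples_per_second \
--report-metric-keys train_loss \
--repeat-times 1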

This is the index post and specific benchmarks are in their own posts below:

  1. fp16 vs bf16 vs tf32 vs fp32
  2. gradient accumulation steps
  3. gradient checkpointing
  4. batch size
  5. optimizers
  6. combining winning strategies ~2x speed improvement!
  7. RTX-3090 vs A100

See also the same benchmarks for A100

TODO:

  • other suggestions?

Note that each benchmark was run only once; multiple runs with averaging would probably give slightly different results. The purpose here is to get a rough sense of the relative differences, not exact numbers.


Top GitHub Comments

1 reaction
stas00 commented, Jan 13, 2022

combining winning strategies

Now let’s combine the winning strategies from each individual benchmark above and compare with the baseline:

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch --gradient_accumulation_steps 1 --tf32 0 | 93.40 | 0 | 2.20 |
| --optim adamw_apex_fused --gradient_accumulation_steps 8 --tf32 --bf16 | 178.90 | 92 | 2.62 |

Getting an almost 2x improvement in speed!

CUDA_VISIBLE_DEVICES=0 python \
/hf/transformers-trainer-benchmark/scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' \
examples/pytorch/translation/run_translation.py --model_name_or_path t5-base --output_dir output_dir \
--do_train --label_smoothing 0.1 --logging_strategy no --save_strategy no --per_device_train_batch_size 16 \
--max_source_length 512 --max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: "  --warmup_steps 50 \
--max_train_samples 10000 --dataloader_num_workers 2 \
' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'--optim adamw_torch --gradient_accumulation_steps 1 --tf32 0|--optim adamw_apex_fused --gradient_accumulation_steps 8 --tf32 --bf16' \
--report-metric-keys train_loss
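
For reference, here is what the winning combination looks like as a plain training run, assembled from the base command and the fastest variation above (a sketch, not a command copied verbatim from the thread; adamw_apex_fused additionally requires NVIDIA apex to be installed):

# sketch: the fixed base args with the winning variation appended
CUDA_VISIBLE_DEVICES=0 python examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-base --output_dir output_dir --do_train \
--label_smoothing 0.1 --logging_strategy no --save_strategy no \
--per_device_train_batch_size 16 --max_source_length 512 --max_target_length 512 \
--num_train_epochs 1 --overwrite_output_dir --source_lang en --target_lang ro \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 10000 --dataloader_num_workers 2 \
--optim adamw_apex_fused --gradient_accumulation_steps 8 --tf32 --bf16
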
1 reaction
stas00 commented, Jan 5, 2022

gradient accumulation steps

Let’s choose the t5-base model to test with, as it’s pretty large yet doesn’t overflow the way t5-large does.

Let’s measure --gradient_accumulation_steps 1,2,4,8,16 with different precision configurations.

*** Results:

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --gradient_accumulation_steps 1 --tf32 0 | 96.17 | 0 | 2.20 |
| --gradient_accumulation_steps 1 --tf32 1 | 116.57 | 21 | 2.20 |
| --gradient_accumulation_steps 1 --tf32 0 --fp16 | 132.64 | 38 | 2.20 |
| --gradient_accumulation_steps 1 --tf32 0 --bf16 | 136.35 | 42 | 2.21 |
| --gradient_accumulation_steps 2 --tf32 0 | 103.83 | 8 | 2.28 |
| --gradient_accumulation_steps 2 --tf32 1 | 130.11 | 35 | 2.28 |
| --gradient_accumulation_steps 2 --tf32 0 --fp16 | 153.09 | 59 | 2.28 |
| --gradient_accumulation_steps 2 --tf32 0 --bf16 | 156.70 | 63 | 2.29 |
| --gradient_accumulation_steps 4 --tf32 0 | 108.48 | 13 | 2.39 |
| --gradient_accumulation_steps 4 --tf32 1 | 137.75 | 43 | 2.40 |
| --gradient_accumulation_steps 4 --tf32 0 --fp16 | 164.48 | 71 | 2.40 |
| --gradient_accumulation_steps 4 --tf32 0 --bf16 | 170.01 | 77 | 2.42 |
| --gradient_accumulation_steps 8 --tf32 0 | 111.14 | 16 | 2.57 |
| --gradient_accumulation_steps 8 --tf32 1 | 141.59 | 47 | 2.57 |
| --gradient_accumulation_steps 8 --tf32 0 --fp16 | 170.77 | 78 | 2.57 |
| --gradient_accumulation_steps 8 --tf32 0 --bf16 | 177.59 | 85 | 2.62 |
| --gradient_accumulation_steps 16 --tf32 0 | 112.65 | 17 | 2.81 |
| --gradient_accumulation_steps 16 --tf32 1 | 143.89 | 50 | 2.81 |
| --gradient_accumulation_steps 16 --tf32 0 --fp16 | 173.69 | 81 | 2.81 |
| --gradient_accumulation_steps 16 --tf32 0 --bf16 | 181.04 | 88 | 2.86 |

Let’s filter out just one subset so that it’s easier to compare the gradient accumulation differences on their own, re-running with only bf16 enabled (--tf32 0 --bf16); a reconstruction of that command follows the table:

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --gradient_accumulation_steps 1 | 135.85 | 0 | 2.21 |
| --gradient_accumulation_steps 2 | 156.95 | 16 | 2.29 |
| --gradient_accumulation_steps 4 | 167.65 | 23 | 2.42 |
| --gradient_accumulation_steps 8 | 175.02 | 29 | 2.62 |
| --gradient_accumulation_steps 16 | 179.15 | 32 | 2.86 |
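
The filtered numbers above came from a re-run along these lines (a reconstruction rather than the exact command from the thread: the same base command with --tf32 0 --bf16 folded in, varying only the accumulation steps):

# reconstruction: fixed args as in the full command below, plus --tf32 0 --bf16
CUDA_VISIBLE_DEVICES=0 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' examples/pytorch/translation/run_translation.py --model_name_or_path t5-base \
--output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \
--save_strategy no --per_device_train_batch_size 16 --max_source_length 512 \
--max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 10000 --dataloader_num_workers 2 --tf32 0 --bf16 ' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'--gradient_accumulation_steps 1|--gradient_accumulation_steps 2|--gradient_accumulation_steps 4|--gradient_accumulation_steps 8|--gradient_accumulation_steps 16' \
--report-metric-keys train_loss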

Conclusions:

  • that’s a significant speed-up even at just 4 accumulation steps
  • notice that the loss gets noticeably worse at higher accumulation steps. This benchmark is very short, and with larger effective batches there are fewer optimizer steps to take, so the model simply doesn’t get a chance to step down far enough; the same effect can be observed with plain batch-size changes. The non-zero learning-rate warm-up also plays a role, since it’s such a short run (see the quick calculation below).
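
To put numbers on the warm-up point, using the settings from the command below: --max_train_samples 10000 with --per_device_train_batch_size 16 gives 10000 / 16 = 625 micro-batches per epoch, so --gradient_accumulation_steps 16 leaves only about 625 / 16 ≈ 39 optimizer steps, fewer than the 50 --warmup_steps, whereas --gradient_accumulation_steps 1 gets the full 625 steps to drive the loss down.
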
*** Setup:


Datetime    : 2022-01-03 14:53:02

Software:
transformers: 4.16.0.dev0
torch       : 1.10.1
cuda        : 11.3
python      : 3.8.11

Hardware:
1 GPUs      : NVIDIA GeForce RTX 3090, 23.70GB


*** The benchmark command line was:

CUDA_VISIBLE_DEVICES=0 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' examples/pytorch/translation/run_translation.py --model_name_or_path t5-base \
--output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \
--save_strategy no --per_device_train_batch_size 16 --max_source_length 512 \
--max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 10000 --dataloader_num_workers 2 ' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'--gradient_accumulation_steps 1|--gradient_accumulation_steps 2|--gradient_accumulation_steps 4|--gradient_accumulation_steps 8|--gradient_accumulation_steps 16' \
'--tf32 0|--tf32 1|--tf32 0 --fp16|--tf32 0 --bf16' --report-metric-keys \
train_loss
