[Benchmark] HF Trainer on RTX-3090
🖥 Benchmarking transformers w/ HF Trainer on RTX-3090
We are going to use a special benchmarking tool that will do all the work for us. https://github.com/huggingface/transformers/pull/14934
This is the index post and specific benchmarks are in their own posts below:
- fp16 vs bf16 vs tf32 vs fp32
- gradient accumulation steps
- gradient checkpointing
- batch size
- optimizers
- combining winning strategies (~2x speed improvement!)
- RTX-3090 vs A100
See also the same benchmarks for A100
TODO:
- other suggestions?
Note that each benchmark was run only once, so multiple runs with averaging would likely give slightly different results. The purpose here is to see rough relative differences, not to produce exact numbers.
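For orientation, the dimensions listed above map onto `TrainingArguments` (or the equivalent command-line flags of the example scripts). Below is a minimal sketch of the knobs each benchmark varies — the values here are placeholders for illustration, not the settings actually used in the benchmarks:

```python
from transformers import TrainingArguments

# Illustrative only: each benchmark varies one of these knobs at a time.
# The values below are placeholders, not the benchmark settings.
args = TrainingArguments(
    output_dir="output_dir",
    per_device_train_batch_size=32,  # "batch size" benchmark
    gradient_accumulation_steps=1,   # "gradient accumulation steps" benchmark
    gradient_checkpointing=False,    # "gradient checkpointing" benchmark
    optim="adamw_hf",                # "optimizers" benchmark (adamw_torch, adafactor, adamw_apex_fused, ...)
    fp16=False,                      # "fp16 vs bf16 vs tf32 vs fp32" benchmark
    bf16=True,                       # bf16/tf32 require an Ampere GPU such as the RTX-3090
    tf32=True,
)
```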
combining winning strategies
Now let’s combine the winning strategies from each individual benchmark above and compare with the baseline:
[results table: samples per second | % | loss — values not preserved in this copy]
Getting an almost 2x improvement in speed!
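As a hedged illustration only — the actual winning settings are whatever each individual post above found fastest, not the placeholder values shown here — combining them amounts to stacking the per-dimension winners into a single configuration:

```python
from transformers import TrainingArguments

# Placeholders: substitute the setting each individual benchmark found fastest.
combined = TrainingArguments(
    output_dir="output_dir",
    bf16=True,                       # winning precision mode (placeholder)
    tf32=True,
    per_device_train_batch_size=64,  # largest batch size that still fits (placeholder)
    gradient_accumulation_steps=1,
    gradient_checkpointing=False,    # checkpointing trades speed for memory, so it is off here
    optim="adamw_apex_fused",        # winning optimizer (placeholder; requires apex)
)
```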
gradient accumulation steps
Let’s choose the `t5-base` model to test with, as it’s pretty large yet doesn’t overflow like t5-large. Let’s measure `--gradient_accumulation_steps` 1, 2, 4, 8 and 16 with different precision configurations.

Results:
[results table: samples per second | % | loss — values not preserved in this copy]
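For context, here is an illustrative PyTorch sketch (not the Trainer’s actual implementation) of what `--gradient_accumulation_steps N` does under the hood: gradients from N micro-batches are accumulated before a single optimizer step, so the effective batch size becomes `per_device_train_batch_size * N` (times the number of GPUs) and the optimizer runs N times less often.

```python
def train_epoch(model, loader, optimizer, accumulation_steps=4):
    """Illustrative gradient accumulation loop for a model that returns an
    object with a .loss attribute (HF-style); not the HF Trainer's code."""
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        loss = model(**batch).loss
        # Scale the loss so the accumulated gradient matches one batch of
        # size per_device_batch * accumulation_steps.
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()      # one optimizer update per N micro-batches
            optimizer.zero_grad()
```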
Let’s filter out just one subset so that it’s easier to compare the gradient accumulation differences alone, re-running with just bf16 enabled (`--tf32 0 --bf16`):

[results table: samples per second | % | loss — values not preserved in this copy]
Conclusions: