Can I replicate single-node training time using two 4-GPU nodes?
In the paper “Scaling Neural Machine Translation”, it is stated that “we match the accuracy of Vaswani et al. (2017) in under 5 hours when training on 8 GPUs” (i.e. 26.5 BLEU).
Is it possible to get the same result with 2 nodes, each with 4 NVIDIA V100 GPUs, by running the following distributed commands, one on each node?
System 1:
$ python -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
--master_port=1234 \
train.py data-bin/wmt16_en_de_bpe32k \
--arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0005 --min-lr 1e-09 \
--dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 3584 \
--fp16
System 2:
$ python -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=2 --node_rank=1 --master_addr="192.168.1.1" \
--master_port=1234 \
train.py data-bin/wmt16_en_de_bpe32k \
--arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0005 --min-lr 1e-09 \
--dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 3584 \
--fp16
Issue Analytics
- Created: 5 years ago
- Comments: 8 (4 by maintainers)
Top GitHub Comments
Yes, that should be fine. But you have a typo: with 4 GPUs per node, --nproc_per_node should be 4, not 8.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
bsz is --max-tokens actually, and cumul is --update-freq. --max-tokens corresponds to the number of tokens per GPU. --update-freq will accumulate gradients from multiple batches before each update. So the effective batch size in tokens is max_tokens * update_freq * #gpus. The actual value is logged in the training log under wpb (words per batch).
In the case of 32GB cards, you can probably increase --max-tokens and decrease --update-freq to speed up training. Just try to keep wpb between 350-400k.
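To put numbers on that formula (an illustrative calculation, not from the thread): the commands in the question set --max-tokens 3584 and leave --update-freq at its default of 1, so the effective batch size is 3584 * 1 * 8 = 28,672 tokens per update, well below the 350-400k wpb range mentioned above; the settings suggested in the next comment are what push wpb into that range.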
A few things: […] --max-tokens=5120 --update-freq=10 --lr=0.001. This will match Row 6 from the same paper (294 minutes ~= 5 hours).
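Checking those numbers against the formula from the previous comment: 5120 * 10 * 8 = 409,600 tokens per update, i.e. right around the 350-400k wpb target. On the two-node setup this corresponds to the launch commands sketched earlier with --lr 0.001, --max-tokens 5120, and --update-freq 10 swapped in, on the assumption (stated in the comment above) that the cards are 32GB V100s and can fit 5120 tokens per GPU.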