Can I replicate single-node training time using two 4-GPU nodes?

See original GitHub issue

In the paper “Scaling Neural Machine Translation”, it is stated that “we match the accuracy of Vaswani et al. (2017) (i.e. 26.5) in under 5 hours when training on 8 GPUs”.

Is it possible to get the same result with 2 nodes, where each node has 4 NVIDIA V100 GPUs, using the following distributed commands run on the two nodes?

System 1:

$ python -m torch.distributed.launch --nproc_per_node=8 \
    --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
    --master_port=1234 \
    train.py data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.0005 --min-lr 1e-09 \
    --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3584 \
    --fp16 

System 2:

$ python -m torch.distributed.launch --nproc_per_node=8 \
    --nnodes=2 --node_rank=1 --master_addr="192.168.1.1" \
    --master_port=1234 \
    train.py data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.0005 --min-lr 1e-09 \
    --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3584 \
    --fp16 
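
Since the question turns on how many processes to launch per node, note that --nproc_per_node is conventionally set to the number of GPUs available on that machine. A quick way to confirm that count on each node (a generic PyTorch one-liner, not taken from the original question) is:

$ python -c "import torch; print(torch.cuda.device_count())"
4

On the 4 x V100 nodes described above this should print 4.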

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
myleott commented, Feb 6, 2019

bsz is actually --max-tokens, and cumul is --update-freq.

--max-tokens corresponds to the number of tokens per GPU. --update-freq will accumulate gradients from multiple batches before each update. So the effective batch size in tokens is: max_tokens * update_freq * # gpus. The actual value is logged in the training log under wpb (words-per-batch).

In the case of 32GB cards, you can probably increase --max-tokens and decrease --update-freq to speed up training. Just try to keep wpb between 350-400k.
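
A quick back-of-the-envelope check of that formula, using only numbers mentioned in this thread (8 GPUs in total across the two nodes, and fairseq's default --update-freq of 1 for the command in the question):

# effective tokens per update = max_tokens * update_freq * num_gpus
$ echo $((3584 * 1 * 8))      # the command from the question
28672
$ echo $((5120 * 10 * 8))     # the --max-tokens=5120 --update-freq=10 setting suggested in the next comment
409600

So the command in the question accumulates roughly 29k tokens per update, while the faster configuration discussed in the next comment lands right around the 350-400k wpb range mentioned above.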

1 reaction
myleott commented, Feb 6, 2019

A few things:

  • The command you listed is for the baseline result, which is Row 3 of Table 1: http://aclweb.org/anthology/W18-6301. The listed time for that result is 495 minutes ~= 8.25 hours.
  • You can improve the speed by setting --max-tokens=5120 --update-freq=10 --lr=0.001. This will match Row 6 from the same paper (294 minutes ~= 5 hours). A combined launch command is sketched after this list.
  • Note that the estimates for minutes are based on the time it takes to reach 2.11 valid_nll_loss. Training won’t automatically stop at that loss; you either have to kill it manually when it reaches that loss, or you can continue training (eventually the loss will reach ~2.06 or 2.07).
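
For reference, here is one way the two commands from the question could be combined with these suggestions. This is only a sketch under the assumptions in this thread, not a verified configuration: it keeps the arguments from the question, swaps in the --max-tokens/--update-freq/--lr values from this comment, and sets --nproc_per_node to the 4 GPUs actually present on each node (the typo pointed out further down this page). It also assumes each V100 has enough memory for --max-tokens 5120; on smaller cards, a lower --max-tokens with a higher --update-freq would be needed to keep wpb in the same range. Node 0 would run:

$ python -m torch.distributed.launch --nproc_per_node=4 \
    --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
    --master_port=1234 \
    train.py data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.001 --min-lr 1e-09 \
    --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 5120 --update-freq 10 \
    --fp16

Node 1 would run the same command with --node_rank=1.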

Top Results From Across the Web

  • Can I replicate single-node training time using two 4-GPU nodes? (this issue): "Yes, that should be fine. But you have a typo, --nproc_per_node should be 4."
  • Efficient Training on Multiple GPUs - Hugging Face: "Special considerations: TP requires very fast network, and therefore it's not advisable to do TP across more than one node."
  • Multi node PyTorch Distributed Training Guide For People In A ...: "The goal of this tutorial is to give a summary of how to write and launch PyTorch distributed data parallel jobs across multiple ..."
  • Multi-GPU and distributed training - Keras: "In this setup, you have one machine with several GPUs on it (typically 2 to 8). Each device will run a copy of ..."
  • Scalable multi-node deep learning training using GPUs in the ...: "In this blog post we demonstrate how to optimize AWS infrastructure to further minimize deep learning training times by using distributed/multi- ..."
