Can I replicate single-node training time using two 4-GPU nodes?
In the paper “Scaling Neural Machine Translation”, it is stated that “we match the accuracy of Vaswani et al. (2017) in under 5 hours when training on 8 GPUs” (i.e. 26.5 BLEU).
Is it possible to get the same result with 2 nodes, each with 4 NVIDIA V100 GPUs, by running the following distributed commands, one on each node?
System 1:
$ python -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
--master_port=1234 \
train.py data-bin/wmt16_en_de_bpe32k \
--arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0005 --min-lr 1e-09 \
--dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 3584 \
--fp16
System 2:
$ python -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=2 --node_rank=1 --master_addr="192.168.1.1" \
--master_port=1234 \
train.py data-bin/wmt16_en_de_bpe32k \
--arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0005 --min-lr 1e-09 \
--dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 3584 \
--fp16
Issue Analytics
- Created: 5 years ago
- Comments: 8 (4 by maintainers)
Top GitHub Comments
Yes, that should be fine. But you have a typo: with 4 GPUs per node, --nproc_per_node should be 4, not 8.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
bsz is --max-tokens actually, and cumul is --update-freq. --max-tokens corresponds to the number of tokens per GPU. --update-freq will accumulate gradients from multiple batches before each update. So the effective batch size in tokens is max_tokens * update_freq * #gpus. The actual value is logged in the training log under wpb (words per batch).
In the case of 32GB cards, you can probably increase --max-tokens and decrease --update-freq to speed up training. Just try to keep wpb between 350-400k.
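To put numbers on that formula (an illustrative calculation, not from the thread): the commands in the question set --max-tokens 3584 and leave --update-freq at its default of 1, so the effective batch size is 3584 * 1 * 8 = 28,672 tokens per update, well below the 350-400k wpb range mentioned above; the settings suggested in the next comment are what push wpb into that range.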
A few things: […] --max-tokens=5120 --update-freq=10 --lr=0.001. This will match Row 6 from the same paper (294 minutes ~= 5 hours).
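Checking those numbers against the formula from the previous comment: 5120 * 10 * 8 = 409,600 tokens per update, i.e. right around the 350-400k wpb target. On the two-node setup this corresponds to the launch commands sketched earlier with --lr 0.001, --max-tokens 5120, and --update-freq 10 swapped in, on the assumption (stated in the comment above) that the cards are 32GB V100s and can fit 5120 tokens per GPU.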