
Is my training routine normal?


Hello, I’m a newcomer to NLP. I installed CUDA, cuDNN, NCCL, and PyTorch myself, but I don’t know whether my training process is normal. Here is my training log:

| [src] dictionary: 40000 types
| [tgt] dictionary: 50000 types
| ./data/train_data train 10000000 examples
| ./data/train_data valid 3000 examples
| model transformer_wmt_en_de, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 115818496
| training on 4 GPUs
| max tokens per GPU = 2048 and max sentences per GPU = 2000
| WARNING: overflow detected, setting loss scale to: 64.0
| WARNING: overflow detected, setting loss scale to: 32.0
| WARNING: overflow detected, setting loss scale to: 16.0
| WARNING: overflow detected, setting loss scale to: 8.0
| epoch 001:     50 / 9199 loss=15.519, nll_loss=15.428, ppl=44083.35, wps=12245, ups=0.6, wpb=21134, bsz=1042, num_updates=47, lr=5.97383e-06, gnorm=4.899, clip=0%, oom=0, loss_scale=8.000, wall=81, train_wall=15
| epoch 001:    100 / 9199 loss=14.736, nll_loss=14.554, ppl=24057.37, wps=21332, ups=1.0, wpb=21317, bsz=1078, num_updates=97, lr=1.22226e-05, gnorm=3.181, clip=0%, oom=0, loss_scale=8.000, wall=97, train_wall=29
| epoch 001:    150 / 9199 loss=14.285, nll_loss=14.053, ppl=17000.59, wps=27956, ups=1.3, wpb=21369, bsz=1083, num_updates=147, lr=1.84713e-05, gnorm=2.565, clip=0%, oom=0, loss_scale=8.000, wall=112, train_wall=43
| epoch 001:    200 / 9199 loss=13.904, nll_loss=13.630, ppl=12680.30, wps=32875, ups=1.5, wpb=21439, bsz=1085, num_updates=197, lr=2.47201e-05, gnorm=2.289, clip=0%, oom=0, loss_scale=8.000, wall=128, train_wall=57
| epoch 001:    250 / 9199 loss=13.551, nll_loss=13.238, ppl=9658.64, wps=36651, ups=1.7, wpb=21473, bsz=1090, num_updates=247, lr=3.09688e-05, gnorm=2.186, clip=0%, oom=0, loss_scale=8.000, wall=145, train_wall=71
| epoch 001:    300 / 9199 loss=13.227, nll_loss=12.872, ppl=7495.85, wps=39749, ups=1.8, wpb=21536, bsz=1093, num_updates=297, lr=3.72176e-05, gnorm=2.070, clip=0%, oom=0, loss_scale=8.000, wall=161, train_wall=86
| epoch 001:    350 / 9199 loss=12.948, nll_loss=12.554, ppl=6015.14, wps=42313, ups=2.0, wpb=21539, bsz=1097, num_updates=347, lr=4.34663e-05, gnorm=1.959, clip=0%, oom=0, loss_scale=8.000, wall=177, train_wall=100
| epoch 001:    400 / 9199 loss=12.706, nll_loss=12.276, ppl=4960.26, wps=44491, ups=2.1, wpb=21546, bsz=1100, num_updates=397, lr=4.97151e-05, gnorm=1.907, clip=0%, oom=0, loss_scale=8.000, wall=192, train_wall=114
| epoch 001:    450 / 9199 loss=12.507, nll_loss=12.045, ppl=4226.99, wps=46169, ups=2.1, wpb=21487, bsz=1093, num_updates=447, lr=5.59638e-05, gnorm=1.836, clip=0%, oom=0, loss_scale=8.000, wall=208, train_wall=128
| epoch 001:    500 / 9199 loss=12.322, nll_loss=11.830, ppl=3640.19, wps=47629, ups=2.2, wpb=21497, bsz=1095, num_updates=497, lr=6.22126e-05, gnorm=1.774, clip=0%, oom=0, loss_scale=8.000, wall=224, train_wall=143
| epoch 001:    550 / 9199 loss=12.156, nll_loss=11.637, ppl=3185.81, wps=48939, ups=2.3, wpb=21487, bsz=1093, num_updates=547, lr=6.84613e-05, gnorm=1.741, clip=0%, oom=0, loss_scale=8.000, wall=240, train_wall=157
| epoch 001:    600 / 9199 loss=12.012, nll_loss=11.469, ppl=2835.67, wps=49972, ups=2.3, wpb=21460, bsz=1090, num_updates=597, lr=7.47101e-05, gnorm=1.707, clip=0%, oom=0, loss_scale=8.000, wall=256, train_wall=171
| epoch 001:    650 / 9199 loss=11.878, nll_loss=11.313, ppl=2543.67, wps=50990, ups=2.4, wpb=21440, bsz=1087, num_updates=647, lr=8.09588e-05, gnorm=1.670, clip=0%, oom=0, loss_scale=8.000, wall=272, train_wall=185
| epoch 001:    700 / 9199 loss=11.756, nll_loss=11.170, ppl=2304.89, wps=51943, ups=2.4, wpb=21428, bsz=1084, num_updates=697, lr=8.72076e-05, gnorm=1.646, clip=0%, oom=0, loss_scale=8.000, wall=288, train_wall=199
| epoch 001:    750 / 9199 loss=11.642, nll_loss=11.038, ppl=2102.54, wps=52763, ups=2.5, wpb=21417, bsz=1081, num_updates=747, lr=9.34563e-05, gnorm=1.623, clip=0%, oom=0, loss_scale=8.000, wall=303, train_wall=214
| epoch 001:    800 / 9199 loss=11.535, nll_loss=10.913, ppl=1928.65, wps=53370, ups=2.5, wpb=21403, bsz=1081, num_updates=797, lr=9.97051e-05, gnorm=1.607, clip=0%, oom=0, loss_scale=8.000, wall=320, train_wall=229
| epoch 001:    850 / 9199 loss=11.434, nll_loss=10.795, ppl=1776.92, wps=54017, ups=2.5, wpb=21407, bsz=1084, num_updates=847, lr=0.000105954, gnorm=1.597, clip=0%, oom=0, loss_scale=8.000, wall=336, train_wall=243
| epoch 001:    900 / 9199 loss=11.339, nll_loss=10.685, ppl=1646.52, wps=54628, ups=2.6, wpb=21397, bsz=1082, num_updates=897, lr=0.000112203, gnorm=1.575, clip=0%, oom=0, loss_scale=8.000, wall=351, train_wall=257
| epoch 001:    950 / 9199 loss=11.249, nll_loss=10.579, ppl=1530.19, wps=55213, ups=2.6, wpb=21403, bsz=1083, num_updates=947, lr=0.000118451, gnorm=1.560, clip=0%, oom=0, loss_scale=8.000, wall=367, train_wall=271
| epoch 001:   1000 / 9199 loss=11.161, nll_loss=10.477, ppl=1425.51, wps=55661, ups=2.6, wpb=21414, bsz=1083, num_updates=997, lr=0.0001247, gnorm=1.538, clip=0%, oom=0, loss_scale=8.000, wall=384, train_wall=286
| epoch 001 | valid on 'valid' subset | valid_loss 8.36527 | valid_nll_loss 7.22753 | valid_ppl 149.87 | num_updates 1000
| epoch 001:   1050 / 9199 loss=11.080, nll_loss=10.384, ppl=1335.81, wps=55444, ups=2.6, wpb=21405, bsz=1079, num_updates=1047, lr=0.000130949, gnorm=1.521, clip=0%, oom=0, loss_scale=8.000, wall=404, train_wall=300
| epoch 001:   1100 / 9199 loss=11.001, nll_loss=10.291, ppl=1252.64, wps=55914, ups=2.6, wpb=21418, bsz=1079, num_updates=1097, lr=0.000137198, gnorm=1.499, clip=0%, oom=0, loss_scale=8.000, wall=420, train_wall=315
| epoch 001:   1150 / 9199 loss=10.923, nll_loss=10.201, ppl=1176.73, wps=56419, ups=2.6, wpb=21443, bsz=1080, num_updates=1147, lr=0.000143446, gnorm=1.484, clip=0%, oom=0, loss_scale=8.000, wall=436, train_wall=329

I noticed that the wps (words per second) value is around 50K. Is this normal? I am worried that a problem with my cuDNN or NCCL installation is causing slow training.
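When the worry is a broken cuDNN or NCCL install, one quick sanity check is to ask PyTorch itself what it sees. The sketch below only uses standard PyTorch introspection calls and assumes PyTorch is installed; the try/except keeps it usable on a machine where it is not.

```python
# Print the versions of the deep-learning stack that PyTorch was built
# against, to rule out a mismatched cuDNN/NCCL installation.
def collect_env_info():
    info = {}
    try:
        import torch
        info["torch"] = torch.__version__
        info["cuda_available"] = torch.cuda.is_available()
        # cudnn.version() returns None on builds without cuDNN support.
        info["cudnn"] = torch.backends.cudnn.version()
        if torch.cuda.is_available():
            info["cuda"] = torch.version.cuda
            info["nccl"] = torch.cuda.nccl.version()
            info["gpus"] = [torch.cuda.get_device_name(i)
                            for i in range(torch.cuda.device_count())]
    except ImportError:
        info["torch"] = None  # PyTorch not installed in this environment
    return info

if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

If `cudnn` comes back `None` or `nccl` reports an unexpected version, the PyTorch build is not using the libraries you installed, which would explain a speed gap.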

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
jiezhangGt commented, Jul 31, 2019

What kind of GPU? I see you are using 4, are they all on the same machine? What version of PyTorch, NCCL, cuDNN, etc?

The GPUs are Tesla V100s, PyTorch is 1.1.0, NCCL is 2.x, CUDA is 10.0, and cuDNN is 7.5.x. All 4 GPUs are on the same machine.

0 reactions
jiezhangGt commented, Aug 6, 2019

Your latest comment uses a different architecture than your first comment: you changed from transformer_wmt_en_de to transformer_wmt_en_de_big. Different architectures train at different speeds, so it will help to keep that consistent when comparing numbers.
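For an apples-to-apples comparison, the architecture can be pinned on the command line. This is a sketch of a fairseq training invocation; the exact flag set depends on your fairseq version, and the data path and hyperparameters here are placeholders, not taken from the issue.

```shell
# Sketch: keep --arch fixed across runs so wps numbers are comparable.
# ./data/train_data is the binarized data directory from the log above.
python train.py ./data/train_data \
    --arch transformer_wmt_en_de \
    --criterion label_smoothed_cross_entropy \
    --max-tokens 2048 \
    --fp16
```

Switching `--arch` to `transformer_wmt_en_de_big` roughly doubles the parameter count, so a lower wps on that run is expected rather than a sign of a broken install.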

Another thing I forgot to mention: you should install Apex: https://github.com/NVIDIA/apex/. Make sure to install Apex with the CUDA and C++ extensions. fairseq will pick them up automatically, and they should improve speed by a decent amount.
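For reference, installing Apex with the CUDA and C++ extensions follows the commands from the Apex README; this is setup only, so it assumes a matching CUDA toolkit is already on the machine.

```shell
# Build Apex with its fused CUDA/C++ kernels (required for the speedup,
# since a plain "pip install apex" builds the Python-only fallback).
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir \
    --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```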

Also, can you upgrade to a newer version of fairseq? I see from the directory name that you are on 0.6.0.

Thank you very much for your advice. According to your suggestion, my problem has been solved perfectly. Thank you once again!


