
Is my training routine normal?


Hello, I’m a newcomer to NLP. I installed CUDA, cuDNN, NCCL, and PyTorch myself, but I don’t know whether my training process is normal. Here is my training log:

| [src] dictionary: 40000 types
| [tgt] dictionary: 50000 types
| ./data/train_data train 10000000 examples
| ./data/train_data valid 3000 examples
| model transformer_wmt_en_de, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 115818496
| training on 4 GPUs
| max tokens per GPU = 2048 and max sentences per GPU = 2000
| WARNING: overflow detected, setting loss scale to: 64.0
| WARNING: overflow detected, setting loss scale to: 32.0
| WARNING: overflow detected, setting loss scale to: 16.0
| WARNING: overflow detected, setting loss scale to: 8.0
| epoch 001:     50 / 9199 loss=15.519, nll_loss=15.428, ppl=44083.35, wps=12245, ups=0.6, wpb=21134, bsz=1042, num_updates=47, lr=5.97383e-06, gnorm=4.899, clip=0%, oom=0, loss_scale=8.000, wall=81, train_wall=15
| epoch 001:    100 / 9199 loss=14.736, nll_loss=14.554, ppl=24057.37, wps=21332, ups=1.0, wpb=21317, bsz=1078, num_updates=97, lr=1.22226e-05, gnorm=3.181, clip=0%, oom=0, loss_scale=8.000, wall=97, train_wall=29
| epoch 001:    150 / 9199 loss=14.285, nll_loss=14.053, ppl=17000.59, wps=27956, ups=1.3, wpb=21369, bsz=1083, num_updates=147, lr=1.84713e-05, gnorm=2.565, clip=0%, oom=0, loss_scale=8.000, wall=112, train_wall=43
| epoch 001:    200 / 9199 loss=13.904, nll_loss=13.630, ppl=12680.30, wps=32875, ups=1.5, wpb=21439, bsz=1085, num_updates=197, lr=2.47201e-05, gnorm=2.289, clip=0%, oom=0, loss_scale=8.000, wall=128, train_wall=57
| epoch 001:    250 / 9199 loss=13.551, nll_loss=13.238, ppl=9658.64, wps=36651, ups=1.7, wpb=21473, bsz=1090, num_updates=247, lr=3.09688e-05, gnorm=2.186, clip=0%, oom=0, loss_scale=8.000, wall=145, train_wall=71
| epoch 001:    300 / 9199 loss=13.227, nll_loss=12.872, ppl=7495.85, wps=39749, ups=1.8, wpb=21536, bsz=1093, num_updates=297, lr=3.72176e-05, gnorm=2.070, clip=0%, oom=0, loss_scale=8.000, wall=161, train_wall=86
| epoch 001:    350 / 9199 loss=12.948, nll_loss=12.554, ppl=6015.14, wps=42313, ups=2.0, wpb=21539, bsz=1097, num_updates=347, lr=4.34663e-05, gnorm=1.959, clip=0%, oom=0, loss_scale=8.000, wall=177, train_wall=100
| epoch 001:    400 / 9199 loss=12.706, nll_loss=12.276, ppl=4960.26, wps=44491, ups=2.1, wpb=21546, bsz=1100, num_updates=397, lr=4.97151e-05, gnorm=1.907, clip=0%, oom=0, loss_scale=8.000, wall=192, train_wall=114
| epoch 001:    450 / 9199 loss=12.507, nll_loss=12.045, ppl=4226.99, wps=46169, ups=2.1, wpb=21487, bsz=1093, num_updates=447, lr=5.59638e-05, gnorm=1.836, clip=0%, oom=0, loss_scale=8.000, wall=208, train_wall=128
| epoch 001:    500 / 9199 loss=12.322, nll_loss=11.830, ppl=3640.19, wps=47629, ups=2.2, wpb=21497, bsz=1095, num_updates=497, lr=6.22126e-05, gnorm=1.774, clip=0%, oom=0, loss_scale=8.000, wall=224, train_wall=143
| epoch 001:    550 / 9199 loss=12.156, nll_loss=11.637, ppl=3185.81, wps=48939, ups=2.3, wpb=21487, bsz=1093, num_updates=547, lr=6.84613e-05, gnorm=1.741, clip=0%, oom=0, loss_scale=8.000, wall=240, train_wall=157
| epoch 001:    600 / 9199 loss=12.012, nll_loss=11.469, ppl=2835.67, wps=49972, ups=2.3, wpb=21460, bsz=1090, num_updates=597, lr=7.47101e-05, gnorm=1.707, clip=0%, oom=0, loss_scale=8.000, wall=256, train_wall=171
| epoch 001:    650 / 9199 loss=11.878, nll_loss=11.313, ppl=2543.67, wps=50990, ups=2.4, wpb=21440, bsz=1087, num_updates=647, lr=8.09588e-05, gnorm=1.670, clip=0%, oom=0, loss_scale=8.000, wall=272, train_wall=185
| epoch 001:    700 / 9199 loss=11.756, nll_loss=11.170, ppl=2304.89, wps=51943, ups=2.4, wpb=21428, bsz=1084, num_updates=697, lr=8.72076e-05, gnorm=1.646, clip=0%, oom=0, loss_scale=8.000, wall=288, train_wall=199
| epoch 001:    750 / 9199 loss=11.642, nll_loss=11.038, ppl=2102.54, wps=52763, ups=2.5, wpb=21417, bsz=1081, num_updates=747, lr=9.34563e-05, gnorm=1.623, clip=0%, oom=0, loss_scale=8.000, wall=303, train_wall=214
| epoch 001:    800 / 9199 loss=11.535, nll_loss=10.913, ppl=1928.65, wps=53370, ups=2.5, wpb=21403, bsz=1081, num_updates=797, lr=9.97051e-05, gnorm=1.607, clip=0%, oom=0, loss_scale=8.000, wall=320, train_wall=229
| epoch 001:    850 / 9199 loss=11.434, nll_loss=10.795, ppl=1776.92, wps=54017, ups=2.5, wpb=21407, bsz=1084, num_updates=847, lr=0.000105954, gnorm=1.597, clip=0%, oom=0, loss_scale=8.000, wall=336, train_wall=243
| epoch 001:    900 / 9199 loss=11.339, nll_loss=10.685, ppl=1646.52, wps=54628, ups=2.6, wpb=21397, bsz=1082, num_updates=897, lr=0.000112203, gnorm=1.575, clip=0%, oom=0, loss_scale=8.000, wall=351, train_wall=257
| epoch 001:    950 / 9199 loss=11.249, nll_loss=10.579, ppl=1530.19, wps=55213, ups=2.6, wpb=21403, bsz=1083, num_updates=947, lr=0.000118451, gnorm=1.560, clip=0%, oom=0, loss_scale=8.000, wall=367, train_wall=271
| epoch 001:   1000 / 9199 loss=11.161, nll_loss=10.477, ppl=1425.51, wps=55661, ups=2.6, wpb=21414, bsz=1083, num_updates=997, lr=0.0001247, gnorm=1.538, clip=0%, oom=0, loss_scale=8.000, wall=384, train_wall=286
| epoch 001 | valid on 'valid' subset | valid_loss 8.36527 | valid_nll_loss 7.22753 | valid_ppl 149.87 | num_updates 1000
| epoch 001:   1050 / 9199 loss=11.080, nll_loss=10.384, ppl=1335.81, wps=55444, ups=2.6, wpb=21405, bsz=1079, num_updates=1047, lr=0.000130949, gnorm=1.521, clip=0%, oom=0, loss_scale=8.000, wall=404, train_wall=300
| epoch 001:   1100 / 9199 loss=11.001, nll_loss=10.291, ppl=1252.64, wps=55914, ups=2.6, wpb=21418, bsz=1079, num_updates=1097, lr=0.000137198, gnorm=1.499, clip=0%, oom=0, loss_scale=8.000, wall=420, train_wall=315
| epoch 001:   1150 / 9199 loss=10.923, nll_loss=10.201, ppl=1176.73, wps=56419, ups=2.6, wpb=21443, bsz=1080, num_updates=1147, lr=0.000143446, gnorm=1.484, clip=0%, oom=0, loss_scale=8.000, wall=436, train_wall=329

I noticed that the wps (words per second) value is around 50K. Is this normal? I am worried that a problem with my cuDNN or NCCL installation is causing slow training.
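When the worry is a broken cuDNN or NCCL install, one quick sanity check is to ask PyTorch itself what it sees. The sketch below only uses standard PyTorch introspection calls and assumes PyTorch is installed; the try/except keeps it usable on a machine where it is not.

```python
# Print the versions of the deep-learning stack that PyTorch was built
# against, to rule out a mismatched cuDNN/NCCL installation.
def collect_env_info():
    info = {}
    try:
        import torch
        info["torch"] = torch.__version__
        info["cuda_available"] = torch.cuda.is_available()
        # cudnn.version() returns None on builds without cuDNN support.
        info["cudnn"] = torch.backends.cudnn.version()
        if torch.cuda.is_available():
            info["cuda"] = torch.version.cuda
            info["nccl"] = torch.cuda.nccl.version()
            info["gpus"] = [torch.cuda.get_device_name(i)
                            for i in range(torch.cuda.device_count())]
    except ImportError:
        info["torch"] = None  # PyTorch not installed in this environment
    return info

if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

If `cudnn` comes back `None` or `nccl` reports an unexpected version, the PyTorch build is not using the libraries you installed, which would explain a speed gap.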

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
jiezhangGt commented, Jul 31, 2019

What kind of GPU? I see you are using 4, are they all on the same machine? What version of PyTorch, NCCL, cuDNN, etc?

The GPUs are Tesla V100s, PyTorch is 1.1.0, NCCL is 2.x, CUDA is 10.0, and cuDNN is 7.5.x. All 4 GPUs are on the same machine.

0 reactions
jiezhangGt commented, Aug 6, 2019

Your latest comment uses a different architecture than your first comment: you changed from transformer_wmt_en_de to transformer_wmt_en_de_big. Different architectures train at different speeds, so it will help to keep that consistent when comparing numbers.
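For an apples-to-apples comparison, the architecture can be pinned on the command line. This is a sketch of a fairseq training invocation; the exact flag set depends on your fairseq version, and the data path and hyperparameters here are placeholders, not taken from the issue.

```shell
# Sketch: keep --arch fixed across runs so wps numbers are comparable.
# ./data/train_data is the binarized data directory from the log above.
python train.py ./data/train_data \
    --arch transformer_wmt_en_de \
    --criterion label_smoothed_cross_entropy \
    --max-tokens 2048 \
    --fp16
```

Switching `--arch` to `transformer_wmt_en_de_big` roughly doubles the parameter count, so a lower wps on that run is expected rather than a sign of a broken install.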

Another thing I forgot to mention: you should install Apex: https://github.com/NVIDIA/apex/. Make sure to install Apex with the CUDA and C++ extensions. fairseq will pick them up automatically, and they should improve speed by a decent amount.
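For reference, installing Apex with the CUDA and C++ extensions follows the commands from the Apex README; this is setup only, so it assumes a matching CUDA toolkit is already on the machine.

```shell
# Build Apex with its fused CUDA/C++ kernels (required for the speedup,
# since a plain "pip install apex" builds the Python-only fallback).
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir \
    --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```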

Also, can you upgrade to a newer version of fairseq? I see from the directory name that you are on 0.6.0.

Thank you very much for your advice. According to your suggestion, my problem has been solved perfectly. Thank you once again!


