Is my training routine normal?
Hello, I'm a newcomer to NLP. I installed CUDA, cuDNN, NCCL, and PyTorch myself, but I don't know whether my training process is normal. Here is my training log:
| [src] dictionary: 40000 types
| [tgt] dictionary: 50000 types
| ./data/train_data train 10000000 examples
| ./data/train_data valid 3000 examples
| model transformer_wmt_en_de, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 115818496
| training on 4 GPUs
| max tokens per GPU = 2048 and max sentences per GPU = 2000
| WARNING: overflow detected, setting loss scale to: 64.0
| WARNING: overflow detected, setting loss scale to: 32.0
| WARNING: overflow detected, setting loss scale to: 16.0
| WARNING: overflow detected, setting loss scale to: 8.0
| epoch 001: 50 / 9199 loss=15.519, nll_loss=15.428, ppl=44083.35, wps=12245, ups=0.6, wpb=21134, bsz=1042, num_updates=47, lr=5.97383e-06, gnorm=4.899, clip=0%, oom=0, loss_scale=8.000, wall=81, train_wall=15
| epoch 001: 100 / 9199 loss=14.736, nll_loss=14.554, ppl=24057.37, wps=21332, ups=1.0, wpb=21317, bsz=1078, num_updates=97, lr=1.22226e-05, gnorm=3.181, clip=0%, oom=0, loss_scale=8.000, wall=97, train_wall=29
| epoch 001: 150 / 9199 loss=14.285, nll_loss=14.053, ppl=17000.59, wps=27956, ups=1.3, wpb=21369, bsz=1083, num_updates=147, lr=1.84713e-05, gnorm=2.565, clip=0%, oom=0, loss_scale=8.000, wall=112, train_wall=43
| epoch 001: 200 / 9199 loss=13.904, nll_loss=13.630, ppl=12680.30, wps=32875, ups=1.5, wpb=21439, bsz=1085, num_updates=197, lr=2.47201e-05, gnorm=2.289, clip=0%, oom=0, loss_scale=8.000, wall=128, train_wall=57
| epoch 001: 250 / 9199 loss=13.551, nll_loss=13.238, ppl=9658.64, wps=36651, ups=1.7, wpb=21473, bsz=1090, num_updates=247, lr=3.09688e-05, gnorm=2.186, clip=0%, oom=0, loss_scale=8.000, wall=145, train_wall=71
| epoch 001: 300 / 9199 loss=13.227, nll_loss=12.872, ppl=7495.85, wps=39749, ups=1.8, wpb=21536, bsz=1093, num_updates=297, lr=3.72176e-05, gnorm=2.070, clip=0%, oom=0, loss_scale=8.000, wall=161, train_wall=86
| epoch 001: 350 / 9199 loss=12.948, nll_loss=12.554, ppl=6015.14, wps=42313, ups=2.0, wpb=21539, bsz=1097, num_updates=347, lr=4.34663e-05, gnorm=1.959, clip=0%, oom=0, loss_scale=8.000, wall=177, train_wall=100
| epoch 001: 400 / 9199 loss=12.706, nll_loss=12.276, ppl=4960.26, wps=44491, ups=2.1, wpb=21546, bsz=1100, num_updates=397, lr=4.97151e-05, gnorm=1.907, clip=0%, oom=0, loss_scale=8.000, wall=192, train_wall=114
| epoch 001: 450 / 9199 loss=12.507, nll_loss=12.045, ppl=4226.99, wps=46169, ups=2.1, wpb=21487, bsz=1093, num_updates=447, lr=5.59638e-05, gnorm=1.836, clip=0%, oom=0, loss_scale=8.000, wall=208, train_wall=128
| epoch 001: 500 / 9199 loss=12.322, nll_loss=11.830, ppl=3640.19, wps=47629, ups=2.2, wpb=21497, bsz=1095, num_updates=497, lr=6.22126e-05, gnorm=1.774, clip=0%, oom=0, loss_scale=8.000, wall=224, train_wall=143
| epoch 001: 550 / 9199 loss=12.156, nll_loss=11.637, ppl=3185.81, wps=48939, ups=2.3, wpb=21487, bsz=1093, num_updates=547, lr=6.84613e-05, gnorm=1.741, clip=0%, oom=0, loss_scale=8.000, wall=240, train_wall=157
| epoch 001: 600 / 9199 loss=12.012, nll_loss=11.469, ppl=2835.67, wps=49972, ups=2.3, wpb=21460, bsz=1090, num_updates=597, lr=7.47101e-05, gnorm=1.707, clip=0%, oom=0, loss_scale=8.000, wall=256, train_wall=171
| epoch 001: 650 / 9199 loss=11.878, nll_loss=11.313, ppl=2543.67, wps=50990, ups=2.4, wpb=21440, bsz=1087, num_updates=647, lr=8.09588e-05, gnorm=1.670, clip=0%, oom=0, loss_scale=8.000, wall=272, train_wall=185
| epoch 001: 700 / 9199 loss=11.756, nll_loss=11.170, ppl=2304.89, wps=51943, ups=2.4, wpb=21428, bsz=1084, num_updates=697, lr=8.72076e-05, gnorm=1.646, clip=0%, oom=0, loss_scale=8.000, wall=288, train_wall=199
| epoch 001: 750 / 9199 loss=11.642, nll_loss=11.038, ppl=2102.54, wps=52763, ups=2.5, wpb=21417, bsz=1081, num_updates=747, lr=9.34563e-05, gnorm=1.623, clip=0%, oom=0, loss_scale=8.000, wall=303, train_wall=214
| epoch 001: 800 / 9199 loss=11.535, nll_loss=10.913, ppl=1928.65, wps=53370, ups=2.5, wpb=21403, bsz=1081, num_updates=797, lr=9.97051e-05, gnorm=1.607, clip=0%, oom=0, loss_scale=8.000, wall=320, train_wall=229
| epoch 001: 850 / 9199 loss=11.434, nll_loss=10.795, ppl=1776.92, wps=54017, ups=2.5, wpb=21407, bsz=1084, num_updates=847, lr=0.000105954, gnorm=1.597, clip=0%, oom=0, loss_scale=8.000, wall=336, train_wall=243
| epoch 001: 900 / 9199 loss=11.339, nll_loss=10.685, ppl=1646.52, wps=54628, ups=2.6, wpb=21397, bsz=1082, num_updates=897, lr=0.000112203, gnorm=1.575, clip=0%, oom=0, loss_scale=8.000, wall=351, train_wall=257
| epoch 001: 950 / 9199 loss=11.249, nll_loss=10.579, ppl=1530.19, wps=55213, ups=2.6, wpb=21403, bsz=1083, num_updates=947, lr=0.000118451, gnorm=1.560, clip=0%, oom=0, loss_scale=8.000, wall=367, train_wall=271
| epoch 001: 1000 / 9199 loss=11.161, nll_loss=10.477, ppl=1425.51, wps=55661, ups=2.6, wpb=21414, bsz=1083, num_updates=997, lr=0.0001247, gnorm=1.538, clip=0%, oom=0, loss_scale=8.000, wall=384, train_wall=286
| epoch 001 | valid on 'valid' subset | valid_loss 8.36527 | valid_nll_loss 7.22753 | valid_ppl 149.87 | num_updates 1000
| epoch 001: 1050 / 9199 loss=11.080, nll_loss=10.384, ppl=1335.81, wps=55444, ups=2.6, wpb=21405, bsz=1079, num_updates=1047, lr=0.000130949, gnorm=1.521, clip=0%, oom=0, loss_scale=8.000, wall=404, train_wall=300
| epoch 001: 1100 / 9199 loss=11.001, nll_loss=10.291, ppl=1252.64, wps=55914, ups=2.6, wpb=21418, bsz=1079, num_updates=1097, lr=0.000137198, gnorm=1.499, clip=0%, oom=0, loss_scale=8.000, wall=420, train_wall=315
| epoch 001: 1150 / 9199 loss=10.923, nll_loss=10.201, ppl=1176.73, wps=56419, ups=2.6, wpb=21443, bsz=1080, num_updates=1147, lr=0.000143446, gnorm=1.484, clip=0%, oom=0, loss_scale=8.000, wall=436, train_wall=329
I found that the wps (words per second) value settles around 50K. Is this normal? I am worried that a problem with my cuDNN or NCCL installation is slowing down training.
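If you want to rule out a broken install before worrying about throughput, a quick sanity check from Python is enough. The following is a minimal sketch using only standard PyTorch introspection calls; nothing in it is specific to fairseq or to this issue, and the printed values will simply be whatever your local install reports:

```python
# Sanity-check the CUDA / cuDNN / NCCL stack that PyTorch actually sees.
import torch
import torch.distributed as dist

print("PyTorch version:      ", torch.__version__)
print("CUDA available:       ", torch.cuda.is_available())
print("CUDA version (torch): ", torch.version.cuda)
print("cuDNN enabled:        ", torch.backends.cudnn.enabled)
print("cuDNN version:        ", torch.backends.cudnn.version())
print("NCCL available:       ", dist.is_nccl_available())
print("GPU count:            ", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}:", torch.cuda.get_device_name(i))
```

If all of these report the versions you expect and every GPU is listed, the installation itself is most likely fine and the question is only about speed.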
Issue Analytics
- State:
- Created 4 years ago
- Comments: 8 (5 by maintainers)
Top GitHub Comments
The GPUs are Tesla V100s, PyTorch is 1.1.0, NCCL is 2.x, CUDA is 10.0, and cuDNN is 7.5.x. All 4 GPUs are on the same machine.
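Since the worry was specifically about NCCL, a single-node all-reduce micro-benchmark can help rule out inter-GPU communication as the bottleneck. This is only a rough sketch, assuming PyTorch's torch.distributed NCCL backend with all GPUs visible on one machine as described above; the port and tensor size are arbitrary choices, and the reported number is an effective rate rather than an exact bus bandwidth:

```python
# Rough single-node NCCL all-reduce throughput check (one process per GPU).
import os
import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank, world_size):
    # Rendezvous over localhost; 29500 is an arbitrary free port.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    tensor = torch.ones(64 * 1024 * 1024, device="cuda")  # 256 MB of fp32

    # Warm up NCCL, then time a fixed number of all-reduces.
    for _ in range(5):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    iters = 20
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = time.time() - start

    if rank == 0:
        gb = tensor.numel() * tensor.element_size() * iters / 1e9
        print(f"all_reduce: {gb / elapsed:.1f} GB/s effective (rank 0)")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```

If this runs without errors and reports a rate in the tens of GB/s on V100s connected over NVLink or PCIe, NCCL is working and the observed wps is unlikely to be limited by a broken communication setup.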
Thank you very much for your advice. Following your suggestion, my problem has been completely solved. Thanks again!