
resnet50 doesn't converge when running example/imagenet/main.py on imagenet dataset with fp16


I want to use example/imagenet/main.py to train a ResNet-50 model on the ImageNet dataset with fp16, but the accuracy does not converge. Without fp16, training reaches the expected top-1 accuracy of 76%.

My command is:

python -m torch.distributed.launch --nproc_per_node=8 main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 /imagenet
  • Python version: 3.6.2

  • PyTorch version: 0.4.1

  • torchvision version: 0.2.1

  • OS: Ubuntu 16.04.3 LTS

  • Nvidia driver version: 390.46

  • CUDA runtime version: 9.0

  • GPU number: 8

  • GPU model: Tesla P100-PCIE

The validation accuracy suddenly drops to 0 after about 7 epochs, and the training accuracy suddenly drops to 0 after about 17 epochs.

I saved the model's gradients (para.grad) after each epoch. At epoch 17 I found that the distribution of the model's parameters (para.data) was still normal, but 84.5% of the gradient values (para.grad) were NaN.
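
For reference, the fraction of NaN gradient values can be measured with a small helper like the one below. This is a hypothetical sketch, not part of main.py, and assumes it is called right after loss.backward():

# Hypothetical helper (not part of main.py) to measure what fraction of
# gradient values are NaN; call it right after loss.backward().
import torch

def nan_grad_fraction(model):
    total, nans = 0, 0
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.detach().float()   # cast so isnan behaves the same for fp16 grads
        total += g.numel()
        nans += torch.isnan(g).sum().item()
    return nans / max(total, 1)

# e.g. print("epoch %d: %.1f%% NaN grads" % (epoch, 100 * nan_grad_fraction(model)))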

The accuracy results are as follows:

epoch  Top1(train)  Top5(train)  Loss(train)  epoch  Top1(val)  Top5(val)  Loss(val)
0 3.166 9.401 6.0748 0 3.054 9.308 12.1711
1 15.428 34.406 4.4438 1 18.206 39.778 4.1692
2 26.628 50.356 3.6012 2 29.108 54.764 3.3741
3 34.069 59.227 3.1205 3 30.938 56.796 3.2538
4 37.787 63.202 2.8991 4 29.46 55.652 3.3615
5 40.33 65.834 2.7536 5 12.982 30.574 5.1418
6 42.476 67.836 2.6325 6 0.428 1.608 8.4904
7 43.851 69.086 2.5574 7 0.1 0.502 8.2962
8 44.888 70.058 2.5005 8 0.1 0.49 15.8809
9 45.692 70.684 2.4588 9 0.1 0.5 83.5319
10 46.378 71.274 2.4261 10 0.104 0.496 184.0083
11 46.66 71.618 2.4065 11 0.1 0.504 210.9373
12 46.938 71.805 2.3928 12 0.1 0.5 585.1285
13 47.039 71.931 2.3873 13 0.1 0.5 2283.96
14 46.974 71.87 2.393 14 0 0.006 1612.295
15 46.667 71.499 2.4104 15 0.002 0.006 7.0508
16 46.273 71.155 2.4337 16 0.002 0.006 1554.635
17 16.3 25.251 5.3414 17 0.1 0.5 8.9353
18 0.096 0.482 6.9067 18 0.1 0.5 7.0235
19 0.095 0.485 6.9068 19 0.1 0.5 6.911
20 0.097 0.488 6.9068 20 0.1 0.5 6.9091
21 0.094 0.491 6.9067 21 0.1 0.5 6.9086
22 0.094 0.487 6.9066 22 0.1 0.5 6.9085
23 0.095 0.478 6.9066 23 0.1 0.5 6.9085
24 0.101 0.491 6.9067 24 0.1 0.5 6.9082
25 0.098 0.487 6.9067 25 0.1 0.5 6.9083
26 0.097 0.483 6.9068 26 0.1 0.5 6.908
27 0.099 0.485 6.9067 27 0.1 0.5 6.9082
28 0.091 0.489 6.9067 28 0.1 0.5 6.9085
29 0.097 0.489 6.9067 29 0.1 0.5 6.9083
30 0.1 0.503 6.9065 30 0.1 0.5 6.908
31 0.1 0.496 6.9063 31 0.1 0.5 6.9078
32 0.098 0.487 6.9063 32 0.1 0.5 6.908
33 0.092 0.472 6.9063 33 0.1 0.5 6.9078
34 0.092 0.469 6.9063 34 0.1 0.5 6.9078
35 0.095 0.461 6.9063 35 0.1 0.5 6.9078
36 0.093 0.463 6.9063 36 0.1 0.5 6.9078
37 0.086 0.459 6.9062 37 0.1 0.5 6.908
38 0.089 0.467 6.9063 38 0.1 0.5 6.9078
39 0.092 0.469 6.9063 39 0.1 0.5 6.908
40 0.092 0.461 6.9063 40 0.1 0.5 6.908
41 0.095 0.459 6.9063 41 0.1 0.5 6.9078
42 0.09 0.46 6.9063 42 0.1 0.5 6.9078
43 0.09 0.461 6.9063 43 0.1 0.5 6.908
44 0.093 0.463 6.9063 44 0.1 0.5 6.908
45 0.094 0.464 6.9063 45 0.1 0.5 6.9078
46 0.09 0.457 6.9063 46 0.1 0.5 6.9078
47 0.091 0.466 6.9063 47 0.1 0.5 6.908
48 0.089 0.465 6.9063 48 0.1 0.5 6.9078
49 0.09 0.449 6.9063 49 0.1 0.5 6.9078
50 0.094 0.46 6.9063 50 0.1 0.5 6.9078
51 0.094 0.464 6.9063 51 0.1 0.5 6.908
52 0.092 0.473 6.9063 52 0.1 0.5 6.9078
53 0.094 0.462 6.9063 53 0.1 0.5 6.9078
54 0.088 0.468 6.9063 54 0.1 0.5 6.9078
55 0.091 0.453 6.9063 55 0.1 0.5 6.9078
56 0.091 0.45 6.9064 56 0.1 0.5 6.9078
57 0.093 0.472 6.9063 57 0.1 0.5 6.9077
58 0.09 0.455 6.9063 58 0.1 0.5 6.9077
59 0.091 0.464 6.9063 59 0.1 0.5 6.9078
60 0.099 0.491 6.9063 60 0.1 0.5 6.9078
61 0.101 0.499 6.9063 61 0.1 0.5 6.908
62 0.1 0.496 6.9063 62 0.1 0.5 6.908
63 0.1 0.487 6.9063 63 0.1 0.5 6.9082
64 0.098 0.483 6.9063 64 0.1 0.5 6.9082
65 0.094 0.461 6.9064 65 0.1 0.5 6.9082
66 0.091 0.467 6.9063 66 0.1 0.5 6.9082
67 0.093 0.466 6.9064 67 0.1 0.5 6.9082
68 0.097 0.471 6.9063 68 0.1 0.5 6.9082
69 0.088 0.461 6.9064 69 0.1 0.5 6.9082
70 0.093 0.459 6.9063 70 0.1 0.5 6.9082
71 0.096 0.473 6.9064 71 0.1 0.5 6.9083
72 0.092 0.471 6.9064 72 0.1 0.5 6.9082
73 0.095 0.464 6.9064 73 0.1 0.5 6.9083
74 0.092 0.464 6.9063 74 0.1 0.5 6.9083
75 0.09 0.462 6.9064 75 0.1 0.5 6.9083
76 0.093 0.467 6.9064 76 0.1 0.5 6.9083
77 0.091 0.467 6.9064 77 0.1 0.5 6.9083
78 0.092 0.455 6.9064 78 0.1 0.5 6.9083
79 0.09 0.459 6.9064 79 0.1 0.5 6.9082
80 0.095 0.493 6.9064 80 0.1 0.5 6.9082
81 0.094 0.486 6.9064 81 0.1 0.5 6.9082
82 0.099 0.487 6.9064 82 0.1 0.5 6.9082
83 0.094 0.498 6.9064 83 0.1 0.5 6.9082
84 0.096 0.492 6.9064 84 0.1 0.5 6.9082
85 0.097 0.487 6.9064 85 0.1 0.5 6.9083
86 0.096 0.492 6.9064 86 0.1 0.5 6.9082
87 0.1 0.493 6.9065 87 0.1 0.5 6.9083
88 0.099 0.482 6.9064 88 0.1 0.5 6.9083
89 0.097 0.498 6.9064 89 0.1 0.5 6.9082

After I got the wrong result, I ran main.py on another server with two V100 GPUs. I only used one of them, because when I tried to use two I hit the problem described in PyTorch issue 11327. But that does not affect my training…

python main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 /imagenet

I then got the same wrong result; it is very similar to the results above.

I want to know whether you have tested main.py on the ImageNet dataset with fp16 and obtained good accuracy, such as the top-1 = 76% reported in the paper Mixed Precision Training.

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
mcarilli commented, Sep 14, 2018

Yes, we have tested full convergence with fp16.

I noticed you aren’t using loss scaling. For mixed precision training, loss scaling is an important step (although it is not necessary for FP32 training). main.py supports static loss scaling, which uses a constant loss scale throughout training. I believe for our convergence runs we used a static loss scale of 128:

python main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 --static-loss-scale 128 /imagenet
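
Conceptually, static loss scaling amounts to something like the following simplified sketch. It is not the actual main.py implementation; it assumes model, criterion, optimizer and a half-precision input batch already exist:

# Simplified sketch of static loss scaling -- not the actual main.py code.
loss_scale = 128.0

output = model(images)
loss = criterion(output, target)

optimizer.zero_grad()
# Multiply the loss by a constant so that small gradient values do not
# underflow to zero in fp16 during backward().
(loss * loss_scale).backward()

# Divide the gradients back down before the optimizer step, so the
# effective step size is unchanged. Real mixed-precision code usually
# does this on an fp32 "master" copy of the weights.
for p in model.parameters():
    if p.grad is not None:
        p.grad.data.div_(loss_scale)

optimizer.step()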

If you wish to try dynamic loss scaling, which automatically adjusts the loss scale whenever it encounters a NaN/inf, you can run the main_fp16_optimizer example instead:

python main_fp16_optimizer.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 --dynamic-loss-scale /imagenet
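
In simplified form, dynamic loss scaling looks roughly like the sketch below. It is not the actual main_fp16_optimizer.py implementation; names such as loader, model, criterion and optimizer are assumed to exist, and the overflow check uses torch.isfinite from newer PyTorch releases:

# Simplified sketch of dynamic loss scaling: halve the scale (and skip the
# update) whenever an inf/NaN gradient appears, and cautiously double it
# again after a long stretch of clean steps.
import torch

scale = 2.0 ** 15
steps_since_overflow = 0

for images, target in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), target)
    (loss * scale).backward()

    overflow = any(
        p.grad is not None and not torch.isfinite(p.grad.float()).all()
        for p in model.parameters()
    )
    if overflow:
        scale /= 2.0                   # back off and skip this update entirely
        steps_since_overflow = 0
        continue

    for p in model.parameters():
        if p.grad is not None:
            p.grad.data.div_(scale)
    optimizer.step()

    steps_since_overflow += 1
    if steps_since_overflow == 2000:   # long stable run: try a larger scale
        scale *= 2.0
        steps_since_overflow = 0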

Also, in general, V100 is better than P100 for mixed precision training. Volta tensor cores take in FP16 data and do the accumulate for gemms and convolutions in FP32. Pascal doesn’t have tensor cores (it only supports FP16 through vectorized instructions) so it is forced to do accumulates in FP16, which is less stable. Edit: I’m told Pytorch on P100 is smart enough to avoid calling cublas/cudnn functions that perform accumulates in FP16. Instead, when operating on FP16 data, it calls cublas/cudnn functions that internally upconvert to FP32 arithmetic. So the numerical stability of these ops on P100 should be roughly equivalent to V100, but they still don’t have tensor cores, so naturally this will not give the same performance as V100.

0 reactions
mcarilli commented, Sep 19, 2018

I wouldn’t worry too much about that. It’s probably not your fault, or the script’s fault. The paper says they used Pytorch, but it doesn’t say which version of cudnn/cublas they were using, which can definitely make a difference. In my opinion it’s better to use loss scaling for safety, and understand the theoretical reasons why loss scaling is helpful.
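
As a standalone illustration of the underflow problem that loss scaling addresses (not tied to the scripts above; torch.finfo is available in recent PyTorch releases):

# fp16 cannot represent values much below ~6e-8, so tiny gradient values
# silently become zero unless they are scaled up first.
import torch

g = torch.tensor(1e-8)                    # a typical very small gradient value
print(g.half())                           # tensor(0., dtype=torch.float16): underflow
print((g * 128).half())                   # representable after scaling by 128
print(torch.finfo(torch.float16).tiny)    # smallest normal fp16 value, ~6.1e-5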

These talks are good resources:

  • Starting at slide 12: http://on-demand.gputechconf.com/gtc/2018/presentation/s8923-training-neural-networks-with-mixed-precision-theory-and-practice.pdf
  • Starting at slide 31: http://on-demand.gputechconf.com/gtc-taiwan/2018/pdf/5-1_Internal Speaker_Michael Carilli_PDF For Sharing.pdf


Top Results From Across the Web

  • Why is my resnet50 model in Keras not converging?
    Try removing that layer.trainable = True for loop. And write it just under the base_model = ResNet50(weights='imagenet', include_top=False, ...
  • Why does my training loss of ResNet50 not converge?
    I use nn.MESLoss() and Adam optimizer with learning rate of 0.0001 to train this network, but I find that the MSE loss does...
  • ImageNet/ResNet-50 Training in 224 Seconds
    The first issue is convergence accuracy degradation with large mini-batch training [1] [2]. The second issue is communication overhead of gradient ...
  • ImageNet Training in PyTorch - NVIDIA Documentation Center
    This implements training of popular model architectures, such as ResNet, AlexNet, and VGG on the ImageNet dataset. This version has been modified to...
  • ImageNet Training in Minutes - UC Berkeley EECS
    the effectiveness on two neural networks: AlexNet and ResNet-50 trained with the ImageNet-1k dataset while preserving the state-.
