resnet50 doesn't converge when running example/imagenet/main.py on the ImageNet dataset with fp16
I want to use example/imagenet/main.py to train a ResNet-50 model on the ImageNet dataset with fp16, but the accuracy does not converge. BTW, training without fp16 gives the expected top1 accuracy of 76%.
My command is:
python -m torch.distributed.launch --nproc_per_node=8 main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 /imagenet
- Python version: 3.6.2
- PyTorch version: 0.4.1
- torchvision version: 0.2.1
- OS: Ubuntu 16.04.3 LTS
- Nvidia driver version: 390.46
- CUDA runtime version: 9.0
- GPU number: 8
- GPU model: Tesla P100-PCIE
The validation accuracy suddenly falls to 0 after about 7 epochs, and the train accuracy suddenly falls to 0 after about 17 epochs.
I saved the model's gradients (para.grad) each epoch. I found that at epoch 17 the distribution of the model's parameters (para.data) is still normal, but 84.5% of the gradient values (para.grad) are NaN.
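(For reference, this is a minimal sketch of the kind of check I ran; the helper name nan_grad_fraction is just for illustration and is not part of main.py.)

```python
import torch

def nan_grad_fraction(model):
    """Return the fraction of gradient elements that are NaN (call after loss.backward())."""
    total, nans = 0, 0
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.detach().float()
        total += g.numel()
        nans += torch.isnan(g).sum().item()
    return nans / max(total, 1)
```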
The accuracy results are as follows:

Epoch | Train Top1 | Train Top5 | Train Loss | Epoch | Val Top1 | Val Top5 | Val Loss
---|---|---|---|---|---|---|---
0 | 3.166 | 9.401 | 6.0748 | 0 | 3.054 | 9.308 | 12.1711 | |
1 | 15.428 | 34.406 | 4.4438 | 1 | 18.206 | 39.778 | 4.1692 | |
2 | 26.628 | 50.356 | 3.6012 | 2 | 29.108 | 54.764 | 3.3741 | |
3 | 34.069 | 59.227 | 3.1205 | 3 | 30.938 | 56.796 | 3.2538 | |
4 | 37.787 | 63.202 | 2.8991 | 4 | 29.46 | 55.652 | 3.3615 | |
5 | 40.33 | 65.834 | 2.7536 | 5 | 12.982 | 30.574 | 5.1418 | |
6 | 42.476 | 67.836 | 2.6325 | 6 | 0.428 | 1.608 | 8.4904 | |
7 | 43.851 | 69.086 | 2.5574 | 7 | 0.1 | 0.502 | 8.2962 | |
8 | 44.888 | 70.058 | 2.5005 | 8 | 0.1 | 0.49 | 15.8809 | |
9 | 45.692 | 70.684 | 2.4588 | 9 | 0.1 | 0.5 | 83.5319 | |
10 | 46.378 | 71.274 | 2.4261 | 10 | 0.104 | 0.496 | 184.0083 | |
11 | 46.66 | 71.618 | 2.4065 | 11 | 0.1 | 0.504 | 210.9373 | |
12 | 46.938 | 71.805 | 2.3928 | 12 | 0.1 | 0.5 | 585.1285 | |
13 | 47.039 | 71.931 | 2.3873 | 13 | 0.1 | 0.5 | 2283.96 | |
14 | 46.974 | 71.87 | 2.393 | 14 | 0 | 0.006 | 1612.295 | |
15 | 46.667 | 71.499 | 2.4104 | 15 | 0.002 | 0.006 | 7.0508 | |
16 | 46.273 | 71.155 | 2.4337 | 16 | 0.002 | 0.006 | 1554.635 | |
17 | 16.3 | 25.251 | 5.3414 | 17 | 0.1 | 0.5 | 8.9353 | |
18 | 0.096 | 0.482 | 6.9067 | 18 | 0.1 | 0.5 | 7.0235 | |
19 | 0.095 | 0.485 | 6.9068 | 19 | 0.1 | 0.5 | 6.911 | |
20 | 0.097 | 0.488 | 6.9068 | 20 | 0.1 | 0.5 | 6.9091 | |
21 | 0.094 | 0.491 | 6.9067 | 21 | 0.1 | 0.5 | 6.9086 | |
22 | 0.094 | 0.487 | 6.9066 | 22 | 0.1 | 0.5 | 6.9085 | |
23 | 0.095 | 0.478 | 6.9066 | 23 | 0.1 | 0.5 | 6.9085 | |
24 | 0.101 | 0.491 | 6.9067 | 24 | 0.1 | 0.5 | 6.9082 | |
25 | 0.098 | 0.487 | 6.9067 | 25 | 0.1 | 0.5 | 6.9083 | |
26 | 0.097 | 0.483 | 6.9068 | 26 | 0.1 | 0.5 | 6.908 | |
27 | 0.099 | 0.485 | 6.9067 | 27 | 0.1 | 0.5 | 6.9082 | |
28 | 0.091 | 0.489 | 6.9067 | 28 | 0.1 | 0.5 | 6.9085 | |
29 | 0.097 | 0.489 | 6.9067 | 29 | 0.1 | 0.5 | 6.9083 | |
30 | 0.1 | 0.503 | 6.9065 | 30 | 0.1 | 0.5 | 6.908 | |
31 | 0.1 | 0.496 | 6.9063 | 31 | 0.1 | 0.5 | 6.9078 | |
32 | 0.098 | 0.487 | 6.9063 | 32 | 0.1 | 0.5 | 6.908 | |
33 | 0.092 | 0.472 | 6.9063 | 33 | 0.1 | 0.5 | 6.9078 | |
34 | 0.092 | 0.469 | 6.9063 | 34 | 0.1 | 0.5 | 6.9078 | |
35 | 0.095 | 0.461 | 6.9063 | 35 | 0.1 | 0.5 | 6.9078 | |
36 | 0.093 | 0.463 | 6.9063 | 36 | 0.1 | 0.5 | 6.9078 | |
37 | 0.086 | 0.459 | 6.9062 | 37 | 0.1 | 0.5 | 6.908 | |
38 | 0.089 | 0.467 | 6.9063 | 38 | 0.1 | 0.5 | 6.9078 | |
39 | 0.092 | 0.469 | 6.9063 | 39 | 0.1 | 0.5 | 6.908 | |
40 | 0.092 | 0.461 | 6.9063 | 40 | 0.1 | 0.5 | 6.908 | |
41 | 0.095 | 0.459 | 6.9063 | 41 | 0.1 | 0.5 | 6.9078 | |
42 | 0.09 | 0.46 | 6.9063 | 42 | 0.1 | 0.5 | 6.9078 | |
43 | 0.09 | 0.461 | 6.9063 | 43 | 0.1 | 0.5 | 6.908 | |
44 | 0.093 | 0.463 | 6.9063 | 44 | 0.1 | 0.5 | 6.908 | |
45 | 0.094 | 0.464 | 6.9063 | 45 | 0.1 | 0.5 | 6.9078 | |
46 | 0.09 | 0.457 | 6.9063 | 46 | 0.1 | 0.5 | 6.9078 | |
47 | 0.091 | 0.466 | 6.9063 | 47 | 0.1 | 0.5 | 6.908 | |
48 | 0.089 | 0.465 | 6.9063 | 48 | 0.1 | 0.5 | 6.9078 | |
49 | 0.09 | 0.449 | 6.9063 | 49 | 0.1 | 0.5 | 6.9078 | |
50 | 0.094 | 0.46 | 6.9063 | 50 | 0.1 | 0.5 | 6.9078 | |
51 | 0.094 | 0.464 | 6.9063 | 51 | 0.1 | 0.5 | 6.908 | |
52 | 0.092 | 0.473 | 6.9063 | 52 | 0.1 | 0.5 | 6.9078 | |
53 | 0.094 | 0.462 | 6.9063 | 53 | 0.1 | 0.5 | 6.9078 | |
54 | 0.088 | 0.468 | 6.9063 | 54 | 0.1 | 0.5 | 6.9078 | |
55 | 0.091 | 0.453 | 6.9063 | 55 | 0.1 | 0.5 | 6.9078 | |
56 | 0.091 | 0.45 | 6.9064 | 56 | 0.1 | 0.5 | 6.9078 | |
57 | 0.093 | 0.472 | 6.9063 | 57 | 0.1 | 0.5 | 6.9077 | |
58 | 0.09 | 0.455 | 6.9063 | 58 | 0.1 | 0.5 | 6.9077 | |
59 | 0.091 | 0.464 | 6.9063 | 59 | 0.1 | 0.5 | 6.9078 | |
60 | 0.099 | 0.491 | 6.9063 | 60 | 0.1 | 0.5 | 6.9078 | |
61 | 0.101 | 0.499 | 6.9063 | 61 | 0.1 | 0.5 | 6.908 | |
62 | 0.1 | 0.496 | 6.9063 | 62 | 0.1 | 0.5 | 6.908 | |
63 | 0.1 | 0.487 | 6.9063 | 63 | 0.1 | 0.5 | 6.9082 | |
64 | 0.098 | 0.483 | 6.9063 | 64 | 0.1 | 0.5 | 6.9082 | |
65 | 0.094 | 0.461 | 6.9064 | 65 | 0.1 | 0.5 | 6.9082 | |
66 | 0.091 | 0.467 | 6.9063 | 66 | 0.1 | 0.5 | 6.9082 | |
67 | 0.093 | 0.466 | 6.9064 | 67 | 0.1 | 0.5 | 6.9082 | |
68 | 0.097 | 0.471 | 6.9063 | 68 | 0.1 | 0.5 | 6.9082 | |
69 | 0.088 | 0.461 | 6.9064 | 69 | 0.1 | 0.5 | 6.9082 | |
70 | 0.093 | 0.459 | 6.9063 | 70 | 0.1 | 0.5 | 6.9082 | |
71 | 0.096 | 0.473 | 6.9064 | 71 | 0.1 | 0.5 | 6.9083 | |
72 | 0.092 | 0.471 | 6.9064 | 72 | 0.1 | 0.5 | 6.9082 | |
73 | 0.095 | 0.464 | 6.9064 | 73 | 0.1 | 0.5 | 6.9083 | |
74 | 0.092 | 0.464 | 6.9063 | 74 | 0.1 | 0.5 | 6.9083 | |
75 | 0.09 | 0.462 | 6.9064 | 75 | 0.1 | 0.5 | 6.9083 | |
76 | 0.093 | 0.467 | 6.9064 | 76 | 0.1 | 0.5 | 6.9083 | |
77 | 0.091 | 0.467 | 6.9064 | 77 | 0.1 | 0.5 | 6.9083 | |
78 | 0.092 | 0.455 | 6.9064 | 78 | 0.1 | 0.5 | 6.9083 | |
79 | 0.09 | 0.459 | 6.9064 | 79 | 0.1 | 0.5 | 6.9082 | |
80 | 0.095 | 0.493 | 6.9064 | 80 | 0.1 | 0.5 | 6.9082 | |
81 | 0.094 | 0.486 | 6.9064 | 81 | 0.1 | 0.5 | 6.9082 | |
82 | 0.099 | 0.487 | 6.9064 | 82 | 0.1 | 0.5 | 6.9082 | |
83 | 0.094 | 0.498 | 6.9064 | 83 | 0.1 | 0.5 | 6.9082 | |
84 | 0.096 | 0.492 | 6.9064 | 84 | 0.1 | 0.5 | 6.9082 | |
85 | 0.097 | 0.487 | 6.9064 | 85 | 0.1 | 0.5 | 6.9083 | |
86 | 0.096 | 0.492 | 6.9064 | 86 | 0.1 | 0.5 | 6.9082 | |
87 | 0.1 | 0.493 | 6.9065 | 87 | 0.1 | 0.5 | 6.9083 | |
88 | 0.099 | 0.482 | 6.9064 | 88 | 0.1 | 0.5 | 6.9083 | |
89 | 0.097 | 0.498 | 6.9064 | 89 | 0.1 | 0.5 | 6.9082 |
When I got the wrong result, I ran main.py on another server with two V100s. I only used one, because when I tried to use two I hit the problem described in PyTorch issue 11327. But it does not affect this training:
python main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 /imagenet
Then I got the same wrong result; it is very similar to the results above.
I want to know whether you have tested main.py on the ImageNet dataset with fp16 and got good accuracy, such as the top1 = 76% reported in the paper Mixed Precision Training.
Yes, we have tested full convergence with fp16.
I noticed you aren’t using loss scaling. For mixed precision training, loss scaling is an important step (although it is not necessary for FP32 training).
main.py supports static loss scaling, which uses a constant loss scale throughout training. I believe for our convergence runs we used a static loss scale of 128.
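For example (assuming your copy of main.py exposes the scale through a --static-loss-scale flag; check python main.py --help for the exact name), the run would look roughly like:
python -m torch.distributed.launch --nproc_per_node=8 main.py --fp16 --static-loss-scale 128 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 /imagenet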
If you wish to try dynamic loss scaling instead, which automatically adjusts the loss scale whenever it encounters a NaN/inf, you can try running the main_fp16_optimizer example instead.

Also, in general, V100 is better than P100 for mixed precision training. Volta tensor cores take in FP16 data and do the accumulation for gemms and convolutions in FP32. Pascal doesn't have tensor cores (it only supports FP16 through vectorized instructions), so it is forced to do the accumulation in FP16, which is less stable.

Edit: I'm told PyTorch on P100 is smart enough to avoid calling cublas/cudnn functions that perform accumulation in FP16. Instead, when operating on FP16 data, it calls cublas/cudnn functions that internally upconvert to FP32 arithmetic. So the numerical stability of these ops on P100 should be roughly equivalent to V100; P100 still doesn't have tensor cores, though, so it naturally won't match V100's performance.

I wouldn't worry too much about that. It's probably not your fault, or the script's fault. The paper says they used PyTorch, but it doesn't say which version of cudnn/cublas they were using, which can definitely make a difference. In my opinion it's better to use loss scaling for safety, and to understand the theoretical reasons why loss scaling is helpful.
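For illustration, here is a minimal, generic sketch of static loss scaling with fp32 master weights. It uses a toy model standing in for resnet50 and is not the exact code path in main.py (requires a GPU):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

scale = 128.0  # static loss scale

torch.manual_seed(0)
model = nn.Linear(16, 4).cuda().half()  # toy fp16 model standing in for resnet50
# fp32 "master" copies of the weights; the optimizer updates these
master = [p.detach().clone().float().requires_grad_() for p in model.parameters()]
optimizer = torch.optim.SGD(master, lr=0.1, momentum=0.9)

for step in range(5):
    x = torch.randn(32, 16, device="cuda", dtype=torch.half)
    y = torch.randint(0, 4, (32,), device="cuda")

    loss = F.cross_entropy(model(x).float(), y)  # compute the loss in fp32

    model.zero_grad()
    (loss * scale).backward()  # scale before backward so small fp16 gradients don't flush to zero

    for m, p in zip(master, model.parameters()):
        m.grad = p.grad.detach().float() / scale  # unscale into fp32 before the update
    optimizer.step()

    for m, p in zip(master, model.parameters()):
        p.data.copy_(m.data)  # copy the updated fp32 weights back into the fp16 model
```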
This talk, starting at slide 12 (http://on-demand.gputechconf.com/gtc/2018/presentation/s8923-training-neural-networks-with-mixed-precision-theory-and-practice.pdf), and this talk, starting at slide 31 (http://on-demand.gputechconf.com/gtc-taiwan/2018/pdf/5-1_Internal%20Speaker_Michael%20Carilli_PDF%20For%20Sharing.pdf), are good resources.