
resnet50 doesn't converge when running example/imagenet/main.py on imagenet dataset with fp16


I want to use example/imagenet/main.py to train a ResNet-50 model on the ImageNet dataset with fp16, but the accuracy does not converge. Without fp16, training reaches the expected top-1 accuracy of 76%.

My command is:

python -m torch.distributed.launch --nproc_per_node=8 main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 /imagenet
  • Python version: 3.6.2

  • PyTorch version: 0.4.1

  • torchvision version: 0.2.1

  • OS: Ubuntu 16.04.3 LTS

  • Nvidia driver version: 390.46

  • CUDA runtime version: 9.0

  • GPU number: 8

  • GPU model: Tesla P100-PCIE

The validation accuracy suddenly drops to 0 after about 7 epochs, and the training accuracy suddenly drops to 0 after about 17 epochs.

I saved the model's gradients (para.grad) after each epoch. At epoch 17 I found that the distribution of the model's parameters (para.data) was still normal, but 84.5% of the gradient values (para.grad) were NaN.
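
For reference, the fraction of NaN gradient values can be measured with a small helper like the one below. This is a hypothetical sketch, not part of main.py, and assumes it is called right after loss.backward():

# Hypothetical helper (not part of main.py) to measure what fraction of
# gradient values are NaN; call it right after loss.backward().
import torch

def nan_grad_fraction(model):
    total, nans = 0, 0
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.detach().float()   # cast so isnan behaves the same for fp16 grads
        total += g.numel()
        nans += torch.isnan(g).sum().item()
    return nans / max(total, 1)

# e.g. print("epoch %d: %.1f%% NaN grads" % (epoch, 100 * nan_grad_fraction(model)))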

The accuracy results are as follows:

epoch  Top1(train)  Top5(train)  Loss(train)  epoch  Top1(val)  Top5(val)  Loss(val)
0 3.166 9.401 6.0748 0 3.054 9.308 12.1711
1 15.428 34.406 4.4438 1 18.206 39.778 4.1692
2 26.628 50.356 3.6012 2 29.108 54.764 3.3741
3 34.069 59.227 3.1205 3 30.938 56.796 3.2538
4 37.787 63.202 2.8991 4 29.46 55.652 3.3615
5 40.33 65.834 2.7536 5 12.982 30.574 5.1418
6 42.476 67.836 2.6325 6 0.428 1.608 8.4904
7 43.851 69.086 2.5574 7 0.1 0.502 8.2962
8 44.888 70.058 2.5005 8 0.1 0.49 15.8809
9 45.692 70.684 2.4588 9 0.1 0.5 83.5319
10 46.378 71.274 2.4261 10 0.104 0.496 184.0083
11 46.66 71.618 2.4065 11 0.1 0.504 210.9373
12 46.938 71.805 2.3928 12 0.1 0.5 585.1285
13 47.039 71.931 2.3873 13 0.1 0.5 2283.96
14 46.974 71.87 2.393 14 0 0.006 1612.295
15 46.667 71.499 2.4104 15 0.002 0.006 7.0508
16 46.273 71.155 2.4337 16 0.002 0.006 1554.635
17 16.3 25.251 5.3414 17 0.1 0.5 8.9353
18 0.096 0.482 6.9067 18 0.1 0.5 7.0235
19 0.095 0.485 6.9068 19 0.1 0.5 6.911
20 0.097 0.488 6.9068 20 0.1 0.5 6.9091
21 0.094 0.491 6.9067 21 0.1 0.5 6.9086
22 0.094 0.487 6.9066 22 0.1 0.5 6.9085
23 0.095 0.478 6.9066 23 0.1 0.5 6.9085
24 0.101 0.491 6.9067 24 0.1 0.5 6.9082
25 0.098 0.487 6.9067 25 0.1 0.5 6.9083
26 0.097 0.483 6.9068 26 0.1 0.5 6.908
27 0.099 0.485 6.9067 27 0.1 0.5 6.9082
28 0.091 0.489 6.9067 28 0.1 0.5 6.9085
29 0.097 0.489 6.9067 29 0.1 0.5 6.9083
30 0.1 0.503 6.9065 30 0.1 0.5 6.908
31 0.1 0.496 6.9063 31 0.1 0.5 6.9078
32 0.098 0.487 6.9063 32 0.1 0.5 6.908
33 0.092 0.472 6.9063 33 0.1 0.5 6.9078
34 0.092 0.469 6.9063 34 0.1 0.5 6.9078
35 0.095 0.461 6.9063 35 0.1 0.5 6.9078
36 0.093 0.463 6.9063 36 0.1 0.5 6.9078
37 0.086 0.459 6.9062 37 0.1 0.5 6.908
38 0.089 0.467 6.9063 38 0.1 0.5 6.9078
39 0.092 0.469 6.9063 39 0.1 0.5 6.908
40 0.092 0.461 6.9063 40 0.1 0.5 6.908
41 0.095 0.459 6.9063 41 0.1 0.5 6.9078
42 0.09 0.46 6.9063 42 0.1 0.5 6.9078
43 0.09 0.461 6.9063 43 0.1 0.5 6.908
44 0.093 0.463 6.9063 44 0.1 0.5 6.908
45 0.094 0.464 6.9063 45 0.1 0.5 6.9078
46 0.09 0.457 6.9063 46 0.1 0.5 6.9078
47 0.091 0.466 6.9063 47 0.1 0.5 6.908
48 0.089 0.465 6.9063 48 0.1 0.5 6.9078
49 0.09 0.449 6.9063 49 0.1 0.5 6.9078
50 0.094 0.46 6.9063 50 0.1 0.5 6.9078
51 0.094 0.464 6.9063 51 0.1 0.5 6.908
52 0.092 0.473 6.9063 52 0.1 0.5 6.9078
53 0.094 0.462 6.9063 53 0.1 0.5 6.9078
54 0.088 0.468 6.9063 54 0.1 0.5 6.9078
55 0.091 0.453 6.9063 55 0.1 0.5 6.9078
56 0.091 0.45 6.9064 56 0.1 0.5 6.9078
57 0.093 0.472 6.9063 57 0.1 0.5 6.9077
58 0.09 0.455 6.9063 58 0.1 0.5 6.9077
59 0.091 0.464 6.9063 59 0.1 0.5 6.9078
60 0.099 0.491 6.9063 60 0.1 0.5 6.9078
61 0.101 0.499 6.9063 61 0.1 0.5 6.908
62 0.1 0.496 6.9063 62 0.1 0.5 6.908
63 0.1 0.487 6.9063 63 0.1 0.5 6.9082
64 0.098 0.483 6.9063 64 0.1 0.5 6.9082
65 0.094 0.461 6.9064 65 0.1 0.5 6.9082
66 0.091 0.467 6.9063 66 0.1 0.5 6.9082
67 0.093 0.466 6.9064 67 0.1 0.5 6.9082
68 0.097 0.471 6.9063 68 0.1 0.5 6.9082
69 0.088 0.461 6.9064 69 0.1 0.5 6.9082
70 0.093 0.459 6.9063 70 0.1 0.5 6.9082
71 0.096 0.473 6.9064 71 0.1 0.5 6.9083
72 0.092 0.471 6.9064 72 0.1 0.5 6.9082
73 0.095 0.464 6.9064 73 0.1 0.5 6.9083
74 0.092 0.464 6.9063 74 0.1 0.5 6.9083
75 0.09 0.462 6.9064 75 0.1 0.5 6.9083
76 0.093 0.467 6.9064 76 0.1 0.5 6.9083
77 0.091 0.467 6.9064 77 0.1 0.5 6.9083
78 0.092 0.455 6.9064 78 0.1 0.5 6.9083
79 0.09 0.459 6.9064 79 0.1 0.5 6.9082
80 0.095 0.493 6.9064 80 0.1 0.5 6.9082
81 0.094 0.486 6.9064 81 0.1 0.5 6.9082
82 0.099 0.487 6.9064 82 0.1 0.5 6.9082
83 0.094 0.498 6.9064 83 0.1 0.5 6.9082
84 0.096 0.492 6.9064 84 0.1 0.5 6.9082
85 0.097 0.487 6.9064 85 0.1 0.5 6.9083
86 0.096 0.492 6.9064 86 0.1 0.5 6.9082
87 0.1 0.493 6.9065 87 0.1 0.5 6.9083
88 0.099 0.482 6.9064 88 0.1 0.5 6.9083
89 0.097 0.498 6.9064 89 0.1 0.5 6.9082

After I got the wrong result, I ran main.py on another server with two V100 GPUs. I only used one of them, because when I tried to use two I hit the problem described in PyTorch issue 11327. But that does not affect my training…

python main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 /imagenet

I then got the same wrong result; it is very similar to the results above.

I want to know whether you have tested main.py on the ImageNet dataset with fp16 and obtained good accuracy, such as the top-1 = 76% reported in the paper Mixed Precision Training.

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
mcarilli commented, Sep 14, 2018

Yes, we have tested full convergence with fp16.

I noticed you aren’t using loss scaling. For mixed precision training, loss scaling is an important step (although it is not necessary for FP32 training). main.py supports static loss scaling, which uses a constant loss scale throughout training. I believe for our convergence runs we used a static loss scale of 128:

python main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 --static-loss-scale 128 /imagenet
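
Conceptually, static loss scaling amounts to something like the following simplified sketch. It is not the actual main.py implementation; it assumes model, criterion, optimizer and a half-precision input batch already exist:

# Simplified sketch of static loss scaling -- not the actual main.py code.
loss_scale = 128.0

output = model(images)
loss = criterion(output, target)

optimizer.zero_grad()
# Multiply the loss by a constant so that small gradient values do not
# underflow to zero in fp16 during backward().
(loss * loss_scale).backward()

# Divide the gradients back down before the optimizer step, so the
# effective step size is unchanged. Real mixed-precision code usually
# does this on an fp32 "master" copy of the weights.
for p in model.parameters():
    if p.grad is not None:
        p.grad.data.div_(loss_scale)

optimizer.step()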

If you wish to try dynamic loss scaling, which automatically adjusts the loss scale whenever it encounters a NaN/inf, you can run the main_fp16_optimizer example instead:

python main_fp16_optimizer.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 --dynamic-loss-scale /imagenet
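
In simplified form, dynamic loss scaling looks roughly like the sketch below. It is not the actual main_fp16_optimizer.py implementation; names such as loader, model, criterion and optimizer are assumed to exist, and the overflow check uses torch.isfinite from newer PyTorch releases:

# Simplified sketch of dynamic loss scaling: halve the scale (and skip the
# update) whenever an inf/NaN gradient appears, and cautiously double it
# again after a long stretch of clean steps.
import torch

scale = 2.0 ** 15
steps_since_overflow = 0

for images, target in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), target)
    (loss * scale).backward()

    overflow = any(
        p.grad is not None and not torch.isfinite(p.grad.float()).all()
        for p in model.parameters()
    )
    if overflow:
        scale /= 2.0                   # back off and skip this update entirely
        steps_since_overflow = 0
        continue

    for p in model.parameters():
        if p.grad is not None:
            p.grad.data.div_(scale)
    optimizer.step()

    steps_since_overflow += 1
    if steps_since_overflow == 2000:   # long stable run: try a larger scale
        scale *= 2.0
        steps_since_overflow = 0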

Also, in general, V100 is better than P100 for mixed precision training. Volta tensor cores take in FP16 data and do the accumulate for gemms and convolutions in FP32. Pascal doesn’t have tensor cores (it only supports FP16 through vectorized instructions) so it is forced to do accumulates in FP16, which is less stable. Edit: I’m told Pytorch on P100 is smart enough to avoid calling cublas/cudnn functions that perform accumulates in FP16. Instead, when operating on FP16 data, it calls cublas/cudnn functions that internally upconvert to FP32 arithmetic. So the numerical stability of these ops on P100 should be roughly equivalent to V100, but they still don’t have tensor cores, so naturally this will not give the same performance as V100.

0 reactions
mcarilli commented, Sep 19, 2018

I wouldn’t worry too much about that. It’s probably not your fault, or the script’s fault. The paper says they used Pytorch, but it doesn’t say which version of cudnn/cublas they were using, which can definitely make a difference. In my opinion it’s better to use loss scaling for safety, and understand the theoretical reasons why loss scaling is helpful.
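
As a standalone illustration of the underflow problem that loss scaling addresses (not tied to the scripts above; torch.finfo is available in recent PyTorch releases):

# fp16 cannot represent values much below ~6e-8, so tiny gradient values
# silently become zero unless they are scaled up first.
import torch

g = torch.tensor(1e-8)                    # a typical very small gradient value
print(g.half())                           # tensor(0., dtype=torch.float16): underflow
print((g * 128).half())                   # representable after scaling by 128
print(torch.finfo(torch.float16).tiny)    # smallest normal fp16 value, ~6.1e-5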

These talks are good resources:

  • Starting at slide 12: http://on-demand.gputechconf.com/gtc/2018/presentation/s8923-training-neural-networks-with-mixed-precision-theory-and-practice.pdf
  • Starting at slide 31: http://on-demand.gputechconf.com/gtc-taiwan/2018/pdf/5-1_Internal Speaker_Michael Carilli_PDF For Sharing.pdf


Top Results From Across the Web

  • Why is my resnet50 model in Keras not converging?
    Try removing that layer.trainable = True for loop. And write it just under the base_model = ResNet50(weights='imagenet', include_top=False, ...
  • Why does my training loss of ResNet50 not converge?
    I use nn.MESLoss() and Adam optimizer with learning rate of 0.0001 to train this network, but I find that the MSE loss does...
  • ImageNet/ResNet-50 Training in 224 Seconds
    The first issue is convergence accuracy degradation with large mini-batch training [1] [2]. The second issue is communication overhead of gradient ...
  • ImageNet Training in PyTorch - NVIDIA Documentation Center
    This implements training of popular model architectures, such as ResNet, AlexNet, and VGG on the ImageNet dataset. This version has been modified to...
  • ImageNet Training in Minutes - UC Berkeley EECS
    the effectiveness on two neural networks: AlexNet and ResNet-50 trained with the ImageNet-1k dataset while preserving the state-.
