
Why is the training loss too big?

See original GitHub issue

I use the default config, configs/faster_rcnn_r50_fpn_1x.py, passing it as the only argument to tools/train.py.

At Epoch [1][1000/58633], the training loss becomes extremely large. Is this normal? Why?

Here is the log:

/home/mf/anaconda3/envs/open-mmlab/bin/python /home/mf/w_public/mmdetection/tools/train.py configs/faster_rcnn_r50_fpn_1x.py
2019-07-30 10:02:20,710 - INFO - Distributed training: False
2019-07-30 10:02:21,079 - INFO - load model from: modelzoo://resnet50
2019-07-30 10:02:21,425 - WARNING - unexpected key in source state_dict: fc.weight, fc.bias

missing keys in source state_dict: layer1.2.bn1.num_batches_tracked, layer3.5.bn2.num_batches_tracked, bn1.num_batches_tracked, layer3.1.bn1.num_batches_tracked, layer4.0.bn2.num_batches_tracked, layer2.1.bn3.num_batches_tracked, layer3.2.bn1.num_batches_tracked, layer1.0.bn1.num_batches_tracked, layer3.0.bn2.num_batches_tracked, layer2.3.bn3.num_batches_tracked, layer4.1.bn1.num_batches_tracked, layer1.1.bn2.num_batches_tracked, layer4.0.bn3.num_batches_tracked, layer4.1.bn2.num_batches_tracked, layer2.0.downsample.1.num_batches_tracked, layer3.1.bn2.num_batches_tracked, layer4.0.bn1.num_batches_tracked, layer3.5.bn3.num_batches_tracked, layer2.2.bn2.num_batches_tracked, layer1.1.bn1.num_batches_tracked, layer3.1.bn3.num_batches_tracked, layer4.1.bn3.num_batches_tracked, layer2.2.bn1.num_batches_tracked, layer1.2.bn2.num_batches_tracked, layer4.0.downsample.1.num_batches_tracked, layer3.3.bn2.num_batches_tracked, layer3.4.bn2.num_batches_tracked, layer4.2.bn1.num_batches_tracked, layer3.0.downsample.1.num_batches_tracked, layer3.2.bn2.num_batches_tracked, layer3.2.bn3.num_batches_tracked, layer3.3.bn3.num_batches_tracked, layer3.0.bn1.num_batches_tracked, layer2.3.bn1.num_batches_tracked, layer3.4.bn3.num_batches_tracked, layer2.3.bn2.num_batches_tracked, layer3.5.bn1.num_batches_tracked, layer2.0.bn3.num_batches_tracked, layer3.3.bn1.num_batches_tracked, layer1.1.bn3.num_batches_tracked, layer3.4.bn1.num_batches_tracked, layer2.0.bn1.num_batches_tracked, layer2.1.bn1.num_batches_tracked, layer2.2.bn3.num_batches_tracked, layer3.0.bn3.num_batches_tracked, layer1.0.bn3.num_batches_tracked, layer4.2.bn2.num_batches_tracked, layer1.0.downsample.1.num_batches_tracked, layer2.0.bn2.num_batches_tracked, layer4.2.bn3.num_batches_tracked, layer2.1.bn2.num_batches_tracked, layer1.2.bn3.num_batches_tracked, layer1.0.bn2.num_batches_tracked

loading annotations into memory...
Done (t=9.63s)
creating index...
index created!
2019-07-30 10:02:35,009 - INFO - Start running, host: mf@mf-System-Product-Name, work_dir: /home/mf/w_public/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x
2019-07-30 10:02:35,009 - INFO - workflow: [('train', 1)], max: 12 epochs
2019-07-30 10:02:54,084 - INFO - Epoch [1][50/58633]	lr: 0.00797, eta: 3 days, 2:32:28, time: 0.381, data_time: 0.009, memory: 3791, loss_rpn_cls: 0.3375, loss_rpn_bbox: 0.0867, loss_cls: 0.6763, acc: 92.3008, loss_bbox: 0.1246, loss: 1.2251
2019-07-30 10:03:12,616 - INFO - Epoch [1][100/58633]	lr: 0.00931, eta: 3 days, 1:29:03, time: 0.371, data_time: 0.004, memory: 3791, loss_rpn_cls: 0.2140, loss_rpn_bbox: 0.0703, loss_cls: 0.5111, acc: 93.2188, loss_bbox: 0.1525, loss: 0.9479
2019-07-30 10:03:31,010 - INFO - Epoch [1][150/58633]	lr: 0.01064, eta: 3 days, 0:56:47, time: 0.368, data_time: 0.003, memory: 3791, loss_rpn_cls: 0.1666, loss_rpn_bbox: 0.0609, loss_cls: 0.5251, acc: 92.8848, loss_bbox: 0.1633, loss: 0.9159
2019-07-30 10:03:49,761 - INFO - Epoch [1][200/58633]	lr: 0.01197, eta: 3 days, 1:01:25, time: 0.375, data_time: 0.004, memory: 3791, loss_rpn_cls: 0.2267, loss_rpn_bbox: 0.0921, loss_cls: 0.6174, acc: 91.6387, loss_bbox: 0.1854, loss: 1.1217
2019-07-30 10:04:08,190 - INFO - Epoch [1][250/58633]	lr: 0.01331, eta: 3 days, 0:49:04, time: 0.369, data_time: 0.003, memory: 3791, loss_rpn_cls: 0.1873, loss_rpn_bbox: 0.0758, loss_cls: 0.6097, acc: 91.6562, loss_bbox: 0.1857, loss: 1.0585
2019-07-30 10:04:26,377 - INFO - Epoch [1][300/58633]	lr: 0.01464, eta: 3 days, 0:31:12, time: 0.364, data_time: 0.003, memory: 3791, loss_rpn_cls: 0.1744, loss_rpn_bbox: 0.0717, loss_cls: 0.5832, acc: 91.6348, loss_bbox: 0.1907, loss: 1.0200
2019-07-30 10:04:44,331 - INFO - Epoch [1][350/58633]	lr: 0.01597, eta: 3 days, 0:10:34, time: 0.359, data_time: 0.003, memory: 3791, loss_rpn_cls: 0.1876, loss_rpn_bbox: 0.0841, loss_cls: 0.5484, acc: 91.5840, loss_bbox: 0.1910, loss: 1.0112
2019-07-30 10:05:02,978 - INFO - Epoch [1][400/58633]	lr: 0.01731, eta: 3 days, 0:15:18, time: 0.373, data_time: 0.004, memory: 3791, loss_rpn_cls: 0.1606, loss_rpn_bbox: 0.0639, loss_cls: 0.6050, acc: 92.0977, loss_bbox: 0.1788, loss: 1.0083
2019-07-30 10:05:21,395 - INFO - Epoch [1][450/58633]	lr: 0.01864, eta: 3 days, 0:12:57, time: 0.368, data_time: 0.003, memory: 3791, loss_rpn_cls: 0.2056, loss_rpn_bbox: 0.0768, loss_cls: 0.6062, acc: 91.5117, loss_bbox: 0.1879, loss: 1.0766
2019-07-30 10:05:39,837 - INFO - Epoch [1][500/58633]	lr: 0.01997, eta: 3 days, 0:11:36, time: 0.369, data_time: 0.004, memory: 3791, loss_rpn_cls: 0.1487, loss_rpn_bbox: 0.0768, loss_cls: 0.5943, acc: 91.7441, loss_bbox: 0.1892, loss: 1.0090
2019-07-30 10:05:58,293 - INFO - Epoch [1][550/58633]	lr: 0.02000, eta: 3 days, 0:10:43, time: 0.369, data_time: 0.004, memory: 3791, loss_rpn_cls: 0.3145, loss_rpn_bbox: 0.1089, loss_cls: 0.4854, acc: 93.7539, loss_bbox: 0.1304, loss: 1.0391
2019-07-30 10:06:16,786 - INFO - Epoch [1][600/58633]	lr: 0.02000, eta: 3 days, 0:10:40, time: 0.370, data_time: 0.004, memory: 3791, loss_rpn_cls: 0.2104, loss_rpn_bbox: 0.0980, loss_cls: 0.5118, acc: 93.2090, loss_bbox: 0.1509, loss: 0.9711
2019-07-30 10:06:35,303 - INFO - Epoch [1][650/58633]	lr: 0.02000, eta: 3 days, 0:11:00, time: 0.370, data_time: 0.003, memory: 3791, loss_rpn_cls: 0.2558, loss_rpn_bbox: 0.1247, loss_cls: 0.5771, acc: 91.3262, loss_bbox: 0.1899, loss: 1.1476
2019-07-30 10:06:54,683 - INFO - Epoch [1][700/58633]	lr: 0.02000, eta: 3 days, 0:25:41, time: 0.388, data_time: 0.004, memory: 3791, loss_rpn_cls: 0.2355, loss_rpn_bbox: 0.0966, loss_cls: 0.4322, acc: 93.9688, loss_bbox: 0.1319, loss: 0.8962
2019-07-30 10:07:13,302 - INFO - Epoch [1][750/58633]	lr: 0.02000, eta: 3 days, 0:26:29, time: 0.372, data_time: 0.004, memory: 3791, loss_rpn_cls: 0.2131, loss_rpn_bbox: 0.0831, loss_cls: 0.4883, acc: 93.4316, loss_bbox: 0.1440, loss: 0.9285
2019-07-30 10:07:32,554 - INFO - Epoch [1][800/58633]	lr: 0.02000, eta: 3 days, 0:36:25, time: 0.385, data_time: 0.004, memory: 3791, loss_rpn_cls: 0.3003, loss_rpn_bbox: 0.1204, loss_cls: 0.5138, acc: 93.3008, loss_bbox: 0.1436, loss: 1.0781
2019-07-30 10:07:50,711 - INFO - Epoch [1][850/58633]	lr: 0.02000, eta: 3 days, 0:30:03, time: 0.363, data_time: 0.004, memory: 3791, loss_rpn_cls: 0.4217, loss_rpn_bbox: 0.2851, loss_cls: 0.8004, acc: 94.0859, loss_bbox: 0.1257, loss: 1.6328
2019-07-30 10:08:08,752 - INFO - Epoch [1][900/58633]	lr: 0.02000, eta: 3 days, 0:22:50, time: 0.361, data_time: 0.004, memory: 3791, loss_rpn_cls: 156.2920, loss_rpn_bbox: 62.4721, loss_cls: 364.4698, acc: 82.0712, loss_bbox: 41.8640, loss: 625.0979
2019-07-30 10:08:26,768 - INFO - Epoch [1][950/58633]	lr: 0.02000, eta: 3 days, 0:16:03, time: 0.360, data_time: 0.003, memory: 3791, loss_rpn_cls: 447235.8581, loss_rpn_bbox: 526061.7554, loss_cls: 4071407055.3989, acc: 80.6797, loss_bbox: 246750333.2189, loss: 4319130658.6691
2019-07-30 10:08:45,165 - INFO - Epoch [1][1000/58633]	lr: 0.02000, eta: 3 days, 0:14:23, time: 0.368, data_time: 0.004, memory: 3791, loss_rpn_cls: 663974819297698.0000, loss_rpn_bbox: 86506371132308.3125, loss_cls: 9498945394371206.0000, acc: 72.3992, loss_bbox: 333746037078607.0625, loss: 10583172569332912.0000
2019-07-30 10:09:03,294 - INFO - Epoch [1][1050/58633]	lr: 0.02000, eta: 3 days, 0:09:51, time: 0.363, data_time: 0.004, memory: 3791, loss_rpn_cls: 1364391539087953075634176.0000, loss_rpn_bbox: 567411899414833660952576.0000, loss_cls: 138728180119747246268874752.0000, acc: 91.6599, loss_bbox: 13771815597760854209069056.0000, loss: 154431800497686002255527936.0000
2019-07-30 10:09:21,490 - INFO - Epoch [1][1100/58633]	lr: 0.02000, eta: 3 days, 0:06:25, time: 0.364, data_time: 0.004, memory: 3791, loss_rpn_cls: 7248172595877238437576704.0000, loss_rpn_bbox: 2399984347958531225288704.0000, loss_cls: 749459289716433297625579520.0000, acc: 94.6113, loss_bbox: 77432702443750119213891584.0000, loss: 836540162127489252454301696.0000
2019-07-30 10:09:39,857 - INFO - Epoch [1][1150/58633]	lr: 0.02000, eta: 3 days, 0:05:00, time: 0.367, data_time: 0.003, memory: 3791, loss_rpn_cls: 7019707082215567669592064.0000, loss_rpn_bbox: 4737551173594944490176512.0000, loss_cls: 952746356550775618484043776.0000, acc: 94.1113, loss_bbox: 116985685205953382291865600.0000, loss: 1081489303707954757133926400.0000
2019-07-30 10:09:58,116 - INFO - Epoch [1][1200/58633]	lr: 0.02000, eta: 3 days, 0:02:37, time: 0.365, data_time: 0.003, memory: 3791, loss_rpn_cls: 6623643306337326952611840.0000, loss_rpn_bbox: 1677773494959891533529088.0000, loss_cls: 676409084427291037360193536.0000, acc: 95.1016, loss_bbox: 61674887396537672514142208.0000, loss: 746385394081858182862340096.0000
2019-07-30 10:10:16,655 - INFO - Epoch [1][1250/58633]	lr: 0.02000, eta: 3 days, 0:03:02, time: 0.371, data_time: 0.004, memory: 3791, loss_rpn_cls: 6306572437338864673095680.0000, loss_rpn_bbox: 2397626350338094825209856.0000, loss_cls: 702858893330790020814995456.0000, acc: 94.9922, loss_bbox: 85713648393600203787075584.0000, loss: 797276741770739209455271936.0000
2019-07-30 10:10:35,082 - INFO - Epoch [1][1300/58633]	lr: 0.02000, eta: 3 days, 0:02:22, time: 0.369, data_time: 0.004, memory: 3791, loss_rpn_cls: 5784541178443221308538880.0000, loss_rpn_bbox: 1592911890236547304259584.0000, loss_cls: 584144961218907285242249216.0000, acc: 94.7910, loss_bbox: 60132916379517915439824896.0000, loss: 651655322485019431535640576.0000
2019-07-30 10:10:53,984 - INFO - Epoch [1][1350/58633]	lr: 0.02000, eta: 3 days, 0:05:51, time: 0.378, data_time: 0.004, memory: 3791, loss_rpn_cls: 5342794156158588918169600.0000, loss_rpn_bbox: 1858728942494471866548224.0000, loss_cls: 592920503840435445530886144.0000, acc: 94.7520, loss_bbox: 69530009335973618376507392.0000, loss: 669652035893432084856832000.0000
2019-07-30 10:11:13,073 - INFO - Epoch [1][1400/58633]	lr: 0.02000, eta: 3 days, 0:10:38, time: 0.382, data_time: 0.004, memory: 3791, loss_rpn_cls: 5165773008838793266987008.0000, loss_rpn_bbox: 2311574336324449930838016.0000, loss_cls: 596068163900160765641883648.0000, acc: 94.6641, loss_bbox: 61864457022218112606404608.0000, loss: 665409970260547009163296768.0000
2019-07-30 10:11:31,742 - INFO - Epoch [1][1450/58633]	lr: 0.02000, eta: 3 days, 0:11:41, time: 0.373, data_time: 0.004, memory: 3791, loss_rpn_cls: 4610083048248850272747520.0000, loss_rpn_bbox: 1432003406626812493561856.0000, loss_cls: 548534009977692903822065664.0000, acc: 95.3594, loss_bbox: 61887055758006783571918848.0000, loss: 616463152732647734380593152.0000
2019-07-30 10:11:51,226 - INFO - Epoch [1][1500/58633]	lr: 0.02000, eta: 3 days, 0:18:59, time: 0.390, data_time: 0.004, memory: 3791, loss_rpn_cls: 4342793929019683183263744.0000, loss_rpn_bbox: 2127836828204648078770176.0000, loss_cls: 443916903405240911975677952.0000, acc: 95.3457, loss_bbox: 47187610007356827836088320.0000, loss: 497575150226156379118239744.0000
2019-07-30 10:12:11,386 - INFO - Epoch [1][1550/58633]	lr: 0.02000, eta: 3 days, 0:30:54, time: 0.403, data_time: 0.004, memory: 3791, loss_rpn_cls: 3947261021612846834778112.0000, loss_rpn_bbox: 1148280921352944960405504.0000, loss_cls: 456843004510417720971362304.0000, acc: 95.1367, loss_bbox: 43518018316285994678091776.0000, loss: 505456564082218576423944192.0000
2019-07-30 10:12:31,639 - INFO - Epoch [1][1600/58633]	lr: 0.02000, eta: 3 days, 0:42:44, time: 0.405, data_time: 0.004, memory: 3791, loss_rpn_cls: 3707260822292568108695552.0000, loss_rpn_bbox: 1797451293467601050533888.0000, loss_cls: 439920061639958484962246656.0000, acc: 94.9512, loss_bbox: 48417321447056923657502720.0000, loss: 493842091228754245505253376.0000
2019-07-30 10:12:50,501 - INFO - Epoch [1][1650/58633]	lr: 0.02000, eta: 3 days, 0:43:58, time: 0.377, data_time: 0.004, memory: 3791, loss_rpn_cls: 3512799281492788139524096.0000, loss_rpn_bbox: 1518530066902682387349504.0000, loss_cls: 435772218125044191227543552.0000, acc: 94.8496, loss_bbox: 40495116691692944713842688.0000, loss: 481298668451964109665075200.0000
2019-07-30 10:13:09,015 - INFO - Epoch [1][1700/58633]	lr: 0.02000, eta: 3 days, 0:42:42, time: 0.370, data_time: 0.004, memory: 3791, loss_rpn_cls: 3284510314701118226038784.0000, loss_rpn_bbox: 1162567653717180499886080.0000, loss_cls: 405972334976007277457178624.0000, acc: 94.9414, loss_bbox: 44356791104946548898791424.0000, loss: 454776203957101249268023296.0000
2019-07-30 10:13:27,367 - INFO - Epoch [1][1750/58633]	lr: 0.02000, eta: 3 days, 0:40:25, time: 0.367, data_time: 0.003, memory: 3791, loss_rpn_cls: 3066270079157628303310848.0000, loss_rpn_bbox: 887112520764302662041600.0000, loss_cls: 347684722754085661123280896.0000, acc: 94.7969, loss_bbox: 37238435013324089461833728.0000, loss: 388876538665704419208724480.0000
2019-07-30 10:13:46,028 - INFO - Epoch [1][1800/58633]	lr: 0.02000, eta: 3 days, 0:40:15, time: 0.373, data_time: 0.004, memory: 3791, loss_rpn_cls: 2904379300085843231768576.0000, loss_rpn_bbox: 1112800091037589653422080.0000, loss_cls: 320129707916297712938516480.0000, acc: 95.2461, loss_bbox: 34943331210680193794441216.0000, loss: 359090218761948872486944768.0000
2019-07-30 10:14:05,340 - INFO - Epoch [1][1850/58633]	lr: 0.02000, eta: 3 days, 0:44:12, time: 0.386, data_time: 0.004, memory: 3791, loss_rpn_cls: 2566878010625097048522752.0000, loss_rpn_bbox: 981360457758289661263872.0000, loss_cls: 371481038657838346076684288.0000, acc: 94.1621, loss_bbox: 37869240806814503221067776.0000, loss: 412898520449593731773890560.0000
2019-07-30 10:14:24,521 - INFO - Epoch [1][1900/58633]	lr: 0.02000, eta: 3 days, 0:47:06, time: 0.384, data_time: 0.004, memory: 3791, loss_rpn_cls: 2506677724270411290509312.0000, loss_rpn_bbox: 810762719964234931765248.0000, loss_cls: 297558388428070848821198848.0000, acc: 94.8594, loss_bbox: 29505186815108458238443520.0000, loss: 330381013052977505187659776.0000
2019-07-30 10:14:44,188 - INFO - Epoch [1][1950/58633]	lr: 0.02000, eta: 3 days, 0:52:46, time: 0.393, data_time: 0.004, memory: 3791, loss_rpn_cls: 2327605172188252492267520.0000, loss_rpn_bbox: 944301412305724663398400.0000, loss_cls: 263073636518943062465970176.0000, acc: 94.9668, loss_bbox: 32365094862387531321704448.0000, loss: 298710635381363799664623616.0000
2019-07-30 10:15:03,766 - INFO - Epoch [1][2000/58633]	lr: 0.02000, eta: 3 days, 0:57:36, time: 0.392, data_time: 0.004, memory: 3791, loss_rpn_cls: 2120199917124528168239104.0000, loss_rpn_bbox: 569833194926640730210304.0000, loss_cls: 315955675236003999192711168.0000, acc: 94.1309, loss_bbox: 37569526119027021277298688.0000, loss: 356215231348066288970760192.0000
2019-07-30 10:15:24,328 - INFO - Epoch [1][2050/58633]	lr: 0.02000, eta: 3 days, 1:07:49, time: 0.411, data_time: 0.004, memory: 3791, loss_rpn_cls: 2019840936335403697307648.0000, loss_rpn_bbox: 746494062373447116259328.0000, loss_cls: 211094436197562531918643200.0000, acc: 94.5684, loss_bbox: 20937678238341660638445568.0000, loss: 234798447591566279381614592.0000

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

3 reactions
hellock commented, Aug 8, 2019

Please read GETTING_STARTED.md.

Important: The default learning rate in the config files is for 8 GPUs and 2 img/gpu (batch size = 8*2 = 16). According to the Linear Scaling Rule, you need to set the learning rate proportional to the batch size if you use a different number of GPUs or images per GPU, e.g., lr=0.01 for 4 GPUs * 2 img/gpu and lr=0.08 for 16 GPUs * 4 img/gpu.
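
To make the rule concrete, here is a minimal sketch (hypothetical, not part of the original thread) that computes the scaled learning rate, assuming the reference setting of 8 GPUs * 2 img/gpu at lr = 0.02 from the default config:

    # Linear Scaling Rule: scale the learning rate with the total batch size.
    # Assumed reference setting: 8 GPUs * 2 img/gpu = batch size 16 at lr = 0.02.
    def scaled_lr(num_gpus, imgs_per_gpu, base_lr=0.02, base_batch_size=16):
        return base_lr * (num_gpus * imgs_per_gpu) / base_batch_size

    print(scaled_lr(8, 2))   # 0.02   (reference setting)
    print(scaled_lr(4, 2))   # 0.01   (matches the example above)
    print(scaled_lr(16, 4))  # 0.08   (matches the example above)
    print(scaled_lr(1, 2))   # 0.0025 (single-GPU, 2 img/gpu -- likely the asker's setup)

In the config itself this means lowering the lr field of the optimizer dict, e.g. optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001) for an assumed single-GPU, 2 img/gpu run, where the lr value is the only change from the default.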

0 reactions
mengfu188 commented, Aug 9, 2019

Thank you very much. Forgive my folly.

Read more comments on GitHub >

Top Results From Across the Web

  • What happens if optimal training loss is too high - Stack Overflow
    Regarding your first question - it is not necessarily a problem that your training loss is high, since there is no threshold for...

  • Extremely large spike in training loss that destroys training ...
    Inspecting the prediction results after the spike showed that the training progress is basically destroyed and started over, with worse accuracy ...

  • Your validation loss is lower than your training loss? This is why!
    Symptoms: validation loss is consistently lower than the training loss, the gap between them remains more or less the same size and training...

  • the training loss is too big and never changed #124 - GitHub
    However, the loss becomes too big (batch size is 2, the loss is 1572846, learning rate is 1e-4), and it changed...

  • Training loss not decrease after certain epochs - Kaggle
    Is the network size too small / large? Check for overfitting or underfitting in the training history, then choose the best epoch size. Try...
