Speed up training?

See original GitHub issue

Hi, I’m trying to retrain the coref model starting from another BERT model trained on different data. The loss values do not seem to be going down, and training also seems slow, with the GPU underutilized (screenshot below). Any tips on how to speed up the training or fit more data on the GPU?

TITAN V          | 40'C,   0 % |   316 / 12036 MB
I0905 12:20:54.217084 140321920943872 train.py:59] [100] loss=2071.04, steps/s=0.20
I0905 12:21:45.037879 140321920943872 train.py:59] [110] loss=837.87, steps/s=0.20
I0905 12:22:37.386365 140321920943872 train.py:59] [120] loss=1475.69, steps/s=0.20
I0905 12:23:34.424523 140321920943872 train.py:59] [130] loss=1111.34, steps/s=0.20
I0905 12:24:26.693988 140321920943872 train.py:59] [140] loss=1088.69, steps/s=0.20
I0905 12:25:14.780310 140321920943872 train.py:59] [150] loss=792.43, steps/s=0.20
I0905 12:26:08.272615 140321920943872 train.py:59] [160] loss=1597.89, steps/s=0.20
I0905 12:26:55.389269 140321920943872 train.py:59] [170] loss=1087.88, steps/s=0.20
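
Because the per-step loss in a log like this is noisy, it can be hard to tell by eye whether it is trending down. The sketch below is not part of the original issue; it assumes the `[step] loss=..., steps/s=...` line format shown above, and the helper names `parse_training_log` and `windowed_mean_loss` are hypothetical. It parses the lines and averages the loss over fixed-size windows so the trend is easier to read.

```python
import re
from statistics import mean

# Matches the "[step] loss=..., steps/s=..." part of each training log line.
# Assumption: the format is the one shown in the log excerpt above.
LOG_PATTERN = re.compile(r"\[(\d+)\] loss=([\d.]+), steps/s=([\d.]+)")


def parse_training_log(text):
    """Return a list of (step, loss, steps_per_sec) tuples found in text."""
    return [
        (int(step), float(loss), float(rate))
        for step, loss, rate in LOG_PATTERN.findall(text)
    ]


def windowed_mean_loss(records, window):
    """Average the loss over consecutive, non-overlapping windows of records."""
    losses = [loss for _, loss, _ in records]
    return [mean(losses[i:i + window]) for i in range(0, len(losses), window)]


# The eight log lines quoted in the question.
SAMPLE_LOG = """\
I0905 12:20:54.217084 140321920943872 train.py:59] [100] loss=2071.04, steps/s=0.20
I0905 12:21:45.037879 140321920943872 train.py:59] [110] loss=837.87, steps/s=0.20
I0905 12:22:37.386365 140321920943872 train.py:59] [120] loss=1475.69, steps/s=0.20
I0905 12:23:34.424523 140321920943872 train.py:59] [130] loss=1111.34, steps/s=0.20
I0905 12:24:26.693988 140321920943872 train.py:59] [140] loss=1088.69, steps/s=0.20
I0905 12:25:14.780310 140321920943872 train.py:59] [150] loss=792.43, steps/s=0.20
I0905 12:26:08.272615 140321920943872 train.py:59] [160] loss=1597.89, steps/s=0.20
I0905 12:26:55.389269 140321920943872 train.py:59] [170] loss=1087.88, steps/s=0.20
"""

records = parse_training_log(SAMPLE_LOG)
window_means = windowed_mean_loss(records, window=4)

for index, value in enumerate(window_means):
    print(f"window {index}: mean loss = {value:.2f}")
```

Eight log lines are far too few to draw a conclusion from; on a longer log, a larger window would give a steadier picture of whether the loss is actually decreasing.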

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

mandarjoshi90 commented, Sep 6, 2019

Around 55K steps for SpanBERT base. Here’s the final part of the log.

2019-06-27 00:47:50,244 - INFO - __main__ - [54000] evaL_f1=0.7766, max_f1=0.7766
2019-06-27 00:48:29,488 - INFO - __main__ - [54100] loss=2.11, steps/s=2.11
2019-06-27 00:49:05,383 - INFO - __main__ - [54200] loss=1.67, steps/s=2.11
2019-06-27 00:49:44,517 - INFO - __main__ - [54300] loss=1.57, steps/s=2.11
2019-06-27 00:50:22,661 - INFO - __main__ - [54400] loss=1.81, steps/s=2.11
2019-06-27 00:51:03,515 - INFO - __main__ - [54500] loss=2.94, steps/s=2.11
2019-06-27 00:51:43,155 - INFO - __main__ - [54600] loss=1.69, steps/s=2.11
2019-06-27 00:52:16,367 - INFO - __main__ - [54700] loss=2.86, steps/s=2.11
2019-06-27 00:52:52,681 - INFO - __main__ - [54800] loss=0.89, steps/s=2.11
2019-06-27 00:53:35,524 - INFO - __main__ - [54900] loss=2.33, steps/s=2.11
2019-06-27 00:54:16,111 - INFO - __main__ - [55000] loss=1.13, steps/s=2.12
2019-06-27 00:55:44,326 - INFO - __main__ - [55000] evaL_f1=0.7771, max_f1=0.7771
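
To put the two throughput figures in this thread side by side, the rough arithmetic below estimates wall-clock time for about 55K steps at 0.20 steps/s (the question's log) and at 2.11 steps/s (the log above). This is only a back-of-the-envelope sketch: it assumes both runs would use the same 55K-step schedule and that a "step" is comparable across the two setups, neither of which the thread confirms. The function name `hours_to_reach` is hypothetical.

```python
def hours_to_reach(total_steps, steps_per_sec):
    """Wall-clock hours needed for total_steps at a constant steps_per_sec."""
    return total_steps / steps_per_sec / 3600


TOTAL_STEPS = 55_000   # "Around 55K steps for SpanBERT base"
SLOW_RATE = 0.20       # steps/s reported in the question's log
FAST_RATE = 2.11       # steps/s reported in the maintainer's log

slow_hours = hours_to_reach(TOTAL_STEPS, SLOW_RATE)
fast_hours = hours_to_reach(TOTAL_STEPS, FAST_RATE)
speed_ratio = FAST_RATE / SLOW_RATE

print(f"at {SLOW_RATE} steps/s: {slow_hours:.1f} hours")
print(f"at {FAST_RATE} steps/s: {fast_hours:.1f} hours")
print(f"throughput ratio: {speed_ratio:.1f}x")
```

Under those assumptions the slower run would take roughly ten times as long, so the reported throughput gap is large enough to be worth investigating.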

armancohan commented, Sep 13, 2019

Yes, I think the problem could be domain mismatch. This is helpful. Thanks!

