Very low GPU usage when training on 8 GPU in a single machine
Issue Description
Hi, I am currently pretraining BERT on my own data. I am using the alpha0.0.1a5 branch (the newest version).
I found that only about 20% of the GPU is in use:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:3F:00.0 Off | 0 |
| N/A 40C P0 58W / 300W | 10296MiB / 16152MiB | 32% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:40:00.0 Off | 0 |
| N/A 37C P0 55W / 300W | 2742MiB / 16152MiB | 23% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:41:00.0 Off | 0 |
| N/A 40C P0 58W / 300W | 2742MiB / 16152MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:42:00.0 Off | 0 |
| N/A 47C P0 61W / 300W | 2742MiB / 16152MiB | 24% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... On | 00000000:62:00.0 Off | 0 |
| N/A 36C P0 98W / 300W | 2742MiB / 16152MiB | 17% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00000000:63:00.0 Off | 0 |
| N/A 38C P0 88W / 300W | 2736MiB / 16152MiB | 23% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 00000000:64:00.0 Off | 0 |
| N/A 48C P0 80W / 300W | 2736MiB / 16152MiB | 25% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... On | 00000000:65:00.0 Off | 0 |
| N/A 46C P0 71W / 300W | 2736MiB / 16152MiB | 24% Default |
+-------------------------------+----------------------+----------------------+
I am not familiar with PyTorch. Does anyone know why?
Issue Analytics
- State:
- Created 4 years ago
- Reactions:1
- Comments:13 (5 by maintainers)
Top Results From Across the Web
Why Is My GPU Usage So Low? 11 Causes and Fixes
Your GPU usage is very low because you're using the integrated graphics, there's a driver issue, you have a CPU bottleneck, or the...
What could a low GPU utilization mean when training a neural ...
It means that you don't have data to process on GPU. One reason can be IO as Tony Petrov wrote. Two other reasons...
What is the reason for low GPU util when training machine ...
It means that one experiment runs on only one GPU even though I use DP. Also, the model is not really big. It only...
Efficient Training on a Single GPU - Hugging Face
This guide focuses on training large models efficiently on a single GPU. These approaches are still valid if you have access to a...
Monitor and Improve GPU Usage for Training Deep Learning ...
The danger of taking a single measurement is that GPU usage can ... with a lower learning rate and then ramping it up...
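One of the snippets above points to I/O as a common cause: the GPUs sit idle waiting for the CPU to produce batches. A minimal sketch of the usual PyTorch fix, assuming a standard `DataLoader` setup (the dataset here is a dummy stand-in):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for real tokenized pretraining data.
dataset = TensorDataset(torch.randint(0, 100, (1024, 128)))

# num_workers > 0 overlaps CPU-side batch preparation with GPU compute;
# pin_memory speeds up host-to-device copies of each batch.
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=2,
    pin_memory=True,
)

batch = next(iter(loader))[0]  # shape: (64, 128)
```

If raising `num_workers` lifts GPU utilization, the bottleneck was data loading rather than the model itself.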
Try moving the criterion into the forward function in bert.py and returning just the masked language model loss (the encoder output is not used afterwards). The speed will more than double, and GPU utilization rises to 70%-90%.
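The idea is that `nn.DataParallel` gathers each replica's forward output onto GPU 0; returning the full logits tensor makes GPU 0 the bottleneck, while returning a per-replica scalar loss keeps the heavy work distributed. A minimal sketch of the trick, with a tiny stand-in encoder (names like `TinyBert` and `MLMWrapper` are illustrative, not the repository's actual classes):

```python
import torch
import torch.nn as nn

class TinyBert(nn.Module):
    """Stand-in for the real BERT encoder (illustrative only)."""
    def __init__(self, vocab_size=100, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids):
        return self.out(self.emb(input_ids))  # (batch, seq, vocab)

class MLMWrapper(nn.Module):
    """Compute the MLM loss inside forward() so DataParallel gathers
    only a one-element loss per replica instead of the full logits."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.criterion = nn.CrossEntropyLoss(ignore_index=0)

    def forward(self, input_ids, labels):
        logits = self.encoder(input_ids)
        loss = self.criterion(logits.view(-1, logits.size(-1)),
                              labels.view(-1))
        return loss.unsqueeze(0)  # 1-element tensor per replica

model = MLMWrapper(TinyBert())
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).cuda()

input_ids = torch.randint(1, 100, (8, 16))
labels = torch.randint(1, 100, (8, 16))
loss = model(input_ids, labels).mean()  # average the per-replica losses
loss.backward()
```

With this layout each GPU computes its own loss locally, and only the tiny loss tensors cross devices during the gather.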
I learned the trick from a newbie colleague.
Sorry, I ran into an OOM problem, but I don't understand exactly what you meant. Is there something different between your screenshot and the author's original code?