
Very low GPU usage when training on 8 GPU in a single machine

See original GitHub issue

Hi, I am currently pretraining BERT on my own data, using the alpha0.0.1a5 branch (the newest version).
I found that only about 20% of each GPU is in use.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:3F:00.0 Off |                    0 |
| N/A   40C    P0    58W / 300W |  10296MiB / 16152MiB |     32%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:40:00.0 Off |                    0 |
| N/A   37C    P0    55W / 300W |   2742MiB / 16152MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   40C    P0    58W / 300W |   2742MiB / 16152MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:42:00.0 Off |                    0 |
| N/A   47C    P0    61W / 300W |   2742MiB / 16152MiB |     24%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
| N/A   36C    P0    98W / 300W |   2742MiB / 16152MiB |     17%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:63:00.0 Off |                    0 |
| N/A   38C    P0    88W / 300W |   2736MiB / 16152MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:64:00.0 Off |                    0 |
| N/A   48C    P0    80W / 300W |   2736MiB / 16152MiB |     25%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:65:00.0 Off |                    0 |
| N/A   46C    P0    71W / 300W |   2736MiB / 16152MiB |     24%      Default |
+-------------------------------+----------------------+----------------------+

I am not familiar with PyTorch. Does anyone know why?
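One common first check for low multi-GPU utilization (not raised in the issue itself, but mentioned in the IO-bottleneck result linked further down): make sure the input pipeline can feed the GPUs fast enough. A minimal sketch with PyTorch's DataLoader, using a stand-in in-memory dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 1024 token sequences of length 128.
data = TensorDataset(torch.randint(0, 100, (1024, 128)))

# num_workers > 0 prepares batches in background processes, and
# pinned memory speeds up host-to-GPU copies, so the GPUs spend
# less time sitting idle waiting for input data.
loader = DataLoader(
    data,
    batch_size=64,
    shuffle=True,
    num_workers=2,
    pin_memory=torch.cuda.is_available(),
)

for (batch,) in loader:
    # on a GPU machine: batch = batch.cuda(non_blocking=True)
    pass
```

If utilization stays low even with workers enabled, the bottleneck is elsewhere (in this thread, it turned out to be where the loss was computed).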

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 1
  • Comments: 13 (5 by maintainers)

Top GitHub Comments

3 reactions
wq343580510 commented, Nov 5, 2018

Try moving the criterion into the forward function in bert.py and returning only the masked-language-model loss (the encoder output is not used afterwards). The speed will more than double, and GPU utilization will reach 70%-90%.

I learned this trick from a newbie colleague.
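A rough sketch of what this suggestion amounts to (the wrapper name, the simplified forward signature, and the toy model are assumptions for illustration, not the repository's actual code). Under nn.DataParallel, whatever forward() returns is gathered onto the default GPU, so returning a per-replica scalar loss avoids gathering the full (batch, seq_len, vocab_size) output tensor:

```python
import torch
import torch.nn as nn

class MaskedLMWithLoss(nn.Module):
    """Wrap the language model so forward() returns the loss.

    Under nn.DataParallel, whatever forward() returns is gathered
    onto the default GPU. Returning a per-replica scalar loss avoids
    gathering the full (batch, seq_len, vocab_size) output tensor.
    """

    def __init__(self, lm: nn.Module):
        super().__init__()
        self.lm = lm
        # ignore_index=0 assumes label 0 marks unmasked/pad positions
        self.criterion = nn.NLLLoss(ignore_index=0)

    def forward(self, tokens, labels):
        # lm is assumed to return log-probabilities of shape
        # (batch, seq_len, vocab_size)
        log_probs = self.lm(tokens)
        # NLLLoss expects (batch, vocab_size, seq_len)
        loss = self.criterion(log_probs.transpose(1, 2), labels)
        # 1-element tensor: DataParallel gathers one number per replica
        return loss.unsqueeze(0)

# Tiny stand-in language model, only to make the sketch runnable.
class ToyLM(nn.Module):
    def __init__(self, vocab_size=32, hidden=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        return torch.log_softmax(self.head(self.embed(tokens)), dim=-1)

model = MaskedLMWithLoss(ToyLM())
# On the 8-GPU machine: model = nn.DataParallel(model).cuda()
tokens = torch.randint(1, 32, (4, 10))
labels = torch.randint(1, 32, (4, 10))
loss = model(tokens, labels).mean()
loss.backward()
```

The .mean() at the call site averages the one-element losses gathered from each replica before backpropagation.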

1 reaction
Huntersxsx commented, Apr 20, 2020

Try moving the criterion into the forward function in bert.py and returning only the masked-language-model loss (the encoder output is not used afterwards). The speed will more than double, and GPU utilization will reach 70%-90%.

I learned this trick from a newbie colleague.

Sorry, I ran into an OOM problem, and I don’t understand exactly what you meant. Is there something different between your screenshot and the author’s original code?
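The thread does not answer the OOM question, but the usual workaround for out-of-memory errors during training is to shrink the per-step batch and accumulate gradients over several micro-batches. A minimal sketch with a toy model (not the repository's code):

```python
import torch
import torch.nn as nn

# Gradient accumulation: run several small micro-batches, then take
# one optimizer step, matching the gradients of one large batch while
# holding only a micro-batch's activations in memory at a time.
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
accum_steps = 4  # effective batch = 4 micro-batches

optimizer.zero_grad()
for step in range(8):
    x = torch.randn(8, 16)                     # micro-batch
    y = torch.randn(8, 1)
    loss = loss_fn(model(x), y) / accum_steps  # scale to average grads
    loss.backward()                            # grads accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Dividing the loss by accum_steps makes the accumulated gradient equal to the average over the large effective batch rather than the sum.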


Top Results From Across the Web

  • Why Is My GPU Usage So Low? 11 Causes and Fixes: Your GPU usage is very low because you're using the integrated graphics, there's a driver issue, you have a CPU bottleneck, or the...
  • What could a low GPU utilization mean when training a neural ...: It means that you don't have data to process on GPU. One reason can be IO as Tony Petrov wrote. Two other reasons...
  • What is the reason for low GPU util when training machine ...: It means that one experiment works only on one gpu though I use DP. And also, model is not really big. It only...
  • Efficient Training on a Single GPU - Hugging Face: This guide focuses on training large models efficiently on a single GPU. These approaches are still valid if you have access to a...
  • Monitor and Improve GPU Usage for Training Deep Learning ...: The danger of taking a single measurement is that GPU usage can ... with a lower learning rate and then ramping it up...
