Low GPU-usage during training

Hi, with sockeye2, the GPU-Util reported by gpustat or nvidia-smi is often low, e.g. as follows:

[0] Quadro RTX 8000  | 50'C,   0 % | 44481 / 48571 MB
[1] Quadro RTX 8000  | 51'C,   0 % | 44705 / 48571 MB
[2] Quadro RTX 8000  | 45'C,   4 % | 44743 / 48571 MB
[3] Quadro RTX 8000  | 50'C,   3 % | 44313 / 48571 MB
[4] Quadro RTX 8000  | 44'C,   7 % | 43667 / 48571 MB
[6] Quadro RTX 8000  | 51'C,  46 % | 44745 / 48571 MB
[7] Quadro RTX 8000  | 48'C,   7 % | 45187 / 48571 MB

The process is run via sockeye.train and not with horovodrun. Is this behaviour expected or are there certain hyperparameters that I have set in a very unoptimized fashion?
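
A single gpustat snapshot like the one above can be summarized programmatically to see how many devices are close to idle. The following is a minimal Python sketch and not part of Sockeye or gpustat: the regular expression is assumed from the exact line format shown above, and the function names and the 10% idle threshold are illustrative choices.

```python
import re

# Assumed line format, taken from the snapshot above, e.g.
# "[0] Quadro RTX 8000  | 50'C,   0 % | 44481 / 48571 MB"
GPUSTAT_LINE = re.compile(
    r"\[(?P<index>\d+)\]\s+(?P<name>.+?)\s*\|\s*"
    r"(?P<temp>\d+)'C,\s*(?P<util>\d+)\s*%\s*\|\s*"
    r"(?P<used>\d+)\s*/\s*(?P<total>\d+)\s*MB"
)


def parse_gpustat(text):
    """Return one dict per GPU line found in gpustat-style text."""
    rows = []
    for line in text.splitlines():
        match = GPUSTAT_LINE.search(line)
        if match is None:
            continue  # skip lines that are not GPU rows
        rows.append({
            "index": int(match["index"]),
            "name": match["name"],
            "temp_c": int(match["temp"]),
            "util_pct": int(match["util"]),
            "mem_used_mb": int(match["used"]),
            "mem_total_mb": int(match["total"]),
        })
    return rows


def summarize(rows, idle_threshold_pct=10):
    """Aggregate utilization and memory use across all parsed GPUs."""
    utils = [row["util_pct"] for row in rows]
    mem_fractions = [row["mem_used_mb"] / row["mem_total_mb"] for row in rows]
    return {
        "gpus": len(rows),
        "mean_util_pct": sum(utils) / len(utils),
        "idle_gpus": sum(1 for u in utils if u < idle_threshold_pct),
        "min_mem_fraction": min(mem_fractions),
    }


# The snapshot from the question, copied verbatim.
SNAPSHOT = """
[0] Quadro RTX 8000  | 50'C,   0 % | 44481 / 48571 MB
[1] Quadro RTX 8000  | 51'C,   0 % | 44705 / 48571 MB
[2] Quadro RTX 8000  | 45'C,   4 % | 44743 / 48571 MB
[3] Quadro RTX 8000  | 50'C,   3 % | 44313 / 48571 MB
[4] Quadro RTX 8000  | 44'C,   7 % | 43667 / 48571 MB
[6] Quadro RTX 8000  | 51'C,  46 % | 44745 / 48571 MB
[7] Quadro RTX 8000  | 48'C,   7 % | 45187 / 48571 MB
"""

summary = summarize(parse_gpustat(SNAPSHOT))
print(summary)
```

In this snapshot, memory is almost fully allocated on every listed device while compute utilization is low on most of them, which is the pattern the question describes.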

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

fhieber commented, Oct 6, 2020 (1 reaction)

Using multiple GPU devices on a single host with MXNet currently does not seem to scale well. I would suggest setting up multi-GPU training with Horovod instead.

graftim commented, Sep 23, 2020 (0 reactions)

We have now tested it with one GPU and with the batch size reduced to 1/8 of the previous size, and the usage is much higher. It never drops below 70%, and the single GPU is around 15°C hotter than in the 8-GPU setup, so it must be some issue with scaling to multiple GPUs.
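
An observation such as "never drops below 70%" is easier to verify by sampling utilization repeatedly than by reading a single snapshot. The sketch below assumes nvidia-smi is on the PATH and uses its CSV query mode; the helper names, sample count, and interval are hypothetical choices and not part of Sockeye.

```python
import subprocess
import time

# nvidia-smi CSV query: one "index, utilization" line per GPU, no header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_query_output(output):
    """Parse lines like "0, 75" into {gpu_index: utilization_percent}."""
    result = {}
    for line in output.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        if not util.isdigit():
            continue  # utilization not reported for this device
        result[int(index)] = int(util)
    return result


def sample_utilization(num_samples=30, interval_s=1.0):
    """Poll nvidia-smi num_samples times, waiting interval_s between polls."""
    samples = []
    for _ in range(num_samples):
        completed = subprocess.run(
            QUERY, capture_output=True, text=True, check=True
        )
        samples.append(parse_query_output(completed.stdout))
        time.sleep(interval_s)
    return samples


def min_utilization(samples):
    """Lowest utilization seen per GPU across all samples."""
    lowest = {}
    for sample in samples:
        for gpu, util in sample.items():
            lowest[gpu] = min(util, lowest.get(gpu, util))
    return lowest


# Usage while training is running (requires an NVIDIA driver):
# print(min_utilization(sample_utilization(num_samples=60, interval_s=1.0)))
```

Running the same sampling once for the single-GPU run and once for the 8-GPU run would give per-device minimums that can be compared directly.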
