Low GPU usage during training
Hi, with Sockeye 2, the GPU utilization shown by gpustat or nvidia-smi is often low, e.g. as follows:
[0] Quadro RTX 8000 | 50'C, 0 % | 44481 / 48571 MB
[1] Quadro RTX 8000 | 51'C, 0 % | 44705 / 48571 MB
[2] Quadro RTX 8000 | 45'C, 4 % | 44743 / 48571 MB
[3] Quadro RTX 8000 | 50'C, 3 % | 44313 / 48571 MB
[4] Quadro RTX 8000 | 44'C, 7 % | 43667 / 48571 MB
[6] Quadro RTX 8000 | 51'C, 46 % | 44745 / 48571 MB
[7] Quadro RTX 8000 | 48'C, 7 % | 45187 / 48571 MB
The process is run via sockeye.train, not with horovodrun. Is this behaviour expected, or are there certain hyperparameters that I have set in an unoptimized way?
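For context, a plain multi-device launch of Sockeye 2 looks roughly like the sketch below (not the reporter's actual command; the data paths, device ids, and batch settings are illustrative placeholders):

    python -m sockeye.train \
        --source train.src --target train.trg \
        --validation-source dev.src --validation-target dev.trg \
        --output model_dir \
        --device-ids 0 1 2 3 4 6 7 \
        --batch-type word --batch-size 32768

In this launch mode, a single MXNet process splits each batch across the listed devices on one host, rather than running one worker process per GPU.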

Using multiple GPU devices on a single host with MXNet currently seems not to scale well. I would suggest setting up multi-GPU training with Horovod instead.
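A Horovod-based launch looks roughly like the following sketch (along the lines of the Sockeye 2 Horovod setup; the process count, paths, and batch size are placeholders, and since each Horovod process trains on its own batch, the per-process batch size should be the global batch size divided by the number of workers):

    horovodrun -np 8 python -m sockeye.train \
        --source train.src --target train.trg \
        --validation-source dev.src --validation-target dev.trg \
        --output model_dir \
        --horovod \
        --batch-type word --batch-size 4096

This runs one training process per GPU with gradients averaged across processes, avoiding the single-process multi-device code path in MXNet.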
We have now tested it with one GPU and reduced the batch size to 1/8 of the previous size, and the utilization is much higher: it never drops below 70%, and the single GPU also runs around 15°C hotter than in the 8-GPU setup, so it must be an issue with scaling to multiple GPUs.
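That test corresponds to a launch along these lines (again a sketch with placeholder values; the point is a single device and 1/8 of the multi-GPU batch):

    python -m sockeye.train \
        --source train.src --target train.trg \
        --validation-source dev.src --validation-target dev.trg \
        --output model_dir \
        --device-ids 0 \
        --batch-type word --batch-size 4096  # 1/8 of the 8-GPU batch size

Holding the per-device batch constant while varying only the number of devices is a quick way to separate multi-GPU synchronization overhead from other causes of low utilization, such as data loading.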