Low GPU usage during training
Hi, with Sockeye 2, the GPU utilization shown by gpustat or nvidia-smi is often low, e.g. as follows:
[0] Quadro RTX 8000 | 50'C, 0 % | 44481 / 48571 MB
[1] Quadro RTX 8000 | 51'C, 0 % | 44705 / 48571 MB
[2] Quadro RTX 8000 | 45'C, 4 % | 44743 / 48571 MB
[3] Quadro RTX 8000 | 50'C, 3 % | 44313 / 48571 MB
[4] Quadro RTX 8000 | 44'C, 7 % | 43667 / 48571 MB
[6] Quadro RTX 8000 | 51'C, 46 % | 44745 / 48571 MB
[7] Quadro RTX 8000 | 48'C, 7 % | 45187 / 48571 MB
The process is run via sockeye.train, not with horovodrun. Is this behaviour expected, or are there certain hyperparameters that I have set in an unoptimized way?
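For context, a plain multi-device launch of Sockeye 2 looks roughly like the sketch below (not the reporter's actual command; the data paths, device ids, and batch settings are illustrative placeholders):

    python -m sockeye.train \
        --source train.src --target train.trg \
        --validation-source dev.src --validation-target dev.trg \
        --output model_dir \
        --device-ids 0 1 2 3 4 6 7 \
        --batch-type word --batch-size 32768

In this launch mode, a single MXNet process splits each batch across the listed devices on one host, rather than running one worker process per GPU.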

Using multiple GPU devices on a single host with MXNet currently seems not to scale well. I would suggest setting up multi-GPU training with Horovod instead.
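A Horovod-based launch looks roughly like the following sketch (along the lines of the Sockeye 2 Horovod setup; the process count, paths, and batch size are placeholders, and since each Horovod process trains on its own batch, the per-process batch size should be the global batch size divided by the number of workers):

    horovodrun -np 8 python -m sockeye.train \
        --source train.src --target train.trg \
        --validation-source dev.src --validation-target dev.trg \
        --output model_dir \
        --horovod \
        --batch-type word --batch-size 4096

This runs one training process per GPU with gradients averaged across processes, avoiding the single-process multi-device code path in MXNet.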
We have now tested it with one GPU and reduced the batch size to 1/8 of the previous size, and the utilization is much higher: it never drops below 70%, and the single GPU also runs around 15°C hotter than in the 8-GPU setup, so it must be an issue with scaling to multiple GPUs.
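That test corresponds to a launch along these lines (again a sketch with placeholder values; the point is a single device and 1/8 of the multi-GPU batch):

    python -m sockeye.train \
        --source train.src --target train.trg \
        --validation-source dev.src --validation-target dev.trg \
        --output model_dir \
        --device-ids 0 \
        --batch-type word --batch-size 4096  # 1/8 of the 8-GPU batch size

Holding the per-device batch constant while varying only the number of devices is a quick way to separate multi-GPU synchronization overhead from other causes of low utilization, such as data loading.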