
Hi all, I tested the training example in the README. I found that the volatile GPU-util of almost all GPUs is 0% (except the first one), yet the full memory of every GPU is allocated. I'm not sure whether this is a TensorFlow or a tensor2tensor error.

Thank you

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      Off  | 0000:04:00.0     Off |                    0 |
| N/A   56C    P0   187W / 250W |  21871MiB / 22939MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40 24GB      Off  | 0000:05:00.0     Off |                    0 |
| N/A   28C    P0    56W / 250W |  21806MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M40 24GB      Off  | 0000:08:00.0     Off |                    0 |
| N/A   28C    P0    55W / 250W |  21804MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M40 24GB      Off  | 0000:09:00.0     Off |                    0 |
| N/A   29C    P0    55W / 250W |  21804MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla M40 24GB      Off  | 0000:86:00.0     Off |                    0 |
| N/A   29C    P0    56W / 250W |  21808MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla M40 24GB      Off  | 0000:87:00.0     Off |                    0 |
| N/A   27C    P0    57W / 250W |  21806MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla M40 24GB      Off  | 0000:8A:00.0     Off |                    0 |
| N/A   30C    P0    57W / 250W |  21804MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla M40 24GB      Off  | 0000:8B:00.0     Off |                    0 |
| N/A   27C    P0    56W / 250W |  21804MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
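(A side note on the memory numbers above: TensorFlow 1.x by default reserves nearly all memory on every visible GPU when a session starts, even on GPUs that end up doing no work, which matches the ~21 GB allocated on the idle GPUs here. A minimal sketch of the standard TF 1.x option to allocate on demand instead, so nvidia-smi's memory column reflects actual use:)

import tensorflow as tf

# By default TF 1.x grabs almost all memory on every visible GPU at startup.
# allow_growth makes it allocate memory on demand instead.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)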

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 16 (13 by maintainers)

Top GitHub Comments

31 reactions
lukaszkaiser commented, Jun 22, 2017

A single step runs 1 batch on each GPU, so a step is always slower with more GPUs: in addition to running the batch on every GPU, you need them all to sync. But you're effectively running 8x more examples per step.

Our batch_size might be a bit misleading: it's calculated (1) per token and (2) per GPU.

To understand (1), assume you first process a batch of sentences of 15 words each, and then another of sentences of 40 words each. If you keep the same batch size, your memory use in the second case (40 words) will be over 2x that of the first case (15 words), since all hidden activations have a length dimension. So either you set a low batch_size and under-utilize your GPU in the 15-word case, or you set a high one and risk an OOM in the 40-word case. That's why we have a per-token batch size: the number of sentences varies with sentence length. If batch_size=4096 and the sentences have 15 words, we'll actually get a batch of 4096 // 15 = 273 sentences; but if the length is 40, we'll take a batch of 4096 // 40 = 102 sentences.

Coming to (2), this is per GPU. If you have another GPU, you can easily process another batch of the same size there. We want to avoid changing batch sizes when we use more GPUs, because we sometimes test a model on 1 or 2 GPUs and then run it on more, and it's helpful not to have to change the hyperparameters each time we run on different hardware. So batch_size = 4096 actually means you're running 4096 tokens (not sentences) on each GPU you have: 0.912966 * 1 * 4096 = 3740 tokens/s in the 1-GPU case, and 0.762162 * 8 * 4096 = 24975 tokens/s on 8 GPUs.

Hope that helps you understand it!
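(To make the arithmetic above concrete, here is a small standalone sketch, plain Python rather than tensor2tensor's actual input pipeline; the step rates 0.912966 and 0.762162 are the steps/sec figures quoted in the comment:)

def sentences_per_batch(batch_size_tokens, sentence_len):
    # With a per-token batch size, longer sentences mean fewer
    # sentences per batch, keeping memory use roughly constant.
    return batch_size_tokens // sentence_len

def tokens_per_sec(steps_per_sec, num_gpus, batch_size_tokens):
    # batch_size is per GPU, so throughput scales with GPU count.
    return steps_per_sec * num_gpus * batch_size_tokens

print(sentences_per_batch(4096, 15))             # 273 sentences of 15 words
print(sentences_per_batch(4096, 40))             # 102 sentences of 40 words
print(round(tokens_per_sec(0.912966, 1, 4096)))  # ~3740 tokens/s on 1 GPU
print(round(tokens_per_sec(0.762162, 8, 4096)))  # ~24975 tokens/s on 8 GPUs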

0 reactions
EthannyDing commented, May 17, 2019

I have the same problem: my sole GPU's usage is close to 0% and CPU usage is high while training a tensor2tensor model. I'm using the following installations:

tensor2tensor==1.7.0
tensorboard==1.13.1
tensorflow==1.13.1
tensorflow-estimator==1.13.0
tensorflow-gpu==1.13.1
tensorflow-tensorboard==1.5.1

I tried to uninstall tensorflow and keep tensorflow-gpu, and it reported an error: ModuleNotFoundError: No module named 'tensorflow.python'. Does anyone know what is going wrong in the training?
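(One sanity check worth noting here: having both tensorflow and tensorflow-gpu installed at once is a common cause of this symptom, since the CPU-only package can shadow the GPU build. A minimal check using the TF 1.x API to see whether TensorFlow can use a GPU at all:)

import tensorflow as tf
from tensorflow.python.client import device_lib

# True only if TF was built with CUDA and a GPU device is visible.
print(tf.test.is_gpu_available())

# Lists every device TF can use; a working GPU install shows
# a /device:GPU:0 entry alongside the CPU.
print([d.name for d in device_lib.list_local_devices()])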


Top Results From Across the Web

What Should Your GPU Utilization Be? [Different Workloads ...
During regular desktop use, your GPU utilization shouldn't be very high. If you aren't watching any videos or something of that nature, your...

How to Monitor GPU Usage in the Windows Task Manager
To monitor overall GPU resource usage statistics, click the "Performance" tab and look for the "GPU" option in the sidebar; you may have to...

How To Check GPU Usage In Windows - TechNewsToday
You can check overall GPU Utilization, Dedicated/Shared GPU Memory, usage per engine, and much more directly from the Task Manager.

GPU usage monitoring (CUDA) - Unix & Linux Stack Exchange
For Nvidia GPUs there is a tool nvidia-smi that can show memory usage, GPU utilization and temperature of the GPU. There also is a...

Is 100% GPU Usage Bad or Good? How to Fix 100 ... - MiniTool
GPU usage is a quite contextual parameter, thus it reaches different values in different games. For heavy games, 100% GPU usage is good,...
