
Low GPU utilization with tfjs-node-gpu


TensorFlow.js version

  "dependencies": {
    "@tensorflow/tfjs": "^0.11.4",
    "@tensorflow/tfjs-node": "^0.1.5",
    "@tensorflow/tfjs-node-gpu": "^0.1.7",
}

Browser version

N/A. Node v8.9.4. Ubuntu 16.04

Describe the problem or feature request

Using tfjs-node-gpu, I can’t seem to get GPU utilization above ~0–3%. I have CUDA 9 and cuDNN 7.1 installed, am importing @tensorflow/tfjs-node-gpu, and am setting the “tensorflow” backend with tf.setBackend('tensorflow'). CPU usage is at 100% on one core, but GPU utilization is practically zero. I’ve tried tfjs-examples/baseball-node (replacing import '@tensorflow/tfjs-node' with import '@tensorflow/tfjs-node-gpu', of course) as well as my own custom LSTM code. Does tfjs-node-gpu actually run operations on the GPU?
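
For reference, the setup described above amounts to something like the following (a minimal sketch against the 0.x-era API listed in the dependencies; the large matMul at the end is just an illustrative workload, not part of the original report):

import '@tensorflow/tfjs-node-gpu';
import * as tf from '@tensorflow/tfjs';

// In the 0.x releases the native binding registers itself as the
// 'tensorflow' backend, which must be selected explicitly.
tf.setBackend('tensorflow');

// Illustrative workload: a large matMul that should land on the GPU
// if CUDA/cuDNN were found when the binding was installed.
const a = tf.randomNormal([2048, 2048]);
const b = tf.randomNormal([2048, 2048]);
tf.matMul(a, b).data().then(() => console.log('matMul finished'));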

Code to reproduce the bug / link to feature request

# assumes CUDA 9, cuDNN 7.1, and the latest NVIDIA drivers are already installed
git clone https://github.com/tensorflow/tfjs-examples
cd tfjs-examples/baseball-node

# replace tfjs-node import with tfjs-node-gpu
sed -i 's/tfjs-node/tfjs-node-gpu/' src/server/server.ts

# install dependencies and download data
yarn add @tensorflow/tfjs-node-gpu
yarn && yarn download-data

# start the server
yarn start-server

Now open another terminal and watch GPU usage. Note that if you are running the process on the same GPU as an X window server, GPU utilization will likely read higher than 3% because of that process. I’ve tested this on a dedicated GPU running no other processes by using the CUDA_VISIBLE_DEVICES env var (see the snippet after the monitoring command below).

# monitor GPU utilization
watch -n 0.1 nvidia-smi
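
For completeness, pinning the server to an otherwise idle GPU looks like this (the device index 1 is just an example):

# run the server on GPU 1 only, so the reading isn't polluted by e.g. an X server on GPU 0
CUDA_VISIBLE_DEVICES=1 yarn start-server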

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

brannondorsey commented on Jun 28, 2018 (4 reactions)

Gotcha, thanks for the clarification. I’ve revisited the char-rnn tfjs-node-gpu example I mentioned, and it does appear to be running on the GPU, since memory is allocated, but GPU utilization is ~1%. If I’m understanding you correctly, this is because tfjs-node-gpu uses TF eager mode. So I should expect the same type of model to run at ~1% GPU utilization if it were written in Python using TF eager mode as well, correct?
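
(For intuition, here is a rough, generic illustration of why op-by-op eager dispatch can leave a GPU mostly idle; this sketch is not from the original thread and says nothing about tfjs-node-gpu’s internals:)

import '@tensorflow/tfjs-node-gpu';
import * as tf from '@tensorflow/tfjs';

// Many tiny ops: each matMul/add/tanh is dispatched to the backend
// individually, so per-op launch overhead dominates and the GPU idles.
let x = tf.randomNormal([64, 64]) as tf.Tensor2D;
for (let i = 0; i < 1000; i++) {
  const next = tf.tidy(() => tf.matMul(x, x).add(1).tanh() as tf.Tensor2D);
  x.dispose();
  x = next;
}

// One big op: a single large matMul keeps the GPU busy, so utilization
// spikes even though far less JavaScript runs.
const a = tf.randomNormal([4096, 4096]);
const b = tf.randomNormal([4096, 4096]);
tf.matMul(a, b).data().then(() => console.log('done'));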

Does tfjs-node-gpu intend to add support for graph-based execution at some point in the near future? Unless I’m missing something, this “eager mode only” behavior creates some significant performance hurdles, no? In general, how does tfjs-node-gpu compare in performance to similar implementations in Keras?

I ask because I’m writing some documentation for my team and am beginning to consider a JavaScript-first approach to common high-level ML tasks. A year ago that would have seemed like a crazy idea, but with tfjs, maybe not so much. Basically, I’m curious whether tfjs-node-gpu will ever be comparable in performance to Keras and Python TensorFlow.

f4z3k4s commented on Jan 27, 2022 (0 reactions)

We experience the same thing. Running our model on the CPU takes ~400ms; running it on the GPU takes ~3000ms. This happens on a server with two NVIDIA GeForce RTX 3090s, CUDA 11.6, and cuDNN 8.3. Relevant logs:

2022-01-27 22:48:03.044007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 19758 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:65:00.0, compute capability: 8.6
2022-01-27 22:48:03.044598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 22307 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:b4:00.0, compute capability: 8.6
2022-01-27 22:48:04.985189: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8302
2022-01-27 22:48:06.383271: I tensorflow/stream_executor/cuda/cuda_blas.cc:1774] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.

I can confirm that CUDA is installed correctly, as I am able to use it with several other tools.

This does not happen in the browser, though: running on WebGL is much faster than CPU inference.


UPDATE: I have to admit that I was only testing with a single inference rather than hundreds or thousands. I created test suites for larger numbers of inferences, and it is indeed the copying of the model into GPU memory that takes most of the time. Once that is done, GPU inference is much faster than CPU inference:

GPU info:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.39.01    Driver Version: 510.39.01    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:65:00.0 Off |                  N/A |
|  0%   26C    P8    34W / 390W |   2552MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:B4:00.0 Off |                  N/A |
|  0%   28C    P8    24W / 350W |      3MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    360790      C   ...9TtSrW0h-py3.7/bin/python     2549MiB |
+-----------------------------------------------------------------------------+

CPU info:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          24
On-line CPU(s) list:             0-23
Thread(s) per core:              2
Core(s) per socket:              12
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
Stepping:                        4
CPU MHz:                         1000.089
CPU max MHz:                     3200.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        4600.00
Virtualization:                  VT-x
L1d cache:                       384 KiB
L1i cache:                       384 KiB
L2 cache:                        12 MiB
L3 cache:                        16.5 MiB
NUMA node0 CPU(s):               0-23

The following are the results of averaging 100 inferences on a hot GPU (the model is loaded into GPU memory and not disposed between model.execute calls); a sketch of this measurement loop follows the table:

Model            GPU      CPU
yolov5 s model   61.9ms   146.6ms
yolov5 m model   73.9ms   255.1ms
yolov5 l model   85.1ms   386.4ms
yolov5 x model   97.3ms   609.1ms
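
A minimal sketch of such a warm-up-then-measure loop (the model path and input shape are hypothetical, and a single-output model is assumed; loadGraphModel and model.execute are the standard tfjs GraphModel APIs):

import * as tf from '@tensorflow/tfjs-node-gpu';

async function benchmark() {
  // Hypothetical path to a converted yolov5 graph model.
  const model = await tf.loadGraphModel('file://./yolov5s_web_model/model.json');
  const input = tf.zeros([1, 640, 640, 3]); // assumed input shape

  // Warm-up: the first executions pay for copying the weights into GPU
  // memory, so keep them out of the measurement.
  for (let i = 0; i < 10; i++) {
    const out = model.execute(input) as tf.Tensor;
    await out.data();
    out.dispose();
  }

  // Hot measurement: average 100 runs without disposing the model.
  const runs = 100;
  const start = Date.now();
  for (let i = 0; i < runs; i++) {
    const out = model.execute(input) as tf.Tensor;
    await out.data(); // force the async GPU work to finish
    out.dispose();
  }
  console.log(`avg inference: ${((Date.now() - start) / runs).toFixed(1)}ms`);

  input.dispose();
}

benchmark();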