[RLlib] Low GPU utilization with Apex and IMPALA
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
- Ray installed from (source or binary): Binary (pip install ray[rllib] )
- Ray version: 0.7.5
- Python version: 3.5.2
- Exact command to reproduce:
  - rl-experiments: `rllib train -f atari-apex/atari-apex.yaml` and `rllib train -f pong-speedrun/pong-impala-fast.yaml`
  - rllib/tuned_examples: `rllib train -f tuned_examples/atari-apex.yaml`
Describe the problem
I am hoping to accelerate RL algorithms for one of my problems. Since Ape-X and IMPALA seem to be the best fit, I tried both on two system configurations: (a) 4 CPUs, 1 GPU (GTX 1070); (b) 40 CPUs, 4 GPUs (V100, DGX Station). In either case, however, GPU utilization is quite low even with the examples shipped with Ray (in rllib/tuned_examples and the rl-experiments repo): it is always below 8% (and only intermittently that high), and 0% most of the time.
Is this expected behavior? The results presented at https://github.com/ray-project/rl-experiments/blob/master/README.md do not mention GPU utilization levels either.
Kindly let me know if I am doing something wrong, or whether there is anything I can try to improve GPU utilization and accelerate RL training.
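For context, GPU utilization in IMPALA is driven mostly by how fast sample batches reach the central learner. Below is a hedged sketch of the config knobs that typically govern that throughput; the key names follow RLlib's common config conventions of that era, but the values are illustrative assumptions, not tuned recommendations:

```python
# Illustrative IMPALA throughput settings (values are assumptions, not
# tuned numbers). The GPU learner can only stay busy if the CPU workers
# produce sample batches faster than the learner consumes them.
impala_config = {
    "num_workers": 4,          # CPU actors generating experience
    "num_envs_per_worker": 5,  # vectorize envs so each worker samples faster
    "sample_batch_size": 50,   # rollout fragment each worker sends
    "train_batch_size": 500,   # batch size consumed by the GPU learner
    "num_gpus": 1,             # GPUs reserved for the learner
}

# Roughly, the learner needs train_batch_size / sample_batch_size worker
# fragments per SGD step; if workers can't keep that up, the GPU idles.
fragments_per_step = (impala_config["train_batch_size"]
                      // impala_config["sample_batch_size"])
```

If sampling is the bottleneck, raising `num_envs_per_worker` (or adding workers) usually moves the needle more than touching the learner settings.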
Source code / logs
Using FIFO scheduling algorithm.
Resources requested: 4/4 CPUs, 1/1 GPUs, 0.0/8.94 GiB heap, 0.0/3.08 GiB objects
Memory usage on this node: 8.1/15.6 GiB
Result logdir: /home/ankdesh/ray_results/pong-impala-fast
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
- IMPALA_PongNoFrameskip-v4_0: RUNNING, [4 CPUs, 1 GPUs], [pid=15082], 10 s, 1 iter, 15000 ts, -20.6 rew
nvidia-smi
ankdesh@6f012718bda7:~$ nvidia-smi
Wed Oct 2 07:11:55 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 On | 00000000:01:00.0 On | N/A |
| 0% 47C P8 15W / 200W | 515MiB / 8118MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Issue Analytics
- Created: 4 years ago
- Comments: 5
Top GitHub Comments
@ankdesh Hey, I did some digging into this and read through the original Ape-X paper. In their setup, the workers run on CPU and only a single learner network runs on the GPU. If that is how it is implemented in RLlib, that may be why we see low GPU usage. I haven't gone through the Ape-X code in RLlib, but I think that's the case.
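To make that resource split concrete, here is a hedged sketch of what it would look like in an RLlib-style config. The key names follow RLlib's common conventions, but this is an illustration of the CPU-workers/GPU-learner layout the paper describes, not the exact defaults shipped with Ray:

```python
# Sketch of the Ape-X resource split described in the paper: many CPU-only
# rollout actors feed a single GPU learner. Aggregate GPU utilization then
# reflects only the learner's SGD throughput, not environment stepping.
apex_config = {
    "num_workers": 8,          # rollout actors, CPU only
    "num_gpus_per_worker": 0,  # workers do inference on CPU
    "num_gpus": 1,             # the single learner holds the GPU
    "train_batch_size": 512,   # learner SGD batch pulled from replay
}

# Only the learner process touches the GPU under this layout.
gpu_consumers = 1 if apex_config["num_gpus"] > 0 else 0
```

Under this layout, low whole-machine GPU utilization is plausible whenever environment stepping or replay sampling, not the learner's SGD, is the bottleneck.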
@ankdesh I am facing the same problem. I also ran the tuned example
pong-speedrun/pong-impala-fast.yaml
which should finish in under 7 minutes on 32 CPUs. I ran it for 1 hour on 10 CPUs and it still did not move from a negative reward. My GPU usage is also under 8%.
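One caveat when quoting utilization figures: a single glance at nvidia-smi can be misleading, because the learner's SGD steps are bursty. A small sketch of averaging utilization over a window, using nvidia-smi's real `--query-gpu` interface (the sampling interval and the parsing helper are my own choices, not part of any Ray tooling):

```python
import subprocess
import time

def parse_utilization(csv_output: str) -> list:
    """Parse lines like '37 %' as produced by
    nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader."""
    values = []
    for line in csv_output.strip().splitlines():
        values.append(int(line.replace("%", "").strip()))
    return values

def average_gpu_utilization(samples: int = 10, interval_s: float = 1.0) -> float:
    """Poll nvidia-smi and average utilization across samples and all GPUs."""
    readings = []
    for _ in range(samples):
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader"],
            text=True)
        readings.extend(parse_utilization(out))
        time.sleep(interval_s)
    return sum(readings) / len(readings)
```

Running `average_gpu_utilization()` alongside training gives a steadier number to compare against the <8% figure than eyeballing the nvidia-smi table.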