[RLlib] Low GPU utilization with Apex and IMPALA
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
- Ray installed from (source or binary): Binary (pip install ray[rllib] )
- Ray version: 0.7.5
- Python version: 3.5.2
- Exact command to reproduce:
  - rl-experiments: `rllib train -f atari-apex/atari-apex.yaml` and `rllib train -f pong-speedrun/pong-impala-fast.yaml`
  - rllib/tuned_examples: `rllib train -f tuned_examples/atari-apex.yaml`
Describe the problem
I am hoping to accelerate RL algorithms for one of my problems. Since Ape-X and IMPALA seem to be the best fit, I tried both on two system configurations: (a) 4 CPUs, 1 GPU (GTX 1070); (b) 40 CPUs, 4 GPUs (V100, DGX Station). In either case, however, GPU utilization is quite low even with the examples shipped with Ray (in rllib/tuned_examples and the rl-experiments repo): it is always below 8% (and only intermittently that high), and 0% most of the time.
Is this expected behavior? The results presented at https://github.com/ray-project/rl-experiments/blob/master/README.md do not mention GPU utilization levels either.
Kindly let me know if I am doing something wrong, or whether there is anything I can try to improve GPU utilization and accelerate RL training.
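For context, GPU utilization in IMPALA is driven mostly by how fast sample batches reach the central learner. Below is a hedged sketch of the config knobs that typically govern that throughput; the key names follow RLlib's common config conventions of that era, but the values are illustrative assumptions, not tuned recommendations:

```python
# Illustrative IMPALA throughput settings (values are assumptions, not
# tuned numbers). The GPU learner can only stay busy if the CPU workers
# produce sample batches faster than the learner consumes them.
impala_config = {
    "num_workers": 4,          # CPU actors generating experience
    "num_envs_per_worker": 5,  # vectorize envs so each worker samples faster
    "sample_batch_size": 50,   # rollout fragment each worker sends
    "train_batch_size": 500,   # batch size consumed by the GPU learner
    "num_gpus": 1,             # GPUs reserved for the learner
}

# Roughly, the learner needs train_batch_size / sample_batch_size worker
# fragments per SGD step; if workers can't keep that up, the GPU idles.
fragments_per_step = (impala_config["train_batch_size"]
                      // impala_config["sample_batch_size"])
```

If sampling is the bottleneck, raising `num_envs_per_worker` (or adding workers) usually moves the needle more than touching the learner settings.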
Source code / logs
Using FIFO scheduling algorithm.
Resources requested: 4/4 CPUs, 1/1 GPUs, 0.0/8.94 GiB heap, 0.0/3.08 GiB objects
Memory usage on this node: 8.1/15.6 GiB
Result logdir: /home/ankdesh/ray_results/pong-impala-fast
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
- IMPALA_PongNoFrameskip-v4_0: RUNNING, [4 CPUs, 1 GPUs], [pid=15082], 10 s, 1 iter, 15000 ts, -20.6 rew
nvidia-smi
ankdesh@6f012718bda7:~$ nvidia-smi
Wed Oct 2 07:11:55 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 On | 00000000:01:00.0 On | N/A |
| 0% 47C P8 15W / 200W | 515MiB / 8118MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Issue Analytics
- Created: 4 years ago
- Comments: 5
Top GitHub Comments
@ankdesh Hey, I did some digging into this and read through the original Ape-X paper. In their setup, the workers run on CPU and only a single learner network runs on the GPU. If that is how it is implemented in RLlib, that may be why we see low GPU usage. I haven't gone through the Ape-X code in RLlib, but I think that's the case.
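To make that resource split concrete, here is a hedged sketch of what it would look like in an RLlib-style config. The key names follow RLlib's common conventions, but this is an illustration of the CPU-workers/GPU-learner layout the paper describes, not the exact defaults shipped with Ray:

```python
# Sketch of the Ape-X resource split described in the paper: many CPU-only
# rollout actors feed a single GPU learner. Aggregate GPU utilization then
# reflects only the learner's SGD throughput, not environment stepping.
apex_config = {
    "num_workers": 8,          # rollout actors, CPU only
    "num_gpus_per_worker": 0,  # workers do inference on CPU
    "num_gpus": 1,             # the single learner holds the GPU
    "train_batch_size": 512,   # learner SGD batch pulled from replay
}

# Only the learner process touches the GPU under this layout.
gpu_consumers = 1 if apex_config["num_gpus"] > 0 else 0
```

Under this layout, low whole-machine GPU utilization is plausible whenever environment stepping or replay sampling, not the learner's SGD, is the bottleneck.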
@ankdesh I am facing the same problem. I also ran the tuned example
pong-speedrun/pong-impala-fast.yaml
which should finish in under 7 minutes on 32 CPUs. I ran it for 1 hour on 10 CPUs and it still did not move from a negative reward. My GPU usage is also under 8%.
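One caveat when quoting utilization figures: a single glance at nvidia-smi can be misleading, because the learner's SGD steps are bursty. A small sketch of averaging utilization over a window, using nvidia-smi's real `--query-gpu` interface (the sampling interval and the parsing helper are my own choices, not part of any Ray tooling):

```python
import subprocess
import time

def parse_utilization(csv_output: str) -> list:
    """Parse lines like '37 %' as produced by
    nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader."""
    values = []
    for line in csv_output.strip().splitlines():
        values.append(int(line.replace("%", "").strip()))
    return values

def average_gpu_utilization(samples: int = 10, interval_s: float = 1.0) -> float:
    """Poll nvidia-smi and average utilization across samples and all GPUs."""
    readings = []
    for _ in range(samples):
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader"],
            text=True)
        readings.extend(parse_utilization(out))
        time.sleep(interval_s)
    return sum(readings) / len(readings)
```

Running `average_gpu_utilization()` alongside training gives a steadier number to compare against the <8% figure than eyeballing the nvidia-smi table.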