Enabling GPU compromises learning performance
I've run several trials with TF Agents and found that enabling the GPU through the use_gpu configuration flag stalls or prevents task convergence. Any help troubleshooting this would be appreciated.
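For context, here is a minimal sketch of what I assume the flag amounts to, namely toggling TF 1.x device placement for the network. The actual wiring inside TF Agents may differ, and the layer sizes and names here are purely illustrative:

```python
# Hedged sketch: assuming use_gpu ultimately just controls where the graph is
# placed, roughly like this (TF 1.x). Layer sizes and names are illustrative.
import tensorflow as tf

def build_value_head(observations, use_gpu):
    device = '/gpu:0' if use_gpu else '/cpu:0'
    with tf.device(device):
        hidden = tf.layers.dense(observations, 64, activation=tf.nn.relu)
        value = tf.layers.dense(hidden, 1)
    return value
```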
The problem seems to exist in all environments but is most prominent in the pendulum task. (Plots below.)
With GPU (4 runs): [learning-curve plot]
Without GPU (4 runs): [learning-curve plot]
These runs were generated with a fresh clone of the TF Agents repo as of this morning, but previous versions showed similar results. The only difference between the two graphs is the use of the GPU.
Training is also about 3x slower with the GPU on the pendulum task, but I suspect that's due to the relatively small size of the network versus the cost of transferring data to the GPU.
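A quick way to see which ops actually land on the GPU (and therefore how much data has to cross the host-device boundary) is standard TF 1.x placement logging; this is generic TensorFlow, not anything TF Agents exposes:

```python
import tensorflow as tf

# Print the device assignment of every op; soft placement falls back to the CPU
# for ops without a GPU kernel instead of raising an error.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    pass  # build and run the training graph as usual with this session config
```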
Also tested with:
- CUDNN 6 and 5
- TensorFlow 1.3.0 and 1.2.1
- Both the tensorflow and tensorflow-gpu packages (no apparent difference between the two for CPU-only runs; a quick device check is sketched below)
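To double-check which devices each package actually exposes, I used the following generic TensorFlow check (nothing TF-Agents-specific):

```python
# Lists the devices TensorFlow can see: CPU-only builds should show only the
# CPU, while tensorflow-gpu should additionally list the GPU.
from tensorflow.python.client import device_lib

print([d.name for d in device_lib.list_local_devices()])
```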
(This issue may be related to #8?)
cc @danijar
EDIT: GPU run logs: https://gist.github.com/jimfleming/0a163522f02ef9411a5b478099321497 CPU-only run logs: https://gist.github.com/jimfleming/e1eaafb720ee1ee969ea2f4a879ab17b
@alexpashevich Thanks for the pull request; I just merged it.
@jimfleming I’m closing this issue now. Please confirm if the changes solve your problem and re-open this issue if not.
I can confirm the same problem with tf-nightly-gpu 1.5.0. It looks like, when the network is allocated on the GPU, the weights are not updated immediately after the gradients are applied.
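For anyone who wants to check this outside of TF Agents, here is a minimal TF 1.x sketch of the kind of test I have in mind: apply one gradient step with the graph pinned to the GPU, then read the variable back and compare against the value the update should produce. The loss and learning rate are arbitrary illustrative choices, not the agents code itself:

```python
import tensorflow as tf

# Minimal check: does a variable read after one optimizer step reflect the
# update when the graph is pinned to the GPU? (Illustrative values only.)
use_gpu = True

with tf.Graph().as_default():
    with tf.device('/gpu:0' if use_gpu else '/cpu:0'):
        w = tf.Variable(1.0, name='w')
        loss = tf.square(w)  # d(loss)/dw = 2 * w
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

    config = tf.ConfigProto(allow_soft_placement=True)
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(train_op)
        # With w0 = 1.0 and lr = 0.1, the step gives w1 = 1.0 - 0.1 * 2.0 = 0.8.
        print('w after one step:', sess.run(w))  # expect 0.8 on CPU and GPU alike
```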