
Enabling GPU compromises learning performance

See original GitHub issue

I’ve run several trials with TF Agents and found that enabling the GPU through the use_gpu configuration flag stalls or inhibits task convergence. Any help troubleshooting this would be appreciated.
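
For reference, use_gpu presumably just controls where the graph is placed. A rough TF 1.x sketch of that kind of device pinning (the flag name comes from the repo config; the network shape below is illustrative, not the actual agents architecture):

```python
import tensorflow as tf

# Illustrative placement sketch: use_gpu pins graph construction to a device.
use_gpu = True  # the flag toggled between the two sets of runs below
device = '/gpu:0' if use_gpu else '/cpu:0'

with tf.device(device):
    observations = tf.placeholder(tf.float32, [None, 3])    # pendulum obs
    hidden = tf.layers.dense(observations, 64, tf.nn.relu)  # made-up sizes
    value = tf.layers.dense(hidden, 1)
```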

The problem seems to exist in all environments but is most prominent in the pendulum task. (Plots below.)

With GPU (4 runs):

[screenshot: convergence plot with GPU, 2017-10-03 2:18 PM]

Without GPU (4 runs):

[screenshot: convergence plot without GPU, 2017-10-03 2:20 PM]

These runs were generated with a fresh clone of the TF Agents repo as of this morning, but previous versions showed similar results. The only difference between the two graphs is the use of the GPU.

Using the GPU is also about 3x slower on the pendulum task, but I suspect that’s due to the relatively small size of the network versus the cost of transferring data to the GPU.
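
If transfer overhead is the suspect, one way to check is to dump a per-step trace and look for host-to-device copies. A minimal TF 1.x sketch using a stand-in graph (substitute the actual TF Agents train op):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Stand-in graph; replace with the actual TF Agents train op.
x = tf.random_normal([64, 3])
w = tf.Variable(tf.zeros([3, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op, options=run_options, run_metadata=run_metadata)
    # Open timeline.json in chrome://tracing to see host<->device copies.
    trace = timeline.Timeline(step_stats=run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(trace.generate_chrome_trace_format())
```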

Also tested with:

  • cuDNN 6 and 5
  • TensorFlow 1.3.0 and 1.2.1
  • Both the tensorflow and tensorflow-gpu packages (no apparent difference between the two on CPU; see the device-placement check sketched below)
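
A quick way to confirm what each package is actually doing is to log device placement. A minimal sketch using the standard TF 1.x API (the graph is a stand-in):

```python
import tensorflow as tf

# Log where each op lands to confirm whether the GPU is actually used.
a = tf.random_normal([1000, 1000])
b = tf.matmul(a, a)

config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(b)  # the placement of each op is printed to stderr
```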

(This issue may be related to #8?)

cc @danijar

EDIT:

  • GPU run logs: https://gist.github.com/jimfleming/0a163522f02ef9411a5b478099321497
  • CPU-only run logs: https://gist.github.com/jimfleming/e1eaafb720ee1ee969ea2f4a879ab17b

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 1
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
danijar commented, Dec 20, 2017

@alexpashevich Thanks for the pull request; I just merged it.

@jimfleming I’m closing this issue now. Please confirm if the changes solve your problem and re-open this issue if not.

1 reaction
alexpashevich commented, Nov 13, 2017

I can confirm the same problem using tf-nightly-gpu=1.5.0. It looks like when the network is allocated on the GPU, its weights are not updated immediately after the gradients are applied.
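
For reference, a minimal TF 1.x sketch of the kind of read-after-write ordering problem described above (variable names are illustrative; this is not the fix that was merged):

```python
import tensorflow as tf

# Illustrative sketch of the suspected race: if the weights are read in the
# same run call that applies the gradients, the read may see stale values
# unless an explicit ordering is imposed.
w = tf.Variable(1.0)
loss = tf.square(w - 3.0)
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# Force the read to happen only after the update has been applied.
with tf.control_dependencies([train_op]):
    updated_w = w.read_value()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(updated_w))  # reflects the post-update value
```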


