
Non-deterministic behaviour when run on GPU

See original GitHub issue

The following commit https://github.com/openai/baselines/commit/9fa8e1baf1d1f975b87b369a8082122eac812eb1#diff-fc3e1c3522d2c7871bda86ed40bcb0ddL28 introduced non-deterministic behavior in PPO1 when run on GPU, even with tf.set_random_seed set (CPU behavior is deterministic). Specifically, at line 28 and elsewhere in mlp_policy.py, replacing

U.dense(last_out, hid_size, name='fc%i'%(i+1), weight_init=U.normc_initializer(1.0))

with

tf.layers.dense(last_out, hid_size, name='fc%i'%(i+1), kernel_initializer=U.normc_initializer(1.0))

created this behavior. Below are 4 runs of the MuJoCo Swimmer-v2 environment with the same random seed, using PPO1 from the latest version of the baselines code (swimmer_same_seed_new_code.pdf).
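
For reference, here is a minimal sketch of the kind of seeding referred to above. The issue does not include the run script, so this is a hypothetical setup, not the author's code:

import random
import numpy as np
import tensorflow as tf

SEED = 0
random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)  # TF 1.x graph-level seed; GPU kernels may still reorder float ops

Even with seeding like this in place, the GPU runs diverge, which is what the plots above show.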

Replacing all instances of tf.layers.dense with U.dense, and adding the corresponding code

def dense(x, size, name, weight_init=None, bias=True):
    # Plain fully connected layer: creates "<name>/w" (and optionally "<name>/b")
    # and returns x @ w (+ b), mirroring the helper that was removed from tf_util.py.
    w = tf.get_variable(name + "/w", [x.get_shape()[1], size], initializer=weight_init)
    ret = tf.matmul(x, w)
    if bias:
        b = tf.get_variable(name + "/b", [size], initializer=tf.zeros_initializer())
        return ret + b
    else:
        return ret

back to tf_util.py fixes the issue. Below is a figure with 4 Swimmer runs after this change (swimmer_same_seed_old_code.pdf). All experiments were run with tensorflow-gpu==1.12.0, cudatoolkit==9.2, and cudnn==7.3.1.
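
As a quick sanity check, here is a minimal usage sketch for the dense helper above, written against the TF 1.x API to match the reported tensorflow-gpu==1.12.0. The layer size, input shape, and initializer are illustrative and not taken from mlp_policy.py:

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 8])  # stands in for last_out
h = dense(x, 64, name='fc1',
          weight_init=tf.random_normal_initializer(stddev=0.01))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(h, feed_dict={x: np.zeros((2, 8), dtype=np.float32)})
    print(out.shape)  # (2, 64)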

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

3 reactions
brett-daley commented, Mar 3, 2019

GPU calculations are non-deterministic because the thread scheduling is non-deterministic. Floating-point errors are accumulated in unpredictable ways for operations that are not associative – a consequence of the GPU hardware itself, not TensorFlow.

This same phenomenon would occur on a multi-core CPU too, but I believe TensorFlow typically does not parallelize the operations that would lose determinism when running on a CPU, because the performance cost of keeping them serial is minimal. This is why your CPU output is deterministic.
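
To illustrate the non-associativity point with a minimal, hypothetical example (not from the comment): summing the same float32 values in two different orders already gives slightly different totals on a CPU, and a parallel GPU reduction effectively changes that order from run to run.

import numpy as np

values = np.random.RandomState(0).randn(1000000).astype(np.float32)

forward = np.float32(0.0)
for v in values:           # accumulate left to right
    forward += v

backward = np.float32(0.0)
for v in values[::-1]:     # same values, opposite order
    backward += v

print(forward == backward)                    # usually False for float32
print(abs(float(forward) - float(backward)))  # small but nonzero difference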

You can read these links for more info:

1 reaction
brett-daley commented, Apr 4, 2019

I think you should open a pull request with those changes (I can do it if you want). The owners can merge it if they approve it.
