Non-deterministic behaviour when run on GPU

The following commit https://github.com/openai/baselines/commit/9fa8e1baf1d1f975b87b369a8082122eac812eb1#diff-fc3e1c3522d2c7871bda86ed40bcb0ddL28 introduced non-deterministic behavior in PPO1 when run on a GPU, even after setting tf.set_random_seed (CPU behavior is deterministic). Specifically, at line 28 and elsewhere in mlp_policy.py, replacing

U.dense(last_out, hid_size, name='fc%i'%(i+1), weight_init=U.normc_initializer(1.0))

with

tf.layers.dense(last_out, hid_size, name='fc%i'%(i+1), kernel_initializer=U.normc_initializer(1.0))

caused this behavior. Below are 4 runs of the MuJoCo Swimmer-v2 environment with the same random seed, using PPO1 in the latest version of the baselines code: swimmer_same_seed_new_code.pdf

Replacing all instances of tf.layers.dense with U.dense, and adding the corresponding code

import tensorflow as tf  # TensorFlow 1.x API

def dense(x, size, name, weight_init=None, bias=True):
    # Weight matrix of shape [input_dim, size]
    w = tf.get_variable(name + "/w", [x.get_shape()[1], size], initializer=weight_init)
    ret = tf.matmul(x, w)
    if bias:
        # Bias vector, initialized to zeros
        b = tf.get_variable(name + "/b", [size], initializer=tf.zeros_initializer())
        return ret + b
    else:
        return ret

back to tf_utils.py fixes the issue. Below is a figure with 4 Swimmer runs after this change: swimmer_same_seed_old_code.pdf

All experiments were run using tensorflow-gpu==1.12.0, cudatoolkit==9.2, and cudnn==7.3.1.
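Rather than comparing the plots by eye, one way to check whether same-seed runs agree is to compare their logged returns element by element. The following is a minimal sketch, assuming each run has been exported as a list of per-iteration returns. The function name first_divergence and the sample numbers are hypothetical and are not taken from the attached PDFs.

```python
def first_divergence(runs, tol=0.0):
    """Return the first step at which any run differs from run 0 by more
    than tol, or None if all runs agree over their common length."""
    common_length = min(len(run) for run in runs)
    reference = runs[0]
    for step in range(common_length):
        for other in runs[1:]:
            if abs(other[step] - reference[step]) > tol:
                return step
    return None


# Hypothetical per-iteration returns for runs that share one seed.
deterministic_runs = [[1.0, 2.5, 4.0], [1.0, 2.5, 4.0]]
drifting_runs = [[1.0, 2.5, 4.0], [1.0, 2.5000001, 3.7]]

print(first_divergence(deterministic_runs))       # None
print(first_divergence(drifting_runs))            # 1
print(first_divergence(drifting_runs, tol=1e-3))  # 2
```

With tol=0.0 the check demands bitwise-equal values, which is what a deterministic run should produce. A small positive tol instead reports the point at which the runs have drifted apart by a meaningful amount.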

Issue Analytics

Created 5 years ago
Comments: 11 (6 by maintainers)

Top GitHub Comments

3 reactions

brett-daley commented, Mar 3, 2019

GPU calculations are non-deterministic because the thread scheduling is non-deterministic. Floating-point errors accumulate in unpredictable ways for operations that are not associative, such as floating-point addition, so the result depends on the order in which partial results are combined. This is a consequence of the GPU hardware itself, not TensorFlow.

The same phenomenon could occur on a multi-core CPU too, but I believe TensorFlow typically does not parallelize operations that would lose determinism on a CPU, because the performance cost of keeping them sequential is small. This is why your CPU output is deterministic.
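The order dependence described above can be reproduced without a GPU. The following is a minimal pure-Python sketch in which trying every summation order stands in for non-deterministic thread scheduling. That substitution is an assumption made for illustration and does not reflect how CUDA or TensorFlow actually schedule reductions.

```python
from itertools import permutations


def accumulate(values):
    """Sum values strictly left to right, as a single sequential thread would."""
    total = 0.0
    for value in values:
        total += value
    return total


# Grouping changes the rounding, so floating-point addition is not associative.
left_grouped = (0.1 + 0.2) + 0.3
right_grouped = 0.1 + (0.2 + 0.3)
print(left_grouped == right_grouped)  # False

# The exact sum of these three numbers is 1.0, but 1e16 + 1.0 rounds back to
# 1e16 in double precision, so some orderings lose the 1.0 entirely.
values = [1e16, 1.0, -1e16]
results = {accumulate(order) for order in permutations(values)}
print(sorted(results))  # [0.0, 1.0]

# A fixed order always gives the same answer, which mirrors the CPU case.
print(accumulate(values) == accumulate(values))  # True
```

In training, discrepancies of this kind start far smaller, but they feed back into every later gradient step, which is how runs with the same seed can end up on visibly different curves.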

You can read these links for more info:

1 reaction

brett-daley commented, Apr 4, 2019

I think you should open a pull request with those changes (I can do it if you want). The owners can merge it if they approve it.
