Memory requirements

Hello, I am attempting to run this code:

python3 experiment.py --settings_file test

But I am running out of memory (OOM error):

2017-12-09 23:17:18.540786: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ***************************************************************************************************x
2017-12-09 23:17:18.540796: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[3988,3988]
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3988,3988]
	 [[Node: mul_790 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Neg_102, add_467)]]
	 [[Node: truediv_233/_165 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_216_truediv_233", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "experiment.py", line 221, in <module>
    mmd2, that_np = sess.run(mix_rbf_mmd2_and_ratio(eval_test_real, eval_test_sample,biased=False, sigmas=sigma))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3988,3988]
	 [[Node: mul_790 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Neg_102, add_467)]]
	 [[Node: truediv_233/_165 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_216_truediv_233", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'mul_790', defined at:
  File "experiment.py", line 221, in <module>
    mmd2, that_np = sess.run(mix_rbf_mmd2_and_ratio(eval_test_real, eval_test_sample,biased=False, sigmas=sigma))
  File "/home/jchook/dev/RGAN/mmd.py", line 71, in mix_rbf_mmd2_and_ratio
    K_XX, K_XY, K_YY, d = _mix_rbf_kernel(X, Y, sigmas, wts)
  File "/home/jchook/dev/RGAN/mmd.py", line 52, in _mix_rbf_kernel
    K_YY += wt * tf.exp(-gamma * (-2 * YY + c(Y_sqnorms) + r(Y_sqnorms)))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 894, in binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 1117, in _mul_dispatch
    return gen_math_ops._mul(x, y, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 2726, in _mul
    "Mul", x=x, y=y, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3988,3988]
	 [[Node: mul_790 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Neg_102, add_467)]]
	 [[Node: truediv_233/_165 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_216_truediv_233", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

What are the minimum GPU memory requirements?
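
For reference, one generic thing to try before touching the code (this is a standard TensorFlow 1.x session option, not something specific to RGAN) is to stop TensorFlow from pre-allocating the whole GPU and let it grow memory on demand, or cap the fraction it may use. It will not help if the graph genuinely needs more memory than the card has, but it rules out plain over-allocation:

import tensorflow as tf

# Let TensorFlow allocate GPU memory on demand instead of grabbing
# the whole card up front; optionally cap the fraction it may use.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# config.gpu_options.per_process_gpu_memory_fraction = 0.8
sess = tf.Session(config=config)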

Top GitHub Comments

1 reaction
corcra commented, Jan 24, 2018

The MMD score is only used for evaluation, so it shouldn’t affect training.

The main way it might affect you is that we use the MMD score (on the validation set) to decide when to save model parameters (https://github.com/ratschlab/RGAN/blob/master/experiment.py#L227), so without it you will default to the normal frequency, which is every 50 epochs (https://github.com/ratschlab/RGAN/blob/master/experiment.py#L273).
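
To make the two policies concrete, here is a minimal, self-contained sketch of the saving logic described above (the names are illustrative, not the ones used in experiment.py):

def should_save(epoch, mmd2, best_mmd2, use_mmd, save_every=50):
    if use_mmd:
        # save only when the validation MMD^2 improves
        return mmd2 is not None and mmd2 < best_mmd2
    # without the MMD score, fall back to the fixed schedule
    return epoch % save_every == 0

# With MMD disabled, checkpoints would land at epochs 0, 50, 100, ...
print([e for e in range(151) if should_save(e, None, float("inf"), use_mmd=False)])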

1 reaction
corcra commented, Jan 22, 2018

You could also vary the size of the set used in evaluation (which gets fed into the MMD calculation); it is set on this line: https://github.com/ratschlab/RGAN/blob/master/experiment.py#L75. batch_multiplier is how many batches' worth of data we want to include in the evaluation set.

The problem with reducing the evaluation set size is that it reduces the accuracy of the MMD calculation, but depending on your use case that may be an acceptable price to pay for the code actually running on your hardware. (I’m assuming based on your error log that the OOM is happening due to the MMD calculation, which is quadratic in the number of samples.)
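
To illustrate why the cost is quadratic, here is a small NumPy sketch of a single-bandwidth RBF MMD^2 (a simplified stand-in for the multi-bandwidth version in mmd.py, not the repository's exact code). Each pairwise kernel matrix is n x n for n evaluation samples, so halving the evaluation set roughly quarters the memory needed for these intermediates:

import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    # Biased RBF-MMD^2 estimate. The three pairwise kernel matrices are
    # each (n, n), so memory grows as O(n^2) -- these are the 3988x3988
    # tensors in the error log above.
    gamma = 1.0 / (2.0 * sigma ** 2)
    XX, YY, XY = X @ X.T, Y @ Y.T, X @ Y.T
    X_sq = np.diag(XX)[:, None]
    Y_sq = np.diag(YY)[:, None]
    K_XX = np.exp(-gamma * (X_sq + X_sq.T - 2 * XX))
    K_YY = np.exp(-gamma * (Y_sq + Y_sq.T - 2 * YY))
    K_XY = np.exp(-gamma * (X_sq + Y_sq.T - 2 * XY))
    return K_XX.mean() + K_YY.mean() - 2 * K_XY.mean()

n = 3988                                  # evaluation-set size from the log
print(n * n * 4 / 1e6, "MB per float32 kernel matrix")   # ~63.6 MB each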
