Resource exhausted: OOM when allocating tensor with shape[256,1114] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Hello,
I'm running the latest version of MusicVAE from the repository on Ubuntu 18.04, CUDA 10.1, TensorFlow 2.2.0, with the hier-trio_16bar configuration, and I get the error below (I tried different batch sizes, even 1, and different learning rates, but the problem is the same). Do you know how to fix it?
2020-06-09 21:01:27.365621: I tensorflow/core/common_runtime/bfc_allocator.cc:1010] Stats: Limit: 14684815360 InUse: 14684616704 MaxInUse: 14684815360 NumAllocs: 26588 MaxAllocSize: 181403648
2020-06-09 21:01:27.365991: W tensorflow/core/common_runtime/bfc_allocator.cc:439] ****************************************************************************************************
2020-06-09 21:01:27.366026: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at lstm_ops.cc:372 : Resource exhausted: OOM when allocating tensor with shape[256,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2020-06-09 21:01:34.147376: W tensorflow/core/kernels/data/cache_dataset_ops.cc:794] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to dataset.cache().take(k).repeat(). You should use dataset.take(k).cache().repeat() instead.
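(Side note on the warning above: it concerns the ordering of tf.data transformations, not the OOM itself. Below is a minimal sketch of the two orderings, using a toy dataset and an illustrative k in place of the real MusicVAE input pipeline.)

    import tensorflow as tf

    dataset = tf.data.Dataset.range(100)  # stand-in for the real input pipeline
    k = 10

    # Ordering the warning complains about: cache() runs before take(k), so the
    # cache covers the full dataset but is only ever partially read and gets
    # discarded each time the iterator is recreated.
    bad = dataset.cache().take(k).repeat()

    # Recommended ordering: take(k) first, then cache() just those k elements,
    # then repeat() the cached slice.
    good = dataset.take(k).cache().repeat()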
Traceback (most recent call last):
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[256,1114] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node swap_in_core_decoder_1/core_decoder_0/decoder/while/BasicDecoderStep/decoder/multi_rnn_cell/cell_0/lstm_cell/LSTMBlockCell_13_0}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[add/_2901]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[256,1114] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node swap_in_core_decoder_1/core_decoder_0/decoder/while/BasicDecoderStep/decoder/multi_rnn_cell/cell_0/lstm_cell/LSTMBlockCell_13_0}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations. 0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "music_vae_train.py", line 340, in <module>
    console_entry_point()
  File "music_vae_train.py", line 336, in console_entry_point
    tf.app.run(main)
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "music_vae_train.py", line 331, in main
    run(configs.CONFIG_MAP)
  File "music_vae_train.py", line 312, in run
    task=FLAGS.task)
  File "music_vae_train.py", line 211, in train
    is_chief=is_chief)
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/tf_slim/training/training.py", line 551, in train
    loss = session.run(train_op, run_metadata=run_metadata)
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 778, in run
    run_metadata=run_metadata)
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1283, in run
    run_metadata=run_metadata)
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1384, in run
    raise six.reraise(*original_exc_info)
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1369, in run
    return self._sess.run(*args, **kwargs)
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1442, in run
    run_metadata=run_metadata)
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1200, in run
    return self._sess.run(*args, **kwargs)
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 958, in run
    run_metadata_ptr)
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1181, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/burashnikova/env-tf22/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[256,1114] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node swap_in_core_decoder_1/core_decoder_0/decoder/while/BasicDecoderStep/decoder/multi_rnn_cell/cell_0/lstm_cell/LSTMBlockCell_13_0}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[add/_2901]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[256,1114] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node swap_in_core_decoder_1/core_decoder_0/decoder/while/BasicDecoderStep/decoder/multi_rnn_cell/cell_0/lstm_cell/LSTMBlockCell_13_0}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations. 0 derived errors ignored.
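(About the repeated hint: report_tensor_allocations_upon_oom is a field of the TF1-style RunOptions proto that Session.run accepts. A minimal, self-contained sketch of passing it is below; the graph is a hypothetical stand-in, not the MusicVAE train_op, and in music_vae_train.py the session.run call actually lives inside tf_slim's training loop.)

    import tensorflow.compat.v1 as tf

    tf.disable_v2_behavior()

    # Hypothetical stand-in graph; replace with the real training op.
    x = tf.random.normal([256, 1024])
    y = tf.matmul(x, x, transpose_b=True)

    # Ask TF to dump the list of live tensor allocations if an OOM occurs.
    run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

    with tf.Session() as sess:
        result = sess.run(y, options=run_options)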
Top GitHub Comments
This means your GPU does not have enough memory to support the model + batch size you're using. Try reducing the batch size until it fits.
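For MusicVAE, one way to do that in code is a sketch along these lines, assuming the hier-trio_16bar config from the issue; the value 32 is only an example to keep halving until training fits:

    from magenta.models.music_vae import configs

    # Look up the config used in the issue and override its batch size.
    # 32 is an illustrative value; keep reducing it until the OOM goes away.
    config = configs.CONFIG_MAP['hier-trio_16bar']
    config.hparams.parse('batch_size=32')

    print(config.hparams.batch_size)  # -> 32

The same kind of override can normally be passed on the command line through music_vae_train.py's --hparams flag (e.g. --hparams=batch_size=32) instead of editing the config in code.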
I understand. Training such models is often a test of patience. Either get a stronger GPU (or several) or be patient 🤗