cuda failed to allocate errors
When running a training script using the new memory allocation backend (https://github.com/google/jax/issues/417), I see a bunch of non-fatal errors like this:
[1] 2019-05-29 23:55:55.555823: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 528.00M (553648128 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.581962: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/cudnn_conv_algorithm_picker.cc:525] Resource exhausted: Failed to allocate request for 528.00MiB (553648128B) on device ordinal 0
[7] 2019-05-29 23:55:55.594693: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 528.00M (553648128 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[7] 2019-05-29 23:55:55.606314: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/cudnn_conv_algorithm_picker.cc:525] Resource exhausted: Failed to allocate request for 528.00MiB (553648128B) on device ordinal 0
[1] 2019-05-29 23:55:55.633261: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 1.14G (1224736768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.635169: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 1.05G (1132822528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.646031: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 561.11M (588365824 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.647926: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 592.04M (620793856 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[7] 2019-05-29 23:55:55.655470: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 1.14G (1224736768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Is this a known issue? The errors go away when using XLA_PYTHON_CLIENT_ALLOCATOR=platform.
@christopherhesse if you update to the latest jaxlib (0.1.20, currently only available on Linux; let me know if you need the Mac build), you should see fewer OOM messages. (https://github.com/tensorflow/tensorflow/commit/701f7e5a24590206c1ff32a50852f6cd040df1af reduces the amount of GPU memory needed in your script, and https://github.com/tensorflow/tensorflow/commit/84e3ae12ba9d6c64efae7884776810825bf82989 suppresses some spurious OOM log messages.) Give it a try?
There’s another issue that I haven’t addressed yet, which is that https://github.com/tensorflow/tensorflow/commit/805b7ccc2ec86c9dd59fa3550c57109a4a71c0d3 reduces GPU memory utilization (with the upshot that jax no longer allocates all your GPU memory up-front). I noticed that this makes your script OOM sooner than it did prior to that change. This is harder to fix; I might just add a toggle to re-enable the old behavior for now. I’ll file a separate issue for this once I can better quantify how much worse the utilization is.
I ended up making it a WARNING, since it can have a significant performance impact. The change is committed to XLA in https://github.com/tensorflow/tensorflow/commit/1423eab5e000c304f332c2a2a322bee76ca3fdfa and will be included in the next jaxlib.
@mgbukov the error is referring to GPU memory and GPU convolution algorithms, so you won’t see it on CPU. You might also try the techniques for reducing GPU memory usage as described in https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html.
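As a rough sketch of the knobs that page documents (the fraction value below is illustrative; check the linked docs for current defaults and caveats, and pick only one of these options at a time):

import os

# Option 1: skip the up-front preallocation of most of the GPU's memory,
# so memory is grabbed as needed instead.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"

# Option 2: keep preallocation but cap the fraction of GPU memory reserved.
# os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.5"

# Option 3: allocate and free exactly what is needed, on demand
# (slow, but useful when debugging OOMs).
# os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"

import jax  # variables must be set before jax initializes its GPU backend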