cuda failed to allocate errors
When running a training script using the new memory allocation backend (https://github.com/google/jax/issues/417), I see a bunch of non-fatal errors like this:
[1] 2019-05-29 23:55:55.555823: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 528.00M (553648128 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.581962: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/cudnn_conv_algorithm_picker.cc:525] Resource exhausted: Failed to allocate request for 528.00MiB (553648128B) on device ordinal 0
[7] 2019-05-29 23:55:55.594693: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 528.00M (553648128 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[7] 2019-05-29 23:55:55.606314: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/cudnn_conv_algorithm_picker.cc:525] Resource exhausted: Failed to allocate request for 528.00MiB (553648128B) on device ordinal 0
[1] 2019-05-29 23:55:55.633261: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 1.14G (1224736768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.635169: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 1.05G (1132822528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.646031: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 561.11M (588365824 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.647926: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 592.04M (620793856 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[7] 2019-05-29 23:55:55.655470: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 1.14G (1224736768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Is this a known issue? The errors go away when using XLA_PYTHON_CLIENT_ALLOCATOR=platform.
@christopherhesse if you update to the latest jaxlib (0.1.20, currently only available on Linux; let me know if you need the Mac build), you should see fewer OOM messages. (https://github.com/tensorflow/tensorflow/commit/701f7e5a24590206c1ff32a50852f6cd040df1af reduces the amount of GPU memory needed in your script, and https://github.com/tensorflow/tensorflow/commit/84e3ae12ba9d6c64efae7884776810825bf82989 suppresses some spurious OOM log messages.) Give it a try?
There’s another issue that I haven’t addressed yet, which is that https://github.com/tensorflow/tensorflow/commit/805b7ccc2ec86c9dd59fa3550c57109a4a71c0d3 reduces GPU memory utilization (with the upshot that jax no longer allocates all your GPU memory up-front). I noticed that this makes your script OOM sooner than it did prior to that change. This is harder to fix; I might just add a toggle to re-enable the old behavior for now. I’ll file a separate issue for this once I can better quantify how much worse the utilization is.
I ended up making it a WARNING, since it can have a significant performance impact. The change is committed to XLA in https://github.com/tensorflow/tensorflow/commit/1423eab5e000c304f332c2a2a322bee76ca3fdfa and will be included in the next jaxlib.
@mgbukov the error is referring to GPU memory and GPU convolution algorithms, so you won’t see it on CPU. You might also try the techniques for reducing GPU memory usage as described in https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html.
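As a rough sketch of the knobs that page documents (the fraction value below is illustrative; check the linked docs for current defaults and caveats, and pick only one of these options at a time):

import os

# Option 1: skip the up-front preallocation of most of the GPU's memory,
# so memory is grabbed as needed instead.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"

# Option 2: keep preallocation but cap the fraction of GPU memory reserved.
# os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.5"

# Option 3: allocate and free exactly what is needed, on demand
# (slow, but useful when debugging OOMs).
# os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"

import jax  # variables must be set before jax initializes its GPU backend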