
cuda failed to allocate errors

See original GitHub issue

When running a training script using the new memory allocation backend (https://github.com/google/jax/issues/417), I see a bunch of non-fatal errors like this:

[1] 2019-05-29 23:55:55.555823: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 528.00M (553648128 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.581962: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/cudnn_conv_algorithm_picker.cc:525] Resource exhausted: Failed to allocate request for 528.00MiB (553648128B) on device ordinal 0
[7] 2019-05-29 23:55:55.594693: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 528.00M (553648128 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[7] 2019-05-29 23:55:55.606314: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/cudnn_conv_algorithm_picker.cc:525] Resource exhausted: Failed to allocate request for 528.00MiB (553648128B) on device ordinal 0
[1] 2019-05-29 23:55:55.633261: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 1.14G (1224736768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.635169: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 1.05G (1132822528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.646031: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 561.11M (588365824 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1] 2019-05-29 23:55:55.647926: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 592.04M (620793856 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[7] 2019-05-29 23:55:55.655470: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 1.14G (1224736768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

Is this a known issue? The errors go away when using XLA_PYTHON_CLIENT_ALLOCATOR=platform.
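
For context, the workaround mentioned above has to take effect before JAX initializes its GPU backend. A minimal sketch, using the documented XLA_PYTHON_CLIENT_ALLOCATOR variable from the JAX docs; the matmul below is just a hypothetical smoke test, not the original training script:

import os

# Must be set before the first JAX GPU operation, or it has no effect.
# Note: the platform allocator is exact but slow; see the performance
# remark in skye's Nov 12 comment below.
os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"

import jax
import jax.numpy as jnp

x = jnp.ones((1024, 1024))
print(jax.devices())    # confirm which backend/devices came up
print((x @ x).sum())    # computations now go through the platform allocator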

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 4
  • Comments: 32 (7 by maintainers)

Top GitHub Comments

2 reactions
skye commented, Jun 13, 2019

@christopherhesse if you update to the latest jaxlib (0.1.20, currently Linux-only; let me know if you need the Mac build), you should see fewer OOM messages. (https://github.com/tensorflow/tensorflow/commit/701f7e5a24590206c1ff32a50852f6cd040df1af reduces the amount of GPU memory needed in your script, and https://github.com/tensorflow/tensorflow/commit/84e3ae12ba9d6c64efae7884776810825bf82989 suppresses some spurious OOM log messages.) Give it a try?

There’s another issue that I haven’t addressed yet: https://github.com/tensorflow/tensorflow/commit/805b7ccc2ec86c9dd59fa3550c57109a4a71c0d3 reduces GPU memory utilization (with the upshot that jax no longer allocates all your GPU memory up-front), and I noticed that this makes your script OOM sooner than it did prior to that change. This is harder to fix; I might just add a toggle to re-enable the old behavior for now. I’ll file a separate issue for this once I can better quantify how much worse the utilization is.
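
The preallocation behavior discussed here is controllable in current JAX releases through documented environment variables (see the gpu_memory_allocation page linked in the next comment). A minimal sketch, assuming those flags rather than whatever toggle ultimately landed:

import os

# Disable up-front preallocation of GPU memory entirely (documented JAX flag)...
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
# ...or keep preallocation but cap the fraction grabbed at startup:
# os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = ".50"

import jax
print(jax.devices())  # flags must be set before this first backend use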

1 reaction
skye commented, Nov 12, 2019

I ended up making it a WARNING, since it can have a significant performance impact. The change is committed to XLA in https://github.com/tensorflow/tensorflow/commit/1423eab5e000c304f332c2a2a322bee76ca3fdfa and will be included in the next jaxlib.

@mgbukov the error is referring to GPU memory and GPU convolution algorithms, so you won’t see it on CPU. You might also try the techniques for reducing GPU memory usage as described in https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html.
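
Beyond the allocator flags on that page, one concrete way to lower peak GPU memory in JAX code is buffer donation, which lets XLA reuse an input buffer for an output. donate_argnums is a real jax.jit parameter, but the update function below is a hypothetical illustration, not something from this issue:

from functools import partial

import jax
import jax.numpy as jnp

# Donating an argument tells XLA it may reuse that input's device buffer
# for the output, roughly halving peak memory for in-place-style updates.
@partial(jax.jit, donate_argnums=(0,))
def apply_grads(params, grads):
    return jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)

params = {"w": jnp.ones((512, 512))}
grads = {"w": jnp.full((512, 512), 0.1)}
params = apply_grads(params, grads)  # old params buffer is donated; don't reuse it

(On CPU, donation is ignored with a warning; the memory saving applies on GPU/TPU.)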

Read more comments on GitHub >

Top Results From Across the Web

How to solve memory allocation problem in cuda??
But for the first time, I am facing a difficulty with an error called “Fatal error. Memory allocation cannot be possible”.

CUDA out-of-mem error - Chaos Help Center
This error message indicates that a project is too complex to be cached in the GPU's memory. Each project contains a certain amount...

Unable to allocate cuda memory, when there is enough of ...
Can someone please explain this: RuntimeError: CUDA out of memory. Tried to allocate 350.00 MiB (GPU 0; 7.93 GiB total capacity; ...

Resolving CUDA Being Out of Memory With Gradient ...
So when you try to execute the training, and you don't have enough free CUDA memory available, then the framework you're using throws... (see the gradient-accumulation sketch after this list)

Solving "CUDA out of memory" Error - Kaggle
Solving "CUDA out of memory" Error · 1) Use this code to see memory usage (it requires internet to install package): · 2)...
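
The gradient-accumulation result above points at a general remedy: when a full batch does not fit in GPU memory, compute gradients over smaller microbatches and average them before applying the update. A minimal JAX sketch of the idea; every name, shape, and the loss function here are hypothetical illustrations, not taken from any of the linked posts:

import jax
import jax.numpy as jnp

# Hypothetical loss for illustration: linear regression on one microbatch.
def loss_fn(w, x, y):
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

@jax.jit
def accumulated_grads(w, xs, ys):
    # xs, ys are stacked microbatches of shape (n_micro, micro_batch, ...).
    # Processing one microbatch per scan step keeps peak activation memory
    # at the size of a single microbatch instead of the full batch.
    def step(acc, batch):
        x, y = batch
        g = jax.grad(loss_fn)(w, x, y)
        return jax.tree_util.tree_map(jnp.add, acc, g), None
    zeros = jax.tree_util.tree_map(jnp.zeros_like, w)
    total, _ = jax.lax.scan(step, zeros, (xs, ys))
    return jax.tree_util.tree_map(lambda g: g / xs.shape[0], total)

w = jnp.zeros((16,))
xs = jnp.ones((4, 32, 16))   # 4 microbatches of 32 examples each
ys = jnp.ones((4, 32))
g = accumulated_grads(w, xs, ys)

Peak memory then scales with the microbatch size rather than the full batch, at the cost of running the forward/backward pass once per microbatch.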
