
Out of memory error with NVIDIA K80 GPU

See original GitHub issue

I'm trying to create an image classifier with ~1000 training samples and 7 classes, but it throws a runtime error. Is there a way to reduce the batch size, or something else I can do to circumvent this?

The error is:

RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
/usr/lib/python3.5/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 2 leaked semaphores to clean up at shutdown
  len(cache))
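The usual first workaround is the one the question asks about: lowering the batch size used during the model search. Going by the API names in the comments below, this issue is against Auto-Keras, where the search batch size is a module-level constant. A minimal sketch, assuming the 0.2.x-era autokeras.constant.Constant layout (the exact import path is an assumption and may differ between releases):

    # Hedged sketch: shrink the largest batch size the search may use.
    # Assumes the 0.2.x-era Auto-Keras constant module mentioned in the
    # comments below; adjust the import to match your installed version.
    from autokeras.constant import Constant

    # Default is 128 (per the comment below); 32 leaves far more headroom
    # within the ~12 GB visible on one half of a K80.
    Constant.MAX_BATCH_SIZE = 32

Set this before constructing the classifier so the search picks it up.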

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 1
  • Comments: 23 (7 by maintainers)

Top GitHub Comments

3 reactions
sparkdoc commented, Aug 14, 2018

When I first ran this with about 550 128x128 grayscale images on a Quadro P4000 with 8 GB of memory, it immediately crashed due to insufficient memory. I adjusted the constant.MAX_BATCH_SIZE parameter from the default of 128 down to 32, and it then ran for about an hour before crashing again with:

RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

I was watching the GPU memory usage before it crashed, and it fluctuated in cycles, as expected for a “grid search” sort of activity. Unfortunately, it looks like the peak memory usage of the more memory-intensive models progressively increases until it overwhelms the GPU memory.
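For anyone who wants to log this instead of watching nvidia-smi, a small helper along these lines reports PyTorch's own view of current and peak tensor allocations between trials (the helper name is illustrative, not from the issue, and the calls assume a reasonably recent PyTorch):

    import torch

    def log_gpu_memory(tag, device=0):
        """Print current and peak GPU memory held by PyTorch tensors."""
        current = torch.cuda.memory_allocated(device) / 1024 ** 2
        peak = torch.cuda.max_memory_allocated(device) / 1024 ** 2
        print(f"[{tag}] allocated: {current:.0f} MiB, peak: {peak:.0f} MiB")

    # Example: call between candidate models, then reset the peak counter so
    # each model's high-water mark is reported separately.
    log_gpu_memory("after model 3")
    torch.cuda.reset_peak_memory_stats(0)

Note that these counters only track tensors PyTorch itself allocated; the CUDA context and cached-but-unused blocks are not included in memory_allocated.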

Maybe it would be good, upon initialization of the program, to quantify the available GPU memory and then cap the model search to models that fit within that limit. If the program determines that it cannot identify an optimal model within that constraint and would need more memory, it could report that and offer hints on how to proceed (e.g., smaller batches, smaller images, a GPU with more memory, etc.). It might also help to offer a grayscale option in the load_image_dataset method that reduces a color image from three color channels to one grayscale channel.
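A rough sketch of that pre-flight idea, using torch.cuda calls from recent PyTorch releases plus a PIL-based grayscale load; the function names and the 0.8 safety factor are illustrative, not anything Auto-Keras actually exposes:

    import torch
    from PIL import Image

    def free_gpu_memory_bytes(device=0):
        """Total device memory minus what PyTorch has already reserved."""
        total = torch.cuda.get_device_properties(device).total_memory
        reserved = torch.cuda.memory_reserved(device)
        return total - reserved

    def fits_on_gpu(estimated_model_bytes, device=0, safety_factor=0.8):
        """Skip candidate models whose estimated footprint exceeds the budget."""
        return estimated_model_bytes < safety_factor * free_gpu_memory_bytes(device)

    def load_grayscale(path):
        """Collapse three color channels to one, cutting input memory roughly 3x."""
        return Image.open(path).convert("L")

A search loop could call fits_on_gpu before training each candidate and either skip the model or emit the suggested hint (smaller batches, smaller images, a larger GPU) when nothing fits.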

Also, what is the LIMIT_MEMORY parameter?

1 reaction
haifeng-jin commented, Aug 24, 2018

This issue is fixed in the new release. Thank you all for the contribution.

Read more comments on GitHub >

Top Results From Across the Web

  • K80 crashed or wrong computation results on K80: but I get correct computation results when using GTX 680 while get K80 crashed (maybe memory violation) or obtain wrong computation from K80...
  • Memory allocation problem with multi-gpu (Tesla k80 ...): It seems unified memory access create problem if memory allocated on multi-gpu from two diffrenet processes which use different devices...
  • Plugging Tesla K80 results in PCI resource allocation error: Hi, I bought a Tesla K80 card and tried to integrate it into a workstation PC (of course with sufficient ventilation).
  • Tesla K80 size problem - CUDA Programming and Performance: My GPU has a maximum threads per block of 1024. The memory allocation on the GPU is performed to fit my kernel inputs...
  • K80 GPU disappears when tries to run 2 TensorFlow ...: We setup the BIOS to recognise the GPU memory: 83:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1) Subsystem: NVIDIA ...
