
CSV input produces OOM (out of memory) error on GPU

See original GitHub issue

Hi, I am using my image dataset with CSV files as described in the README file. Training starts and runs fine on CPU, but it produces an OOM error on GPU:

ResourceExhaustedError: OOM when allocating tensor with shape[216350,462] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: training_1/Adam/gradients/loss_1/classification_loss/pow_grad/Pow = Pow[T=DT_FLOAT, _class=["loc:@training_1/Adam/gradients/loss_1/classification_loss/pow_grad/Reshape"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](loss_1/classification_loss/Select_1, training_1/Adam/gradients/loss_1/regression_loss/Pow_grad/sub)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Node: loss_1/classification_loss/Mean_2/_755 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5547_loss_1/classification_loss/Mean_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

This happens when using the VGG16 backbone.

With the ResNet50 backbone:

ResourceExhaustedError: OOM when allocating tensor with shape[171704,462] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: loss_2/classification_loss/logistic_loss/zeros_like = ZerosLike[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](loss_2/classification_loss/Log)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Any help would be very much appreciated.
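
The hint in the traceback suggests enabling report_tensor_allocations_upon_oom to see which tensors are alive at the moment the OOM happens. A minimal sketch of how that is wired up with the TensorFlow 1.x session API; the session, training op, and feeds below are placeholders, not taken from the issue:

import tensorflow as tf

# Ask TensorFlow to report live tensor allocations when an OOM occurs,
# as suggested by the hint in the error message above.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Placeholder call -- substitute your own session, training op and feed dict:
# sess.run(train_op, feed_dict=feeds, options=run_options)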

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

1 reaction
hgaiser commented, Jun 13, 2018

One step processes one batch. The question is: what is an epoch? I’ve often seen it defined as “having processed all data once”, but what does that really mean? The network doesn’t care whether it has seen every image or not.

Practically speaking, an epoch is just a “period of time”, or in deep learning terms “a number of steps”. Configuring it as a number of steps, regardless of the number of images in the dataset, has some benefits. COCO has more than 100k images; if we set the step count to 100k, it would take a long time before a snapshot is stored and the results are inspected. Setting it to 10k gives much more frequent feedback on the progress of training. Additionally, when data augmentation through random transformations is enabled, what do you define as “all data”? Over all augmentation parameters, the dataset becomes enormous.
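
As a concrete illustration of treating an epoch as a fixed number of steps, here is a minimal Keras 2.x-style sketch. It is not the actual keras-retinanet training script; model, train_generator, and the snapshot path are placeholders:

from keras.callbacks import ModelCheckpoint

# An "epoch" here is simply 10,000 batches, independent of dataset size;
# a snapshot is written after each one so progress can be inspected regularly.
checkpoint = ModelCheckpoint('snapshots/model_{epoch:02d}.h5')  # placeholder path

model.fit_generator(               # `model` and `train_generator` are placeholders
    generator=train_generator,     # yields (inputs, targets) batches indefinitely
    steps_per_epoch=10000,         # defines how long an "epoch" is
    epochs=50,
    callbacks=[checkpoint],
)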

Anyway, your issue is resolved, so I’ll close this. For further discussion, I suggest joining the Slack channel.

0 reactions
patagona-snayyer commented, Jul 12, 2020

Hi hgaiser, thanks again. I have set a maximum image size of 1000 pixels and that seems to work for now; anything above 1000 pixels and the OOM appears. However, I have trained Faster R-CNN with a VGG16 backbone in MATLAB with image sizes up to 3000 pixels on the same GPU, and it worked fine. Any idea why it is so different in your implementation?

How do you change image size? And would that mean I need to change annotations too?
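
For reference, a rough sketch of the underlying idea (plain OpenCV/NumPy, not the library’s own code; the file path and box values are placeholders): capping the longest image side shrinks every downstream feature map, which is what reduces GPU memory, and the bounding boxes have to be scaled by the same factor. The keras-retinanet generators appear to apply this scaling to annotations automatically when they resize images, so the CSV files themselves should not need to change.

import cv2
import numpy as np

def cap_max_side(image, boxes, max_side=1000):
    """Resize so the longest side is at most max_side; scale boxes to match."""
    scale = min(1.0, float(max_side) / max(image.shape[:2]))
    if scale < 1.0:
        image = cv2.resize(image, None, fx=scale, fy=scale)
        boxes = boxes * scale  # (x1, y1, x2, y2) scale linearly with the image
    return image, boxes

# Hypothetical example: a 3000 px image is reduced to 1000 px on its longest side.
image = cv2.imread('example.jpg')                  # placeholder path
boxes = np.array([[100.0, 200.0, 900.0, 1200.0]])  # placeholder annotation
image, boxes = cap_max_side(image, boxes, max_side=1000)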

Read more comments on GitHub >

Top Results From Across the Web

  • Why am I getting GPU ran out of memory error here?
    Usually, when OOM errors take place, it is because the batch_size is too big or your VRAM is too small. In your case,...
  • Getting memory error when training a larger dataset on the GPU
    This error message is generated if the memory is not sufficient to manage the batch size.
  • Working with GPU - fastai
    This GPU memory is not accessible to your program's needs and it's not re-usable between processes. If you run two processes, each executing...
  • NCCL WARN Cuda failure 'out of memory' after multiple hours ...
    Hey, I'm using a single node with 4 T4 GPUs and getting gpu074:31329:31756 [3] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory' ...
  • 14 Simple Tips to save RAM memory for 1+GB dataset - Kaggle
    For medium-sized data, we're better off trying to get more out of pandas, ... sample_submission.csv - a sample submission file in the correct...
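
As the results above suggest, the usual levers are the batch size and how much GPU memory TensorFlow grabs up front. A hedged sketch for the TensorFlow 1.x / Keras 2.x setup this issue appears to use; the generator line is a hypothetical example, not a prescribed fix:

import tensorflow as tf
from keras import backend as K

# Let TensorFlow allocate GPU memory on demand instead of claiming it all
# up front (helps when other processes share the same GPU).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))

# Reducing the batch size where batches are built is the other common fix
# for ResourceExhaustedError during training, e.g. (hypothetical):
# train_generator = CSVGenerator(..., batch_size=1)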
