
CSV input produces OOM (out of memory) error on GPU

See original GitHub issue

Hi, I am using my image dataset with CSV files as described in the README file. Training starts and runs fine on CPU, but it produces an OOM error on GPU:

ResourceExhaustedError: OOM when allocating tensor with shape[216350,462] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: training_1/Adam/gradients/loss_1/classification_loss/pow_grad/Pow = Pow[T=DT_FLOAT, _class=["loc:@training_1/Adam/gradients/loss_1/classification_loss/pow_grad/Reshape"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](loss_1/classification_loss/Select_1, training_1/Adam/gradients/loss_1/regression_loss/Pow_grad/sub)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Node: loss_1/classification_loss/Mean_2/_755 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5547_loss_1/classification_loss/Mean_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

This happens when using the VGG16 backbone.

With the ResNet50 backbone:

ResourceExhaustedError: OOM when allocating tensor with shape[171704,462] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: loss_2/classification_loss/logistic_loss/zeros_like = ZerosLike[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](loss_2/classification_loss/Log)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Any help would be very much appreciated.
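
The hint in the traceback suggests enabling report_tensor_allocations_upon_oom to see which tensors are alive at the moment the OOM happens. A minimal sketch of how that is wired up with the TensorFlow 1.x session API; the session, training op, and feeds below are placeholders, not taken from the issue:

import tensorflow as tf

# Ask TensorFlow to report live tensor allocations when an OOM occurs,
# as suggested by the hint in the error message above.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Placeholder call -- substitute your own session, training op and feed dict:
# sess.run(train_op, feed_dict=feeds, options=run_options)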

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

1 reaction
hgaiser commented, Jun 13, 2018

One step processes one batch. The question is: what is an epoch? I’ve often seen it defined as “having processed all data once”, but what does that really mean? The network doesn’t care whether it has seen every image or not.

Practically speaking, an epoch is just a “period of time”, or in deep learning terms “a number of steps”. Configuring it as a number of steps, regardless of the number of images in the dataset, has some benefits. COCO has more than 100k images; if we set the step count to 100k, it would take a long time before a snapshot is stored and the results are inspected. Setting it to 10k gives much more frequent feedback on the progress of training. Additionally, when data augmentation through random transformations is enabled, what do you define as “all data”? Over all augmentation parameters, the dataset becomes enormous.
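
As a concrete illustration of treating an epoch as a fixed number of steps, here is a minimal Keras 2.x-style sketch. It is not the actual keras-retinanet training script; model, train_generator, and the snapshot path are placeholders:

from keras.callbacks import ModelCheckpoint

# An "epoch" here is simply 10,000 batches, independent of dataset size;
# a snapshot is written after each one so progress can be inspected regularly.
checkpoint = ModelCheckpoint('snapshots/model_{epoch:02d}.h5')  # placeholder path

model.fit_generator(               # `model` and `train_generator` are placeholders
    generator=train_generator,     # yields (inputs, targets) batches indefinitely
    steps_per_epoch=10000,         # defines how long an "epoch" is
    epochs=50,
    callbacks=[checkpoint],
)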

Anyway, your issue is resolved, so I’ll close this. For further discussion, I suggest joining the Slack channel.

0 reactions
patagona-snayyer commented, Jul 12, 2020

Hi hgaiser, thanks again. I have set a maximum image size of 1000 pixels and that seems to work for now; anything above 1000 pixels and the OOM appears. However, I have trained Faster R-CNN with a VGG16 backbone in MATLAB with image sizes up to 3000 pixels on the same GPU, and it worked fine. Any idea why it is so different in your implementation?

How do you change image size? And would that mean I need to change annotations too?
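
For reference, a rough sketch of the underlying idea (plain OpenCV/NumPy, not the library’s own code; the file path and box values are placeholders): capping the longest image side shrinks every downstream feature map, which is what reduces GPU memory, and the bounding boxes have to be scaled by the same factor. The keras-retinanet generators appear to apply this scaling to annotations automatically when they resize images, so the CSV files themselves should not need to change.

import cv2
import numpy as np

def cap_max_side(image, boxes, max_side=1000):
    """Resize so the longest side is at most max_side; scale boxes to match."""
    scale = min(1.0, float(max_side) / max(image.shape[:2]))
    if scale < 1.0:
        image = cv2.resize(image, None, fx=scale, fy=scale)
        boxes = boxes * scale  # (x1, y1, x2, y2) scale linearly with the image
    return image, boxes

# Hypothetical example: a 3000 px image is reduced to 1000 px on its longest side.
image = cv2.imread('example.jpg')                  # placeholder path
boxes = np.array([[100.0, 200.0, 900.0, 1200.0]])  # placeholder annotation
image, boxes = cap_max_side(image, boxes, max_side=1000)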

Read more comments on GitHub >

Top Results From Across the Web

  • Why am I getting GPU ran out of memory error here?
    Usually, when OOM errors take place, it is because the batch_size is too big or your VRAM is too small. In your case,...
  • Getting memory error when training a larger dataset on the GPU
    This error message is generated if the memory is not sufficient to manage the batch size.
  • Working with GPU - fastai
    This GPU memory is not accessible to your program's needs and it's not re-usable between processes. If you run two processes, each executing...
  • NCCL WARN Cuda failure 'out of memory' after multiple hours ...
    Hey, I'm using a single node with 4 T4 GPUs and getting gpu074:31329:31756 [3] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory' ...
  • 14 Simple Tips to save RAM memory for 1+GB dataset - Kaggle
    For medium-sized data, we're better off trying to get more out of pandas, ... sample_submission.csv - a sample submission file in the correct...
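
As the results above suggest, the usual levers are the batch size and how much GPU memory TensorFlow grabs up front. A hedged sketch for the TensorFlow 1.x / Keras 2.x setup this issue appears to use; the generator line is a hypothetical example, not a prescribed fix:

import tensorflow as tf
from keras import backend as K

# Let TensorFlow allocate GPU memory on demand instead of claiming it all
# up front (helps when other processes share the same GPU).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))

# Reducing the batch size where batches are built is the other common fix
# for ResourceExhaustedError during training, e.g. (hypothetical):
# train_generator = CSVGenerator(..., batch_size=1)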
