CSV input produces OOM (out of memory) error on GPU
Hi, I am using my own image dataset with CSV files, as described in the README. Training starts and runs fine on CPU, but it produces an OOM error on GPU:
ResourceExhaustedError: OOM when allocating tensor with shape[216350,462] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: training_1/Adam/gradients/loss_1/classification_loss/pow_grad/Pow = Pow[T=DT_FLOAT, _class=["loc:@training_1/Adam/gradients/loss_1/classification_loss/pow_grad/Reshape"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](loss_1/classification_loss/Select_1, training_1/Adam/gradients/loss_1/regression_loss/Pow_grad/sub)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: loss_1/classification_loss/Mean_2/_755 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5547_loss_1/classification_loss/Mean_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
This happens when using the VGG16 backbone. With the resnet50 backbone:
ResourceExhaustedError: OOM when allocating tensor with shape[171704,462] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: loss_2/classification_loss/logistic_loss/zeros_like = ZerosLike[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](loss_2/classification_loss/Log)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Any help would be greatly appreciated.
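As the hint in the traces suggests, TensorFlow 1.x can report which tensors were live at the moment of the OOM, which helps confirm whether the large `[216350, 462]` classification tensor is really the culprit. A minimal sketch, assuming a TF 1.x session; `sess`, `train_op`, and the `model.compile(..., options=...)` pass-through are placeholders or assumptions to verify against your Keras version:

```python
import tensorflow as tf  # TensorFlow 1.x, matching the traces above

# Ask TF to dump the list of live tensor allocations if an OOM occurs.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# With a raw session, pass the options per run call (sess/train_op are placeholders):
#   sess.run(train_op, options=run_options)
# Multi-backend Keras forwards extra compile kwargs to tf.Session.run, so this
# may also work (assumption, check your Keras version):
#   model.compile(optimizer='adam', loss=losses, options=run_options)
```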
Issue Analytics

- Created 5 years ago
- Comments: 10 (5 by maintainers)
Top Results From Across the Web

- "Why am I getting GPU ran out of memory error here?": Usually, when OOM errors take place, it is because the batch_size is too big or your VRAM is too small. In your case, ...
- "Getting memory error when training a larger dataset on the GPU": This error message is generated if the memory is not sufficient to manage the batch size.
- "Working with GPU" (fastai): This GPU memory is not accessible to your program's needs and it's not re-usable between processes. If you run two processes, each executing ...
- "NCCL WARN Cuda failure 'out of memory' after multiple hours": Hey, I'm using a single node with 4 T4 GPUs and getting gpu074:31329:31756 [3] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory' ...
- "14 Simple Tips to save RAM memory for 1+GB dataset" (Kaggle): For medium-sized data, we're better off trying to get more out of pandas, ... sample_submission.csv - a sample submission file in the correct ...
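Consistent with the results above, the two standard first-line mitigations are shrinking the batch size and stopping TensorFlow 1.x from grabbing all GPU memory up front. A hedged sketch; note that `allow_growth` only avoids pre-allocation, it does not create memory, so a smaller batch or smaller input images is usually still needed:

```python
import tensorflow as tf
from keras import backend as K

# TF 1.x: allocate GPU memory on demand instead of reserving it all at start,
# so the model's true memory high-water mark becomes visible in nvidia-smi.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))

# Then retry training with a smaller batch size (e.g. batch_size=1) and only
# grow it once training fits in VRAM.
```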
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
One step processes one batch. The real question is what an epoch is. It is often defined as "having processed all data once", but what does that actually mean? The network doesn't care whether it has seen every image or not.

Practically speaking, an epoch is just a "period of time", or in deep learning terms, "a number of steps". Configuring it as a number of steps, independent of the number of images in the dataset, has real benefits. COCO has more than 100k images; with a step size of 100k it would take a long time before a snapshot is stored and the results can be inspected, whereas a step size of 10k gives much more frequent feedback on training progress. Additionally, when data augmentation through random transformations is enabled, what do you define as "all data"? Across all augmentation parameters, the dataset becomes enormous.
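For reference, this is roughly what an epoch-as-a-number-of-steps setup looks like with a Keras generator; `model`, `train_generator`, and the snapshot path are placeholders, and the 10,000 mirrors the 10k step size mentioned above (a sketch, not the repo's actual training script):

```python
from keras.callbacks import ModelCheckpoint

# "Epoch" here is purely a bookkeeping unit: 10,000 batches, after which
# Keras runs callbacks (snapshotting, evaluation) and starts the next "epoch".
model.fit_generator(                  # `model` / `train_generator` are placeholders
    train_generator,                  # yields (inputs, targets) batches indefinitely
    steps_per_epoch=10000,            # one epoch = 10k steps, regardless of dataset size
    epochs=50,
    callbacks=[ModelCheckpoint('snapshot_{epoch:02d}.h5')],
)
```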
Anyway, your issue is resolved, so I'll close this. For further discussion, I suggest joining the Slack channel.
How do you change the image size? And would that mean I need to change the annotations too?
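To sketch an answer to that last question: if the loader rescales images on the fly and rescales the ground-truth boxes by the same factor, the CSV annotations can stay in original-image coordinates and never need editing. A minimal, hypothetical helper (`resize_with_boxes` and the 800/1333 defaults are illustrative, not this repo's API); smaller `min_side`/`max_side` values shrink every downstream tensor, which is a common way around the OOM above:

```python
import numpy as np
import cv2

def resize_with_boxes(image, boxes, min_side=800, max_side=1333):
    """Hypothetical helper: resize so the short side is min_side (capping the
    long side at max_side) and scale the [x1, y1, x2, y2] boxes to match."""
    h, w = image.shape[:2]
    scale = min_side / min(h, w)
    if max(h, w) * scale > max_side:   # cap the long side
        scale = max_side / max(h, w)
    resized = cv2.resize(image, (int(round(w * scale)), int(round(h * scale))))
    # Boxes are scaled by the same factor, so the CSV stays untouched.
    return resized, np.asarray(boxes, dtype=np.float32) * scale, scale
```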