
Memory error - Optimization to increase batch size


When training with samples of size 256x256 pixels, any batch size above 32 triggers a CUDA out-of-memory error. We need to optimize the process so that larger batch sizes fit in GPU memory.

RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/ .../THCStorage.cu:58

NOTE: May be specific to our (GC HPC) computing environment.
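As a rough illustration of why growing the batch exhausts GPU memory: activation memory scales linearly with batch size, while weights and optimizer state do not. The feature-map shapes below are hypothetical stand-ins, not measurements from unetsmall.

```python
# Back-of-the-envelope activation-memory estimate for 256x256 inputs.
# The per-sample feature maps kept for the backward pass are hypothetical,
# not taken from the actual unetsmall model.

BYTES_PER_FLOAT32 = 4

def activation_bytes(batch_size, feature_maps):
    """feature_maps: list of (channels, height, width) tensors kept for backward."""
    per_sample = sum(c * h * w for c, h, w in feature_maps) * BYTES_PER_FLOAT32
    return batch_size * per_sample

# Hypothetical encoder feature maps for a small U-Net on a 256x256 input.
maps = [(64, 256, 256), (128, 128, 128), (256, 64, 64), (512, 32, 32)]

for bs in (16, 32, 64):
    print(f"batch {bs}: ~{activation_bytes(bs, maps) / 2**30:.2f} GiB of activations")
```

Since activations dominate and scale linearly, they are usually what overflows first as the batch grows; this is the memory that checkpointing trades away.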

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

epeterson12 commented, Oct 26, 2018 (1 reaction)

Validation of Checkpointed results

Results when using checkpointing differ slightly from those of the original unetsmall model because cuDNN has non-deterministic kernels. I ran tests using suggestions from the PyTorch discussions https://discuss.pytorch.org/t/non-reproducible-result-with-gpu/1831 and https://discuss.pytorch.org/t/deterministic-non-deterministic-results-with-pytorch/9087.

Using the same sample files, I ran train_model.py twice with the unetsmall model (batch_size = 32) and once with the checkpointed_unet model (batch_size = 50). Then I classified some images with the resulting models.

The settings to try to get reproducible results were set as follows at the beginning of the code:

import random

import torch

# Force cuDNN to pick deterministic kernels and seed every RNG in play.
torch.backends.cudnn.deterministic = True
torch.manual_seed(999)
torch.cuda.manual_seed(999)
torch.cuda.manual_seed_all(999)
random.seed(0)

Also, the DataLoaders were instantiated with num_workers = 0 and shuffle = False. Running the original unetsmall configuration without checkpoints twice yielded two slightly different results. Here are some examples of the results obtained when running image_classification.py on one of the training images with each trained model. Sections of the images that weren't classified were left white.
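The DataLoader setup described above can be sketched as follows, assuming PyTorch; the in-memory dataset is a stand-in for the real sample files:

```python
# Sketch of the deterministic DataLoader setup: no worker processes and no
# shuffling, so batches arrive in the same order on every pass. The
# TensorDataset is a hypothetical stand-in for the real sample dataset.
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.backends.cudnn.deterministic = True
torch.manual_seed(999)

# Stand-in for the real 256x256 RGB training samples.
samples = TensorDataset(torch.randn(8, 3, 256, 256),
                        torch.zeros(8, dtype=torch.long))

loader = DataLoader(samples, batch_size=4, num_workers=0, shuffle=False)

first_pass = [x.sum().item() for x, _ in loader]
second_pass = [x.sum().item() for x, _ in loader]
assert first_pass == second_pass  # same batches, same order, every pass
```

With shuffle = False and a single worker, the only remaining source of run-to-run variation is the non-deterministic cuDNN kernels mentioned above.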

| Ground Truth | Current Code | Current Code 2 | Checkpointed |
| --- | --- | --- | --- |
| 1_rgb_8000_8000_ground_truth | 1_rgb_8000_8000_original | 1_rgb_8000_8000_original2 | 1_rgb_8000_8000_checkpoint |
| 1_rgb_0_0_ground_truth | 1_rgb_0_0_original | 1_rgb_0_0_original2 | 1_rgb_0_0_checkpoint |
| on_5297_1_ground_truth | on_5297_1_original | on_5297_1_original2 | on_5297_1_checkpointed |

Please note that the configurations and the number of samples weren't tuned for optimal results; the goal of these tests was to verify reproducibility. The number of training samples was set to the number of samples produced during sample creation.

global:
  samples_size: 256
  num_classes: 5
  data_path: /my/data/path
  number_of_bands: 3
  model_name: unetsmall     # One of unet, unetsmall, checkpointed_unet or ternausnet

sample:
  prep_csv_file: /my/prep/csv/file
  samples_dist: 200
  remove_background: True
  mask_input_image: False

training:
  output_path: /my/output/path
  num_trn_samples: 3356
  num_val_samples: 1370
  batch_size: 32
  num_epochs: 100
  learning_rate: 0.0001
  weight_decay: 0
  step_size: 4
  gamma: 0.9
  class_weights: False

models:
  unet:   &unet001
    dropout: False
    probability: 0.2    # Set with dropout
    pretrained: False   # optional
  unetsmall:
    <<: *unet001
  ternausnet:
    pretrained: ./models/TernausNet.pt    # Mandatory
  checkpointed_unet: 
    <<: *unet001

I think that the results of the checkpointed_unet are similar enough to the unetsmall's results for us to consider it a good memory- and time-optimized version of our unetsmall architecture. I have added it as a model choice for our program.
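For reference, a minimal sketch of the technique behind checkpointed_unet, assuming PyTorch's torch.utils.checkpoint; the two-layer block here is a toy stand-in for the real architecture:

```python
# Gradient checkpointing sketch: torch.utils.checkpoint recomputes a block's
# forward pass during backward instead of storing its activations, trading
# extra compute for lower memory. The tiny block below is a hypothetical
# stand-in, not the real checkpointed_unet.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())

    def forward(self, x):
        # Activations inside self.block are dropped after forward and
        # recomputed on the backward pass.
        return checkpoint(self.block, x, use_reentrant=False)

net = nn.Sequential(CheckpointedBlock(), nn.Conv2d(16, 1, 1))
x = torch.randn(2, 3, 64, 64, requires_grad=True)
net(x).sum().backward()  # gradients still flow through the checkpointed block
print(x.grad.shape)
```

Because only block boundaries keep their activations, peak memory per sample drops, which is what allowed the batch size to grow from 32 to 50 in the tests below.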

Throughout my tests, I observed that the models produced by training are more accurate when the random number generators aren't seeded. Observationally, the checkpointed_unet seems to be more affected by this than the unetsmall.

epeterson12 commented, Oct 17, 2018 (1 reaction)

Using checkpointing in the unetsmall net increases the speed of training. Tests were performed using the following parameters:

| # Training Samples | # Validation Samples | Sample Size | # Classes | # Epochs |
| --- | --- | --- | --- | --- |
| 781 | 495 | 256 | 11 | 200 |

| Learning Rate | Weight Decay | Step Size | Gamma | Class Weights | Dropout |
| --- | --- | --- | --- | --- | --- |
| 0.0001 | 0 | 4 | 0.9 | False | False |

(Figures: memory usage by batch size; processing time by batch size)

Best results

|  | Original | Checkpointed |
| --- | --- | --- |
| Max batch size | 32 | 50 |
| Time to complete training over 200 epochs | 350 min | 316 min |

Using checkpoints in the net design does seem to affect the results of training. Tests were run on the original and checkpointed nets with the random seed set to 7, and the resulting models gave similar but slightly different results. In the first test, the original algorithm gave results closer to the ground truth; in the second test, the checkpointed version of the net yielded better results.
