
Batch size, CUDA out of memory


Hi,

Great package! I am currently using CellBender v2.1. I ran into an issue caused by too high a GPU memory allocation.

[....]
cellbender:remove-background: [epoch 198]  average training loss: 1790.0774
cellbender:remove-background: [epoch 199]  average training loss: 1787.5904
cellbender:remove-background: [epoch 200]  average training loss: 1792.2732
cellbender:remove-background: [epoch 200] average test loss: 1773.5361
cellbender:remove-background: Inference procedure complete.
cellbender:remove-background: 2020-08-06 23:06:51
cellbender:remove-background: Preparing to write outputs to file...
cell counts tensor([ 8096.,  6134.,  1805.,  2324.,  5410.,  5546.,  5092.,  1724.,  5301.,
         1329.,  3143.,  5382.,   618.,  3833.,  6279.,  5066.,  2166.,  7982.,
         7920.,  3160.,  3907., 12285.,  3919.,  7285.,  1576.,  2011.,  1805.,
         5842.,  2688.,  8696.,  7202.,  7752.,  6153.,  4572.,  2058.,  7318.,
         3196.,  3786.,  7375.,  2877.,  2555.,  4179.,  1650.,  1776.,  4262.,
         4624.,  5314.,  5727.,  5470.,   693.,  4088.,  2078.,  1429.,  2127.,
         5265.,   649.,  4733.,  9864., 19365.,  7845.,  5621.,   699.,  3006.,
         3918.,  1308.,  6071.,  5948.,  1816.,  7495.,  3055.,  2016., 11080.,
         1845.,  1077., 14801.,  8278.,  2293.,  1718.,  1436.,  7260.,  1655.,
        13636.,  8505.,  1307.,  2211.,  7010.,  4465.,  1496.,  3346.,  8285.,
         1948.,  1978.,  2007.,  1693., 16839.,  6170.,  4675., 12212.,  1955.,
         1499.], device='cuda:0')
Traceback (most recent call last):
  File "path/to/bin/cellbender", line 33, in <module>
    sys.exit(load_entry_point('cellbender', 'console_scripts', 'cellbender')())
  File "path/to/CellBender/cellbender/base_cli.py", line 101, in main
    cli_dict[args.tool].run(args)
  File "path/to/cellbender/remove_background/cli.py", line 103, in run
    main(args)
  File "path/to/cellbender/remove_background/cli.py", line 196, in main
    run_remove_background(args)
  File "path/to/cellbender/remove_background/cli.py", line 166, in run_remove_background
    save_plots=True)
  File "path/to/cellbender/remove_background/data/dataset.py", line 524, in save_to_output_file
    inferred_count_matrix = self.posterior.mean
  File "path/to/cellbender/remove_background/infer.py", line 56, in mean
    self._get_mean()
  File "path/to/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "path/to/cellbender/remove_background/infer.py", line 402, in _get_mean
    alpha_est=map_est['alpha'])
  File "path/to/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "path/to/cellbender/remove_background/infer.py", line 1005, in _lambda_binary_search_given_fpr
    alpha_est=alpha_est)
  File "path/to/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "path/to/cellbender/remove_background/infer.py", line 809, in _calculate_expected_fpr_given_lambda_mult
    alpha_est=alpha_est)
  File "path/to/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "path/to/cellbender/remove_background/infer.py", line 604, in _true_counts_from_params
    .log_prob(noise_count_tensor)
  File "path/to/lib/python3.7/site-packages/torch/distributions/poisson.py", line 63, in log_prob
    return (rate.log() * value) - rate - (value + 1).lgamma()
RuntimeError: CUDA out of memory. Tried to allocate 1016.00 MiB (GPU 0; 3.97 GiB total capacity; 2.48 GiB already allocated; 378.79 MiB free; 2.58 GiB reserved in total by PyTorch)

Do you suggest changing environment settings, or adjusting the batch size? Changing “empty-drop-training-fraction” did not solve the issue. Thanks for your thoughts!
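For context on why batch size matters here: the line that fails, (rate.log() * value) - rate - (value + 1).lgamma(), builds several temporaries as large as the full rate and count tensors at once, so peak GPU memory appears to scale with how many cells are pushed through the posterior computation in one go. Below is a minimal sketch of the general workaround, evaluating the Poisson log-probability in slices along the cell dimension; the function name, tensor shapes, and chunk size are illustrative assumptions, not CellBender internals.

import torch
from torch.distributions import Poisson

def chunked_poisson_log_prob(rate: torch.Tensor,
                             counts: torch.Tensor,
                             chunk_size: int = 10) -> torch.Tensor:
    """Evaluate Poisson(rate).log_prob(counts) in slices along dim 0."""
    pieces = []
    for start in range(0, rate.shape[0], chunk_size):
        sl = slice(start, start + chunk_size)
        # Only this slice's temporaries (rate.log() * value, lgamma, ...)
        # are resident on the GPU at any one time.
        pieces.append(Poisson(rate[sl]).log_prob(counts[sl]))
    return torch.cat(pieces, dim=0)

if torch.cuda.is_available():
    # Hypothetical (cells, genes, noise-count) shaped tensors for illustration.
    rate = torch.rand(100, 5000, 20, device='cuda') + 1e-3
    counts = torch.randint(0, 50, (100, 5000, 20), device='cuda').float()
    log_p = chunked_poisson_log_prob(rate, counts, chunk_size=10)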

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 14 (6 by maintainers)

Top GitHub Comments

2 reactions
mtvector commented, Aug 20, 2020

Love this hack idea! I’ve also run into the same problem on some GPUs. It would be great to have a parameter the user can set based on the VRAM of the GPUs they have access to, so the run can still go through 😃
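A rough sketch of what such a parameter could look like, picking the number of cells per posterior batch from the VRAM PyTorch reports; the bytes-per-cell estimate, the headroom factor, and the 100-cell cap are placeholder assumptions, not measured CellBender figures.

import torch

def suggest_cell_batch_size(bytes_per_cell: int = 50 * 1024**2,
                            cap: int = 100) -> int:
    """Scale the cell batch down on small-VRAM GPUs, up to a fixed cap."""
    if not torch.cuda.is_available():
        return cap
    total = torch.cuda.get_device_properties(0).total_memory
    free = total - torch.cuda.memory_allocated(0)
    usable = int(0.5 * free)          # leave headroom for the caching allocator
    return max(1, min(cap, usable // bytes_per_cell))

print(suggest_cell_batch_size())      # smaller on a 4 GiB card than on a 24 GiB one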

1 reaction
smorabit commented, May 6, 2021

So I tried changing the batch size to 5 as suggested above, but I got the same error. However, after looking at #98, I changed the n_cells parameter from 100 to 10 (not sure if that is too low?), and with that it was able to finish running.
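One way to avoid hand-tuning that number is to wrap the memory-heavy step in a retry loop that halves the cell count whenever CUDA reports an out-of-memory error. In the sketch below, compute_posterior is a stand-in for whatever function hits the error, not a CellBender API.

import torch

def run_with_shrinking_batches(compute_posterior, n_cells: int = 100):
    """Retry compute_posterior(n_cells), halving n_cells on each CUDA OOM."""
    while n_cells >= 1:
        try:
            return compute_posterior(n_cells)
        except RuntimeError as err:
            if 'out of memory' not in str(err):
                raise
            torch.cuda.empty_cache()   # drop cached blocks before retrying
            n_cells //= 2              # 100 -> 50 -> 25 -> 12 -> ...
    raise RuntimeError('out of GPU memory even with a single cell per batch')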

