
[tune] Error running trial: CUDA error: out of memory

See original GitHub issue

I ran the script with access to 2 GPUs, each with 12 GB of memory, and the script itself was only taking about 2.5 GB, but I still got this error. Can someone please help debug?

2019-02-04 15:12:00,835	ERROR function_runner.py:83 -- Runner Thread raised error.
Traceback (most recent call last):
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/tune/function_runner.py", line 80, in run
    self._entrypoint(*self._entrypoint_args)
  File "mnist_pytorch.py", line 298, in <lambda>
    lambda cfg, rprtr: train_mnist(args, cfg, rprtr))
  File "mnist_pytorch.py", line 279, in train_mnist
    test()
  File "mnist_pytorch.py", line 243, in test
    outputs = model(input_var)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/graphics/toyota-pytorch/training-scaffold_new/unet/runs/2019-01-30_2FACO2ZU/pspnet_model.py", line 90, in forward
    x = self.layer3(x)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torchvision/models/resnet.py", line 84, in forward
    out = self.conv3(out)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: out of memory
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/tune/function_runner.py", line 84, in run
    raise e
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/tune/function_runner.py", line 80, in run
    self._entrypoint(*self._entrypoint_args)
  File "mnist_pytorch.py", line 298, in <lambda>
    lambda cfg, rprtr: train_mnist(args, cfg, rprtr))
  File "mnist_pytorch.py", line 279, in train_mnist
    test()
  File "mnist_pytorch.py", line 243, in test
    outputs = model(input_var)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/graphics/toyota-pytorch/training-scaffold_new/unet/runs/2019-01-30_2FACO2ZU/pspnet_model.py", line 90, in forward
    x = self.layer3(x)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torchvision/models/resnet.py", line 84, in forward
    out = self.conv3(out)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: out of memory

2019-02-04 15:12:01,455	ERROR trial_runner.py:412 -- Error processing event.
Traceback (most recent call last):
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 378, in _process_events
    result = self.trial_executor.fetch_result(trial)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 228, in fetch_result
    result = ray.get(trial_future[0])
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/worker.py", line 2211, in get
    raise value
ray.worker.RayTaskError: ray_worker (pid=30628, host=thousandeyes)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/tune/trainable.py", line 151, in train
    result = self._train()
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/tune/function_runner.py", line 122, in _train
    result = self._status_reporter._get_and_clear_status()
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/tune/function_runner.py", line 47, in _get_and_clear_status
    raise TuneError("Error running trial: " + str(self._error))
ray.tune.error.TuneError: Error running trial: CUDA error: out of memory

== Status ==
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 180.000: None | Iter 60.000: None | Iter 20.000: None
Bracket: Iter 180.000: None | Iter 60.000: None
Bracket: Iter 180.000: None
Resources requested: 0/12 CPUs, 0/2 GPUs
Memory usage on this node: 32.2/67.5 GB
Result logdir: /afs/csail.mit.edu/u/s/smadan/ray_results/exp
ERROR trials:
 - train_mnist_0_lr=0.02582,momentum=0.22039:	ERROR, 1 failures: /afs/csail.mit.edu/u/s/smadan/ray_results/exp/train_mnist_0_lr=0.02582,momentum=0.22039_2019-02-04_14-43-52y8vfjp8z/error_2019-02-04_14-51-00.txt
 - train_mnist_1_lr=0.056926,momentum=0.51918:	ERROR, 1 failures: /afs/csail.mit.edu/u/s/smadan/ray_results/exp/train_mnist_1_lr=0.056926,momentum=0.51918_2019-02-04_14-51-000o4yo8gk/error_2019-02-04_14-58-01.txt
 - train_mnist_2_lr=0.023527,momentum=0.32848:	ERROR, 1 failures: /afs/csail.mit.edu/u/s/smadan/ray_results/exp/train_mnist_2_lr=0.023527,momentum=0.32848_2019-02-04_14-58-0284_a3oon/error_2019-02-04_15-05-03.txt
 - train_mnist_3_lr=0.019479,momentum=0.35535:	ERROR, 1 failures: /afs/csail.mit.edu/u/s/smadan/ray_results/exp/train_mnist_3_lr=0.019479,momentum=0.35535_2019-02-04_15-05-03phsaa9_i/error_2019-02-04_15-12-01.txt

Any clues as to what is happening?
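For reference, the traceback above comes from Tune's reporter-based function API (the lambda cfg, rprtr wrapper around train_mnist). Below is a minimal sketch of how such a trial is typically launched with one full GPU reserved per trial, so each trial only sees its own device. The function body and the tune.run arguments are illustrative assumptions, not the actual contents of mnist_pytorch.py, and the exact entry points have changed across Ray versions.

import torch
import ray
from ray import tune


def train_mnist(config, reporter):
    # Tune sets CUDA_VISIBLE_DEVICES per trial, so "cuda" here refers to the
    # single GPU assigned to this trial, not to every GPU on the machine.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(784, 10).to(device)  # placeholder model
    for step in range(10):
        # ... real training and evaluation would happen here ...
        reporter(mean_accuracy=0.1 * step, training_iteration=step)


if __name__ == "__main__":
    ray.init()
    tune.run(
        train_mnist,
        name="exp",
        # One GPU per trial: with 2 GPUs at most 2 trials run concurrently,
        # and a new trial is not scheduled until a full GPU is free.
        resources_per_trial={"cpu": 1, "gpu": 1},
        num_samples=4,
    )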

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

6 reactions
JessicaSchrouff commented, Feb 15, 2019

Hi @Spandan-Madan, I’m also running out of memory on CUDA and wondering if there is a leak somewhere. Can you point out which error you fixed, exactly? Thank you! Best, Jessica
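One way to check whether memory really is leaking across iterations is to log the CUDA allocator counters between steps and see whether they keep climbing. Below is a minimal sketch using only standard torch.cuda introspection calls; where to call it is an assumption, not something from the original script.

import torch


def log_gpu_memory(tag=""):
    # Prints how much GPU memory live tensors currently hold, and the peak
    # allocation seen so far in this process.
    if not torch.cuda.is_available():
        return
    allocated = torch.cuda.memory_allocated() / 1024 ** 2
    peak = torch.cuda.max_memory_allocated() / 1024 ** 2
    print(f"[{tag}] allocated={allocated:.1f} MiB, peak={peak:.1f} MiB")


# Example: call it after every epoch; a steadily growing "allocated" value
# usually means some tensor (often a loss with its graph) is being retained.
# for epoch in range(num_epochs):
#     train_one_epoch(...)
#     log_gpu_memory(f"after epoch {epoch}")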

4 reactions
Spandan-Madan commented, Feb 6, 2019

Found the bug and corrected it! You were right, it wasn’t in the integration. It was a very silly error in the train function that was leading to a memory leak in PyTorch. Thanks for the help!
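The comment does not say what the error actually was, so the sketch below is only an illustration of the most common way a train/test function leaks GPU memory in PyTorch: accumulating the loss tensor with its autograd graph still attached, and running evaluation without torch.no_grad(). The function and variable names are assumptions, not code from this issue.

import torch


def train_one_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss = 0.0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        # Leak: `total_loss += loss` keeps every iteration's graph alive.
        # Fix: convert to a plain Python float first.
        total_loss += loss.item()
    return total_loss / len(loader)


def evaluate(model, loader, device):
    model.eval()
    correct = 0
    # Without no_grad(), activations are kept for a backward pass that never comes.
    with torch.no_grad():
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            correct += (model(inputs).argmax(dim=1) == targets).sum().item()
    return correct / len(loader.dataset)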

Read more comments on GitHub >

Top Results From Across the Web

How do I get tune.run to handle CUDA out of memory errors?
I would like it so that if CUDA runs out of memory on a given run, this is simply treated as a...
Read more >
Out of memory at every second trial using Ray Tune
I noticed that every second call reports an out of memory error. It looks like the memory is being freed, you can see in...
Read more >
GPU memory is empty, but CUDA out of memory error occurs
During training this code with ray tune (1 gpu for 1 trial), after few hours of training (about 20 trials) CUDA out of...
Read more >
Resolving CUDA Being Out of Memory With Gradient ...
So when you try to execute the training, and you don't have enough free CUDA memory available, then the framework you're using throws...
Read more >
Cuda out of memory during evaluation but training is fine
I have 2 gpus I can even fit batch size 8 or 16 during training but after first epoch, I always receive Cuda...
Read more >
