[tune] Error running trial: CUDA error: out of memory
I ran the script with access to 2 GPUs of 12 GB each, and the script itself was only taking about 2.5 GB, but I still got this error. Can someone please help me debug it?
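(For reference, the Tune side of a setup like this would typically reserve one GPU per trial via resources_per_trial, so at most two trials run at once on two GPUs. The sketch below is only illustrative: it is not the actual mnist_pytorch.py from the report, and the config keys and sample ranges are assumptions.)

```python
import ray
from ray import tune

def train_mnist(config):
    # Stand-in for the real trainable in mnist_pytorch.py, which builds the
    # model, runs train()/test() each iteration, and reports results.
    ...

ray.init(num_gpus=2)

tune.run(
    train_mnist,
    config={
        "lr": tune.uniform(0.001, 0.1),
        "momentum": tune.uniform(0.1, 0.9),
    },
    resources_per_trial={"cpu": 1, "gpu": 1},  # pin one 12 GB GPU per trial
    num_samples=4,
)
```

The full output from the failing run follows.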
2019-02-04 15:12:00,835 ERROR function_runner.py:83 -- Runner Thread raised error.
Traceback (most recent call last):
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/tune/function_runner.py", line 80, in run
self._entrypoint(*self._entrypoint_args)
File "mnist_pytorch.py", line 298, in <lambda>
lambda cfg, rprtr: train_mnist(args, cfg, rprtr))
File "mnist_pytorch.py", line 279, in train_mnist
test()
File "mnist_pytorch.py", line 243, in test
outputs = model(input_var)
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/data/graphics/toyota-pytorch/training-scaffold_new/unet/runs/2019-01-30_2FACO2ZU/pspnet_model.py", line 90, in forward
x = self.layer3(x)
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torchvision/models/resnet.py", line 84, in forward
out = self.conv3(out)
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: out of memory
Exception in thread Thread-1:
Traceback (most recent call last):
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/tune/function_runner.py", line 84, in run
raise e
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/tune/function_runner.py", line 80, in run
self._entrypoint(*self._entrypoint_args)
File "mnist_pytorch.py", line 298, in <lambda>
lambda cfg, rprtr: train_mnist(args, cfg, rprtr))
File "mnist_pytorch.py", line 279, in train_mnist
test()
File "mnist_pytorch.py", line 243, in test
outputs = model(input_var)
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/data/graphics/toyota-pytorch/training-scaffold_new/unet/runs/2019-01-30_2FACO2ZU/pspnet_model.py", line 90, in forward
x = self.layer3(x)
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torchvision/models/resnet.py", line 84, in forward
out = self.conv3(out)
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: out of memory
2019-02-04 15:12:01,455 ERROR trial_runner.py:412 -- Error processing event.
Traceback (most recent call last):
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 378, in _process_events
result = self.trial_executor.fetch_result(trial)
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 228, in fetch_result
result = ray.get(trial_future[0])
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/worker.py", line 2211, in get
raise value
ray.worker.RayTaskError: ray_worker (pid=30628, host=thousandeyes)
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/tune/trainable.py", line 151, in train
result = self._train()
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/tune/function_runner.py", line 122, in _train
result = self._status_reporter._get_and_clear_status()
File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/torch_tens/lib/python3.6/site-packages/ray/tune/function_runner.py", line 47, in _get_and_clear_status
raise TuneError("Error running trial: " + str(self._error))
ray.tune.error.TuneError: Error running trial: CUDA error: out of memory
== Status ==
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 180.000: None | Iter 60.000: None | Iter 20.000: None
Bracket: Iter 180.000: None | Iter 60.000: None
Bracket: Iter 180.000: None
Resources requested: 0/12 CPUs, 0/2 GPUs
Memory usage on this node: 32.2/67.5 GB
Result logdir: /afs/csail.mit.edu/u/s/smadan/ray_results/exp
ERROR trials:
- train_mnist_0_lr=0.02582,momentum=0.22039: ERROR, 1 failures: /afs/csail.mit.edu/u/s/smadan/ray_results/exp/train_mnist_0_lr=0.02582,momentum=0.22039_2019-02-04_14-43-52y8vfjp8z/error_2019-02-04_14-51-00.txt
- train_mnist_1_lr=0.056926,momentum=0.51918: ERROR, 1 failures: /afs/csail.mit.edu/u/s/smadan/ray_results/exp/train_mnist_1_lr=0.056926,momentum=0.51918_2019-02-04_14-51-000o4yo8gk/error_2019-02-04_14-58-01.txt
- train_mnist_2_lr=0.023527,momentum=0.32848: ERROR, 1 failures: /afs/csail.mit.edu/u/s/smadan/ray_results/exp/train_mnist_2_lr=0.023527,momentum=0.32848_2019-02-04_14-58-0284_a3oon/error_2019-02-04_15-05-03.txt
- train_mnist_3_lr=0.019479,momentum=0.35535: ERROR, 1 failures: /afs/csail.mit.edu/u/s/smadan/ray_results/exp/train_mnist_3_lr=0.019479,momentum=0.35535_2019-02-04_15-05-03phsaa9_i/error_2019-02-04_15-12-01.txt
Any clues as to what is happening?
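Both tracebacks point at the forward pass inside test() (the `outputs = model(input_var)` call). One thing worth checking in that situation, shown here as a hedged illustration rather than the exact code from the script, is whether evaluation runs under torch.no_grad(); without it, autograd keeps every intermediate activation alive and peak GPU memory during test() can grow well beyond what training needs.

```python
import torch

def test(model, loader, device):
    # Evaluation sketch: torch.no_grad() avoids storing activations for
    # backprop, which noticeably reduces peak GPU memory during test().
    # (Illustrative only; the real test() in mnist_pytorch.py differs.)
    model.eval()
    correct = 0
    with torch.no_grad():
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            correct += (outputs.argmax(dim=1) == targets).sum().item()
    return correct / len(loader.dataset)
```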
Top GitHub Comments
Hi @Spandan-Madan, I’m also running out of memory on CUDA and wondering if there is a leak somewhere. Can you point out which error you fixed exactly? Thank you! Best, Jessica
Found the bug and corrected it! You were right, it wasn’t in the integration. It was a very silly error in the train function which was leading to a memory leak in PyTorch. Thanks for the help!
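The exact bug is not spelled out in the thread, but a frequent cause of this kind of PyTorch training-loop leak is accumulating the loss tensor itself (together with its autograd graph) instead of a plain Python number. A minimal, self-contained sketch of that pattern, purely as an illustrative guess at what "a very silly error in the train function" can look like:

```python
import torch
import torch.nn as nn

# Tiny stand-in model and data; the point is the accumulation line below.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

total_loss = 0.0
for _ in range(100):
    inputs = torch.randn(32, 10)
    targets = torch.randint(0, 2, (32,))

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # Leaky version: `total_loss += loss` keeps every batch's graph alive,
    # so (GPU) memory grows each iteration until an out-of-memory error.
    total_loss += loss.item()  # .item() detaches to a Python float
```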