Using train_sr for second stage results in out-of-memory error
After trying second-stage learning, I get an out-of-memory error:
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 536,870,912 bytes (allocated so far: 10,940,589,568 bytes).
localuser@localuser-All-Series:~/vc/become-yukarin$ python3 train_sr.py config_sr.json ../2ndstage/
/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/connection/convolution_2d.py:228: PerformanceWarning: The best algo of conv fwd might not be selected due to lack of workspace size (8388608)
auto_tune=auto_tune, tensor_core=tensor_core)
predictor/loss
Exception in main training loop: Out of memory allocating 536,870,912 bytes (allocated so far: 10,940,589,568 bytes).
Traceback (most recent call last):
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/trainer.py", line 315, in run
update()
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
self.update_core()
File "/home/localuser/vc/become-yukarin/become_yukarin/updater/sr_updater.py", line 79, in update_core
opt_predictor.update(loss.get, 'predictor')
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/optimizer.py", line 685, in update
loss.backward(loss_scale=self._loss_scale)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 981, in backward
self._backward_main(retain_grad, loss_scale)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 1061, in _backward_main
func, target_input_indexes, out_grad, in_grad)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 179, in backprop_step
_reduce(gx)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 10, in _reduce
grad_list[:] = [chainer.functions.add(*grad_list)]
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 229, in add
return Add().apply((lhs, rhs))[0]
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/function_node.py", line 263, in apply
outputs = self.forward(in_data)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 156, in forward
y = utils.force_array(x[0] + x[1])
File "cupy/core/core.pyx", line 968, in cupy.core.core.ndarray.__add__
File "cupy/core/_kernel.pyx", line 930, in cupy.core._kernel.ufunc.__call__
File "cupy/core/_kernel.pyx", line 397, in cupy.core._kernel._get_out_args
File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
File "cupy/cuda/memory.pyx", line 1243, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1264, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1042, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1062, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
File "cupy/cuda/memory.pyx", line 784, in cupy.cuda.memory._try_malloc
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "train_sr.py", line 83, in <module>
trainer.run()
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/trainer.py", line 329, in run
six.reraise(*sys.exc_info())
File "/home/localuser/.local/lib/python3.7/site-packages/six.py", line 703, in reraise
raise value
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/trainer.py", line 315, in run
update()
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
self.update_core()
File "/home/localuser/vc/become-yukarin/become_yukarin/updater/sr_updater.py", line 79, in update_core
opt_predictor.update(loss.get, 'predictor')
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/optimizer.py", line 685, in update
loss.backward(loss_scale=self._loss_scale)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 981, in backward
self._backward_main(retain_grad, loss_scale)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 1061, in _backward_main
func, target_input_indexes, out_grad, in_grad)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 179, in backprop_step
_reduce(gx)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 10, in _reduce
grad_list[:] = [chainer.functions.add(*grad_list)]
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 229, in add
return Add().apply((lhs, rhs))[0]
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/function_node.py", line 263, in apply
outputs = self.forward(in_data)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 156, in forward
y = utils.force_array(x[0] + x[1])
File "cupy/core/core.pyx", line 968, in cupy.core.core.ndarray.__add__
File "cupy/core/_kernel.pyx", line 930, in cupy.core._kernel.ufunc.__call__
File "cupy/core/_kernel.pyx", line 397, in cupy.core._kernel._get_out_args
File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
File "cupy/cuda/memory.pyx", line 1243, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1264, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1042, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1062, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
File "cupy/cuda/memory.pyx", line 784, in cupy.cuda.memory._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 536,870,912 bytes (allocated so far: 10,940,589,568 bytes).
It seems to allocate too much, since hardly anything else is using the video card's memory:
$ nvidia-smi
Sun Jan 24 16:27:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... On | 00000000:05:00.0 On | N/A |
| 0% 39C P8 11W / 275W | 1MiB / 11177MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
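For what it's worth, the CUDA runtime can be asked for the same numbers from inside the Python process that runs training; a minimal sketch using CuPy (assumed to be the backend, as in the traceback above):

```python
import cupy

# Ask the CUDA runtime how much device memory is free/total from inside
# the process that will run training; this should roughly match nvidia-smi.
free_bytes, total_bytes = cupy.cuda.runtime.memGetInfo()
print(f"free: {free_bytes / 2**20:.0f} MiB / total: {total_bytes / 2**20:.0f} MiB")
```

If this reports the full ~11 GiB as free, the allocation failure comes from the training run itself growing past the card's capacity rather than from another process holding memory.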
Try lowering the batch size here: https://github.com/Hiroshiba/become-yukarin/blob/99a4998f4b7b9def2079c42be0edfc70201a1856/recipe/config_sr.json#L26 I will close this issue for now, but please reopen it if you need anything else.
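As a rough illustration of that suggestion, the batch size can be lowered either by editing config_sr.json directly or with a small script like this sketch (the "train"/"batchsize" key path is an assumption about the recipe's config layout, so adjust it to match your copy):

```python
import json
from pathlib import Path

# Hypothetical helper: halve the batch size in the recipe's SR config.
# The "train" -> "batchsize" key path is assumed; check your config_sr.json.
config_path = Path("recipe/config_sr.json")
config = json.loads(config_path.read_text())

config["train"]["batchsize"] = max(1, config["train"]["batchsize"] // 2)

config_path.write_text(json.dumps(config, indent=2))
print("new batchsize:", config["train"]["batchsize"])
```

Halving repeatedly until training fits in memory is a common way to find a workable value; smaller batches trade GPU memory for somewhat longer training.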
Thanks, it helped! Using about 20 files now with a batch size of 5.