Using train_sr for second stage results in out-of-memory error
After trying second-stage learning, I get an out-of-memory error:
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 536,870,912 bytes (allocated so far: 10,940,589,568 bytes).
localuser@localuser-All-Series:~/vc/become-yukarin$ python3 train_sr.py config_sr.json ../2ndstage/
/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/connection/convolution_2d.py:228: PerformanceWarning: The best algo of conv fwd might not be selected due to lack of workspace size (8388608)
auto_tune=auto_tune, tensor_core=tensor_core)
predictor/loss
Exception in main training loop: Out of memory allocating 536,870,912 bytes (allocated so far: 10,940,589,568 bytes).
Traceback (most recent call last):
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/trainer.py", line 315, in run
update()
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
self.update_core()
File "/home/localuser/vc/become-yukarin/become_yukarin/updater/sr_updater.py", line 79, in update_core
opt_predictor.update(loss.get, 'predictor')
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/optimizer.py", line 685, in update
loss.backward(loss_scale=self._loss_scale)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 981, in backward
self._backward_main(retain_grad, loss_scale)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 1061, in _backward_main
func, target_input_indexes, out_grad, in_grad)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 179, in backprop_step
_reduce(gx)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 10, in _reduce
grad_list[:] = [chainer.functions.add(*grad_list)]
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 229, in add
return Add().apply((lhs, rhs))[0]
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/function_node.py", line 263, in apply
outputs = self.forward(in_data)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 156, in forward
y = utils.force_array(x[0] + x[1])
File "cupy/core/core.pyx", line 968, in cupy.core.core.ndarray.__add__
File "cupy/core/_kernel.pyx", line 930, in cupy.core._kernel.ufunc.__call__
File "cupy/core/_kernel.pyx", line 397, in cupy.core._kernel._get_out_args
File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
File "cupy/cuda/memory.pyx", line 1243, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1264, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1042, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1062, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
File "cupy/cuda/memory.pyx", line 784, in cupy.cuda.memory._try_malloc
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "train_sr.py", line 83, in <module>
trainer.run()
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/trainer.py", line 329, in run
six.reraise(*sys.exc_info())
File "/home/localuser/.local/lib/python3.7/site-packages/six.py", line 703, in reraise
raise value
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/trainer.py", line 315, in run
update()
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
self.update_core()
File "/home/localuser/vc/become-yukarin/become_yukarin/updater/sr_updater.py", line 79, in update_core
opt_predictor.update(loss.get, 'predictor')
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/optimizer.py", line 685, in update
loss.backward(loss_scale=self._loss_scale)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 981, in backward
self._backward_main(retain_grad, loss_scale)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 1061, in _backward_main
func, target_input_indexes, out_grad, in_grad)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 179, in backprop_step
_reduce(gx)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 10, in _reduce
grad_list[:] = [chainer.functions.add(*grad_list)]
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 229, in add
return Add().apply((lhs, rhs))[0]
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/function_node.py", line 263, in apply
outputs = self.forward(in_data)
File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 156, in forward
y = utils.force_array(x[0] + x[1])
File "cupy/core/core.pyx", line 968, in cupy.core.core.ndarray.__add__
File "cupy/core/_kernel.pyx", line 930, in cupy.core._kernel.ufunc.__call__
File "cupy/core/_kernel.pyx", line 397, in cupy.core._kernel._get_out_args
File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
File "cupy/cuda/memory.pyx", line 1243, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1264, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1042, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1062, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
File "cupy/cuda/memory.pyx", line 784, in cupy.cuda.memory._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 536,870,912 bytes (allocated so far: 10,940,589,568 bytes).
It seems to allocate too much, since hardly anything else is using the video card's memory:
$ nvidia-smi
Sun Jan 24 16:27:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... On | 00000000:05:00.0 On | N/A |
| 0% 39C P8 11W / 275W | 1MiB / 11177MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
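For what it's worth, the CUDA runtime can be asked for the same numbers from inside the Python process that runs training; a minimal sketch using CuPy (assumed to be the backend, as in the traceback above):

```python
import cupy

# Ask the CUDA runtime how much device memory is free/total from inside
# the process that will run training; this should roughly match nvidia-smi.
free_bytes, total_bytes = cupy.cuda.runtime.memGetInfo()
print(f"free: {free_bytes / 2**20:.0f} MiB / total: {total_bytes / 2**20:.0f} MiB")
```

If this reports the full ~11 GiB as free, the allocation failure comes from the training run itself growing past the card's capacity rather than from another process holding memory.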
Try lowering the batch size here: https://github.com/Hiroshiba/become-yukarin/blob/99a4998f4b7b9def2079c42be0edfc70201a1856/recipe/config_sr.json#L26 I will close this issue for now, but please reopen it if you need anything else.
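As a rough illustration of that suggestion, the batch size can be lowered either by editing config_sr.json directly or with a small script like this sketch (the "train"/"batchsize" key path is an assumption about the recipe's config layout, so adjust it to match your copy):

```python
import json
from pathlib import Path

# Hypothetical helper: halve the batch size in the recipe's SR config.
# The "train" -> "batchsize" key path is assumed; check your config_sr.json.
config_path = Path("recipe/config_sr.json")
config = json.loads(config_path.read_text())

config["train"]["batchsize"] = max(1, config["train"]["batchsize"] // 2)

config_path.write_text(json.dumps(config, indent=2))
print("new batchsize:", config["train"]["batchsize"])
```

Halving repeatedly until training fits in memory is a common way to find a workable value; smaller batches trade GPU memory for somewhat longer training.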
Thanks, it helped! Using about 20 files now with a batch size of 5.