Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Using train_sr for the second stage results in an out-of-memory error

See original GitHub issue

When I try second-stage training (train_sr.py), I get an out-of-memory error:

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 536,870,912 bytes (allocated so far: 10,940,589,568 bytes).
localuser@localuser-All-Series:~/vc/become-yukarin$ python3 train_sr.py config_sr.json ../2ndstage/
/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/connection/convolution_2d.py:228: PerformanceWarning: The best algo of conv fwd might not be selected due to lack of workspace size (8388608)
  auto_tune=auto_tune, tensor_core=tensor_core)
predictor/loss
Exception in main training loop: Out of memory allocating 536,870,912 bytes (allocated so far: 10,940,589,568 bytes).
Traceback (most recent call last):
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/trainer.py", line 315, in run
    update()
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
    self.update_core()
  File "/home/localuser/vc/become-yukarin/become_yukarin/updater/sr_updater.py", line 79, in update_core
    opt_predictor.update(loss.get, 'predictor')
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/optimizer.py", line 685, in update
    loss.backward(loss_scale=self._loss_scale)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 981, in backward
    self._backward_main(retain_grad, loss_scale)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 1061, in _backward_main
    func, target_input_indexes, out_grad, in_grad)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 179, in backprop_step
    _reduce(gx)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 10, in _reduce
    grad_list[:] = [chainer.functions.add(*grad_list)]
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 229, in add
    return Add().apply((lhs, rhs))[0]
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/function_node.py", line 263, in apply
    outputs = self.forward(in_data)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 156, in forward
    y = utils.force_array(x[0] + x[1])
  File "cupy/core/core.pyx", line 968, in cupy.core.core.ndarray.__add__
  File "cupy/core/_kernel.pyx", line 930, in cupy.core._kernel.ufunc.__call__
  File "cupy/core/_kernel.pyx", line 397, in cupy.core._kernel._get_out_args
  File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 1243, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1264, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1042, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1062, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 784, in cupy.cuda.memory._try_malloc
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "train_sr.py", line 83, in <module>
    trainer.run()
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/trainer.py", line 329, in run
    six.reraise(*sys.exc_info())
  File "/home/localuser/.local/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/trainer.py", line 315, in run
    update()
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
    self.update_core()
  File "/home/localuser/vc/become-yukarin/become_yukarin/updater/sr_updater.py", line 79, in update_core
    opt_predictor.update(loss.get, 'predictor')
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/optimizer.py", line 685, in update
    loss.backward(loss_scale=self._loss_scale)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 981, in backward
    self._backward_main(retain_grad, loss_scale)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 1061, in _backward_main
    func, target_input_indexes, out_grad, in_grad)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 179, in backprop_step
    _reduce(gx)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 10, in _reduce
    grad_list[:] = [chainer.functions.add(*grad_list)]
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 229, in add
    return Add().apply((lhs, rhs))[0]
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/function_node.py", line 263, in apply
    outputs = self.forward(in_data)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 156, in forward
    y = utils.force_array(x[0] + x[1])
  File "cupy/core/core.pyx", line 968, in cupy.core.core.ndarray.__add__
  File "cupy/core/_kernel.pyx", line 930, in cupy.core._kernel.ufunc.__call__
  File "cupy/core/_kernel.pyx", line 397, in cupy.core._kernel._get_out_args
  File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 1243, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1264, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1042, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1062, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 784, in cupy.cuda.memory._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 536,870,912 bytes (allocated so far: 10,940,589,568 bytes).

It seems to allocate too much: the failed request is only 512 MiB (536,870,912 bytes), yet CuPy reports about 10.2 GiB (10,940,589,568 bytes) already allocated on the 11,177 MiB card, even though nvidia-smi shows hardly anything else using the video card's memory:

$ nvidia-smi
Sun Jan 24 16:27:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:05:00.0  On |                  N/A |
|  0%   39C    P8    11W / 275W |      1MiB / 11177MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
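
A note for anyone landing here: nvidia-smi only shows memory held by processes that are still running, while the failing allocation comes from CuPy's memory pool inside the training process itself. A minimal sketch for checking what CuPy actually sees from inside Python, assuming the same CuPy install that appears in the traceback:

import cupy

# Free vs. total device memory as reported by the CUDA runtime (in bytes)
free_bytes, total_bytes = cupy.cuda.runtime.memGetInfo()
print('device free: %.2f GiB / total: %.2f GiB'
      % (free_bytes / 2**30, total_bytes / 2**30))

# How much CuPy's default memory pool has already claimed for this process
pool = cupy.get_default_memory_pool()
print('pool used: %.2f GiB, pool reserved: %.2f GiB'
      % (pool.used_bytes() / 2**30, pool.total_bytes() / 2**30))

If the pool already reports close to 11 GiB reserved while training, the card really is full for this process, even though nvidia-smi run afterwards shows almost nothing in use.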

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

1 reaction
Rose-sys commented, Jan 29, 2021

Thanks, it helped! I'm using about 20 files now with a batch size of 5.

0 reactions
Hiroshiba commented, Jan 29, 2021

Try lowering the batch size here: https://github.com/Hiroshiba/become-yukarin/blob/99a4998f4b7b9def2079c42be0edfc70201a1856/recipe/config_sr.json#L26
I'll close this issue for now, but please reopen it if you need anything else.
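
For reference, a minimal sketch of that change, done from Python rather than by hand. The 'train' → 'batchsize' key path is an assumption; use whichever key actually sits at the line linked above:

import json

# Hypothetical sketch: shrink the batch size in config_sr.json before rerunning
# train_sr.py. The 'train'/'batchsize' key path is an assumption -- edit whatever
# key sits at the linked line of the real config.
with open('config_sr.json') as f:
    config = json.load(f)

config['train']['batchsize'] = 8  # try a smaller value and retry

with open('config_sr.json', 'w') as f:
    json.dump(config, f, indent=2)

Activation memory grows roughly linearly with batch size, so keep lowering the value until training fits on the ~11 GiB card.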

Read more comments on GitHub >

Top Results From Across the Web

Guide :: Fixing the Out of Memory Error - Steam Community
In my experience, Train Simulator is one of these games, so I strongly advise you to turn Game Mode off. Open the Start...
Read more >
out of memory when using model.predict() #5337 - GitHub
Graphs in train phase and in predict phase are usually different, so they can result in a different memory allocation resulting in different ......
Read more >
CUDA out of memory when using Trainer with compute_metrics
I'm trying to finetune a Bart model and while I can get it to train, I always run out of memory during the...
Read more >
Manage memory differently on train and test time pytorch
While the model should use more memory on train phase because all the mid-step tensors(feature maps) are saved and with separable convolution ...
Read more >
Out of memory error during evaluation but training works fine!
Surprisingly my old programs are throwing an out of memory error during evaluation (in eval() mode) but training works just fine. I am...
Read more >
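
The last few results above all circle the same point: training keeps every intermediate activation alive so backpropagation can run, while inference does not need to. In Chainer (which become-yukarin is built on), inference code can be wrapped in no_backprop_mode so no graph is retained. A minimal sketch with a hypothetical model and input, not code from this repository:

import chainer
import numpy as np

# Hypothetical model and input, purely for illustration
model = chainer.links.Linear(80, 80)
x = np.random.rand(5, 80).astype(np.float32)

# Training-style forward pass: the computational graph (and every intermediate
# activation) is kept alive so loss.backward() can be called later.
y_train = model(x)

# Inference-style forward pass: no graph is built, so intermediates are freed
# as soon as they are no longer referenced, which needs far less memory.
with chainer.no_backprop_mode(), chainer.using_config('train', False):
    y_eval = model(x)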
