Training slows down dramatically as epochs proceed in CPU mode
Training speed dramatically slows down as epochs proceed in CPU mode, especially when the batch size is small. In the example below, the 1st epoch took about 30 seconds while the 3rd took about 39 seconds.
The same was observed with multiple CPU models (Core i5 6400, Xeon E5-2699v3), multiple Python versions (2.7.13, 3.6.0), and multiple chainer versions (1.23, 2.0.0b1).
~/src/chainer/examples/mnist $ export OMP_NUM_THREADS=1
~/src/chainer/examples/mnist $ ./train_mnist.py -g -1 -b 50 -e 3
GPU: -1
# unit: 1000
# Minibatch-size: 50
# epoch: 3
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.179801 0.108207 0.94595 0.9655 30.1475
2 0.0772594 0.0726388 0.976 0.9757 67.1415
3 0.0519136 0.0821774 0.983467 0.9756 106.503
The OS is Debian 9 and gcc is 6.3 (which might be too new). The Python runtime and libraries were installed via Anaconda. export OMP_NUM_THREADS=1
limits the number of threads used by BLAS to 1 (I use this because more than one thread does not improve performance, as the matrices processed in the MNIST example are too small).
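For reference, a minimal sketch (my own addition, not part of the original report) of applying the same thread limit from inside a Python script; the environment variable must be set before numpy is imported, because the BLAS library typically reads it when it is first loaded:
# Hypothetical standalone script: set the limit before importing numpy,
# so the BLAS backend starts with a single thread.
import os
os.environ["OMP_NUM_THREADS"] = "1"

import numpy

a = numpy.random.rand(100, 100).astype(numpy.float32)
b = numpy.random.rand(100, 100).astype(numpy.float32)
c = a.dot(b)  # runs single-threaded; matrices this small do not benefit from more threads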
Below are profiling results of update_core_cpu() in optimizers/adam.py obtained with line_profiler. The Time column shows the elapsed time spent on each line in microseconds; in the 3rd epoch, lines 60 and 62 took roughly 1.9x and 1.6x as long as in the 1st epoch, respectively.
- 1st epoch
Total time: 14.883 s
File: /home/soramichi/src/anaconda2/lib/python2.7/site-packages/chainer/optimizers/adam.py
Function: update_core_cpu at line 52
Line # Hits Time Per Hit % Time Line Contents
==============================================================
52 @profile
53 def update_core_cpu(self, param):
54 7200 14427 2.0 0.1 grad = param.grad
55 7200 4218 0.6 0.0 if grad is None:
56 return
57 7200 3815 0.5 0.0 hp = self.hyperparam
58 7200 11194 1.6 0.1 m, v = self.state['m'], self.state['v']
59
60 7200 4065881 564.7 27.3 m += (1 - hp.beta1) * (grad - m)
61 7200 3986827 553.7 26.8 v += (1 - hp.beta2) * (grad * grad - v)
62 7200 6796601 944.0 45.7 param.data -= self.lr * m / (numpy.sqrt(v) + hp.eps)
- 3rd epoch
Total time: 22.6429 s
File: /home/soramichi/src/anaconda2/lib/python2.7/site-packages/chainer/optimizers/adam.py
Function: update_core_cpu at line 52
Line # Hits Time Per Hit % Time Line Contents
==============================================================
52 @profile
53 def update_core_cpu(self, param):
54 7200 14359 2.0 0.1 grad = param.grad
55 7200 4162 0.6 0.0 if grad is None:
56 return
57 7200 3933 0.5 0.0 hp = self.hyperparam
58 7200 11482 1.6 0.1 m, v = self.state['m'], self.state['v']
59
60 7200 7869889 1093.0 34.8 m += (1 - hp.beta1) * (grad - m)
61 7200 3928259 545.6 17.3 v += (1 - hp.beta2) * (grad * grad - v)
62 7200 10810785 1501.5 47.7 param.data -= self.lr * m / (numpy.sqrt(v) + hp.eps)
How to reproduce the profiling results:
- Run the mnist example and save a snapshot at the end of every epoch: ./train_mnist.py -g -1 -b 50 -e 3 -f 1
- Put the @profile decorator on the update_core_cpu function in optimizers/adam.py.
- To take a profile of only the target epoch (epoch 1 or 3), add an extension to the mnist example that raises an exception at the end of the target epoch:
# train_mnist.py
...
def die(trainer):
raise ValueError("die")
...
def main():
...
# die after the 1st epoch
trainer.extend(die, trigger=(1, 'epoch'))
...
- Resume the mnist example from the target epoch, using the kernprof command (available after pip install line_profiler):
# dies after the 1st epoch finishes,
# so that the profiling result contains only the 1st epoch
$ kernprof -v -l ./train_mnist.py -g -1 -b 50 -e 3
# resume from the end of 2nd epoch, dies when the 3rd epoch finishes
# (don't forget to change the trigger of the die extension)
$ kernprof -v -l ./train_mnist.py -g -1 -b 50 -e 3 --resume result/snapshot_iter_2400
Top GitHub Comments
@niboshi Thank you for the detailed investigation.
I suspect this post might be relevant. It says that numpy gets slower when the operands are very small numbers (called denormals), which require special handling because they do not fit in the normal floating-point format.
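As a quick sanity check of this hypothesis (a sketch of my own, not taken from the issue), one can time the same numpy arithmetic on normal and subnormal (denormal) float32 operands; on CPUs that handle subnormals via microcode assists, the subnormal case is typically much slower:
# Compare elementwise numpy arithmetic on normal vs. subnormal (denormal) float32 values.
# The absolute numbers depend on the CPU; only the relative slowdown matters here.
import timeit
import numpy

n = 1000000
normal = numpy.full(n, 1e-3, dtype=numpy.float32)
subnormal = numpy.full(n, 1e-40, dtype=numpy.float32)  # below float32's normal range (~1.18e-38)

for name, x in [("normal", normal), ("subnormal", subnormal)]:
    t = timeit.timeit(lambda: x * 0.5 + x, number=100)
    print("%-10s %.3f s" % (name, t))
If the subnormal case is markedly slower, that would be consistent with the hypothesis that some of the arrays touched by the Adam update drift toward very small values as training proceeds.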
I am reopening the issue. It should be closed only after the fix is merged into the code.