Training slows down dramatically as epochs proceed in CPU mode
Training speed dramatically slows down as epochs proceed in CPU mode, especially when the batch size is small. In the example below, the 1st epoch took about 30 seconds while the 3rd took about 39 seconds.
The same was observed with multiple CPU models (Core i5 6400, Xeon E5-2699v3), multiple Python versions (2.7.13, 3.6.0), and multiple chainer versions (1.23, 2.0.0b1).
~/src/chainer/examples/mnist $ export OMP_NUM_THREADS=1
~/src/chainer/examples/mnist $ ./train_mnist.py -g -1 -b 50 -e 3
GPU: -1
# unit: 1000
# Minibatch-size: 50
# epoch: 3
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.179801 0.108207 0.94595 0.9655 30.1475
2 0.0772594 0.0726388 0.976 0.9757 67.1415
3 0.0519136 0.0821774 0.983467 0.9756 106.503
The OS is Debian 9 and gcc is 6.3 (which might be too new). The Python runtime and libraries were installed via Anaconda. export OMP_NUM_THREADS=1
limits the number of threads used by BLAS to 1 (I use this because more than one thread does not improve performance, as the matrices processed in the MNIST example are too small).
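For reference, a minimal sketch (my own addition, not part of the original report) of applying the same thread limit from inside a Python script; the environment variable must be set before numpy is imported, because the BLAS library typically reads it when it is first loaded:
# Hypothetical standalone script: set the limit before importing numpy,
# so the BLAS backend starts with a single thread.
import os
os.environ["OMP_NUM_THREADS"] = "1"

import numpy

a = numpy.random.rand(100, 100).astype(numpy.float32)
b = numpy.random.rand(100, 100).astype(numpy.float32)
c = a.dot(b)  # runs single-threaded; matrices this small do not benefit from more threads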
Below are profiling results of update_core_cpu() in optimizers/adam.py obtained with line_profiler. The Time column shows the elapsed time spent on each line in microseconds; in the 3rd epoch, lines 60 and 62 took roughly 1.9x and 1.6x as long as in the 1st epoch, respectively.
- 1st epoch
Total time: 14.883 s
File: /home/soramichi/src/anaconda2/lib/python2.7/site-packages/chainer/optimizers/adam.py
Function: update_core_cpu at line 52
Line # Hits Time Per Hit % Time Line Contents
==============================================================
52 @profile
53 def update_core_cpu(self, param):
54 7200 14427 2.0 0.1 grad = param.grad
55 7200 4218 0.6 0.0 if grad is None:
56 return
57 7200 3815 0.5 0.0 hp = self.hyperparam
58 7200 11194 1.6 0.1 m, v = self.state['m'], self.state['v']
59
60 7200 4065881 564.7 27.3 m += (1 - hp.beta1) * (grad - m)
61 7200 3986827 553.7 26.8 v += (1 - hp.beta2) * (grad * grad - v)
62 7200 6796601 944.0 45.7 param.data -= self.lr * m / (numpy.sqrt(v) + hp.eps)
- 3rd epoch
Total time: 22.6429 s
File: /home/soramichi/src/anaconda2/lib/python2.7/site-packages/chainer/optimizers/adam.py
Function: update_core_cpu at line 52
Line # Hits Time Per Hit % Time Line Contents
==============================================================
52 @profile
53 def update_core_cpu(self, param):
54 7200 14359 2.0 0.1 grad = param.grad
55 7200 4162 0.6 0.0 if grad is None:
56 return
57 7200 3933 0.5 0.0 hp = self.hyperparam
58 7200 11482 1.6 0.1 m, v = self.state['m'], self.state['v']
59
60 7200 7869889 1093.0 34.8 m += (1 - hp.beta1) * (grad - m)
61 7200 3928259 545.6 17.3 v += (1 - hp.beta2) * (grad * grad - v)
62 7200 10810785 1501.5 47.7 param.data -= self.lr * m / (numpy.sqrt(v) + hp.eps)
How to reproduce the profiling results:
- Run the mnist example and save a snapshot at the end of every epoch: ./train_mnist.py -g -1 -b 50 -e 3 -f 1
- Put the @profile decorator on the update_core_cpu function in optimizers/adam.py.
- To take a profile of only the target epoch (epoch 1 or 3), add an extension to the mnist example that raises an exception at the end of the target epoch:
# train_mnist.py
...
def die(trainer):
raise ValueError("die")
...
def main():
...
# die after the 1st epoch
trainer.extend(die, trigger=(1, 'epoch'))
...
- Resume the mnist example from the target epoch, using the kernprof command (available after pip install line_profiler):
# dies after the 1st epoch finishes,
# so that the profiling result contains only the 1st epoch
$ kernprof -v -l ./train_mnist.py -g -1 -b 50 -e 3
# resume from the end of 2nd epoch, dies when the 3rd epoch finishes
# (don't forget to change the trigger of the die extension)
$ kernprof -v -l ./train_mnist.py -g -1 -b 50 -e 3 --resume result/snapshot_iter_2400
Top GitHub Comments
@niboshi Thank you for the detailed investigation.
I suspect this post might be relevant. It says that numpy gets slower when the operands are very small numbers (called denormals), which require special handling because they do not fit in the normal floating-point format.
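As a quick sanity check of this hypothesis (a sketch of my own, not taken from the issue), one can time the same numpy arithmetic on normal and subnormal (denormal) float32 operands; on CPUs that handle subnormals via microcode assists, the subnormal case is typically much slower:
# Compare elementwise numpy arithmetic on normal vs. subnormal (denormal) float32 values.
# The absolute numbers depend on the CPU; only the relative slowdown matters here.
import timeit
import numpy

n = 1000000
normal = numpy.full(n, 1e-3, dtype=numpy.float32)
subnormal = numpy.full(n, 1e-40, dtype=numpy.float32)  # below float32's normal range (~1.18e-38)

for name, x in [("normal", normal), ("subnormal", subnormal)]:
    t = timeit.timeit(lambda: x * 0.5 + x, number=100)
    print("%-10s %.3f s" % (name, t))
If the subnormal case is markedly slower, that would be consistent with the hypothesis that some of the arrays touched by the Adam update drift toward very small values as training proceeds.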
I am reopening the issue. It should be closed only after the fix is merged into the code.