
Training slows down dramatically as epochs proceed in CPU

See original GitHub issue

Training speed slows down dramatically as epochs proceed in CPU mode, especially when the batch size is small. In the example below, the 1st epoch took about 30 seconds while the 3rd took about 39 seconds (elapsed_time in the log is cumulative, so the per-epoch times are roughly 30, 37, and 39 seconds).

The same was observed with multiple CPU models (Core i5 6400, Xeon E5-2699v3), multiple Python versions (2.7.13, 3.6.0), and multiple chainer versions (1.23, 2.0.0b1).

~/src/chainer/examples/mnist $ export OMP_NUM_THREADS=1
~/src/chainer/examples/mnist $ ./train_mnist.py -g -1 -b 50 -e 3
GPU: -1
# unit: 1000
# Minibatch-size: 50
# epoch: 5

epoch       main/loss   validation/main/loss  main/accuracy  validation/main/accuracy  elapsed_time
1           0.179801    0.108207              0.94595        0.9655                    30.1475       
2           0.0772594   0.0726388             0.976          0.9757                    67.1415       
3           0.0519136   0.0821774             0.983467       0.9756                    106.503 

The OS is Debian 9 and gcc is 6.3 (which might be too new). The Python runtime and libraries were installed via Anaconda. export OMP_NUM_THREADS=1 limits the number of threads spawned by BLAS to 1; I use this because more than one thread does not improve performance here, as the matrices processed in the mnist example are too small.
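
For reference, the same limit can also be applied from inside a script. A minimal sketch, assuming the usual thread-count variables honored by common BLAS backends (which one actually takes effect depends on the BLAS your numpy build links against), set before numpy is imported:

import os

# Limit the thread pools of common BLAS backends to a single thread.
# Which variable takes effect depends on the BLAS numpy is linked against.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import numpy  # must be imported after the environment is set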

Below are profiling results for update_core_cpu() in optimizers/adam.py, obtained with line_profiler. The Time column shows the total elapsed time spent on each line in microseconds; in the 3rd epoch, lines 60 and 62 took almost twice as long as in the 1st epoch. (A standalone sketch of the profiled update follows the two listings.)

  • 1st epoch
Total time: 14.883 s
File: /home/soramichi/src/anaconda2/lib/python2.7/site-packages/chainer/optimizers/adam.py
Function: update_core_cpu at line 52

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    52                                               @profile
    53                                               def update_core_cpu(self, param):
    54      7200        14427      2.0      0.1          grad = param.grad
    55      7200         4218      0.6      0.0          if grad is None:
    56                                                       return
    57      7200         3815      0.5      0.0          hp = self.hyperparam
    58      7200        11194      1.6      0.1          m, v = self.state['m'], self.state['v']
    59                                           
    60      7200      4065881    564.7     27.3          m += (1 - hp.beta1) * (grad - m)
    61      7200      3986827    553.7     26.8          v += (1 - hp.beta2) * (grad * grad - v)
    62      7200      6796601    944.0     45.7          param.data -= self.lr * m / (numpy.sqrt(v) + hp.eps)
  • 3rd epoch
Total time: 22.6429 s
File: /home/soramichi/src/anaconda2/lib/python2.7/site-packages/chainer/optimizers/adam.py
Function: update_core_cpu at line 52

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    52                                               @profile
    53                                               def update_core_cpu(self, param):
    54      7200        14359      2.0      0.1          grad = param.grad
    55      7200         4162      0.6      0.0          if grad is None:
    56                                                       return
    57      7200         3933      0.5      0.0          hp = self.hyperparam
    58      7200        11482      1.6      0.1          m, v = self.state['m'], self.state['v']
    59                                           
    60      7200      7869889   1093.0     34.8          m += (1 - hp.beta1) * (grad - m)
    61      7200      3928259    545.6     17.3          v += (1 - hp.beta2) * (grad * grad - v)
    62      7200     10810785   1501.5     47.7          param.data -= self.lr * m / (numpy.sqrt(v) + hp.eps)
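
For convenience, here is a minimal standalone sketch of the same three array operations, so they can be timed in isolation. This is not Chainer's actual code path; the array size, dtype, and the standard Adam defaults below are assumptions made only for the benchmark.

import timeit
import numpy

def adam_update_sketch(param, grad, m, v,
                       lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # The same three operations as lines 60-62 of update_core_cpu.
    m += (1 - beta1) * (grad - m)
    v += (1 - beta2) * (grad * grad - v)
    param -= lr * m / (numpy.sqrt(v) + eps)

n = 784 * 1000  # roughly the size of the first mnist weight matrix
param = numpy.random.randn(n).astype(numpy.float32)
grad = numpy.random.randn(n).astype(numpy.float32)
m = numpy.zeros(n, dtype=numpy.float32)
v = numpy.zeros(n, dtype=numpy.float32)

t = timeit.timeit(lambda: adam_update_sketch(param, grad, m, v), number=100)
print("per call: %.3f ms" % (t / 100 * 1e3))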

How to reproduce the profiling results:

  1. Run the mnist example and save a snapshot at the end of every epoch: ./train_mnist.py -g -1 -b 50 -e 3 -f 1.
  2. Add the @profile decorator to the update_core_cpu function in optimizers/adam.py.
  3. To profile only the target epoch (epoch 1 or 3), add an extension to the mnist example that raises an exception at the end of that epoch:
# train_mnist.py
...
def die(trainer):
    raise ValueError("die")
...
def main():
    ...
    # die after the 1st epoch 
    trainer.extend(die, trigger=(1, 'epoch'))
    ...
  4. Resume the mnist example from the target epoch using the kernprof command (available after pip install line_profiler); a programmatic alternative is sketched after the commands below.
# dies after the 1st epoch finishes,
# so that the profiling result contains only the 1st epoch
$ kernprof -v -l ./train_mnist.py -g -1 -b 50 -e 3 

# resume from the end of 2nd epoch, dies when the 3rd epoch finishes
# (don't forget to change the trigger of the die extension)
$  kernprof -v -l ./train_mnist.py -g -1 -b 50 -e 3 --resume result/snapshot_iter_2400
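
As a side note, line_profiler can also be driven programmatically instead of through kernprof and the @profile decorator. A rough sketch, assuming the import path below and that Adam.update_core_cpu is where the method lives in your Chainer version (its owner moved between versions):

from line_profiler import LineProfiler
from chainer.optimizers import adam

profiler = LineProfiler()
# The owner of update_core_cpu differs between Chainer versions
# (optimizer class vs. per-parameter update rule); adjust this lookup.
profiler.add_function(adam.Adam.update_core_cpu)

profiler.enable_by_count()
main()  # the train_mnist main(), with the die extension installed
profiler.disable_by_count()
profiler.print_stats()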

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 20 (13 by maintainers)

Top GitHub Comments

2 reactions
soramichi commented, Apr 14, 2017

@niboshi Thank you for the detailed investigation.

I suspect this post might be relevant. It says that numpy gets slower when the operands are extremely small numbers (called denormals), which require special handling by the CPU because they do not fit the normal floating-point format.
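
For what it's worth, the effect is easy to check with numpy alone. Below is a rough sketch (array size and values are arbitrary, and the slowdown appears on many x86 CPUs only when flush-to-zero/denormals-are-zero is not enabled) that times the same multiplication on normal values, on denormal values, and on denormal values flushed to zero:

import timeit
import numpy

n = 1000000
normal = numpy.full(n, 1e-3, dtype=numpy.float32)
# Anything below numpy.finfo(numpy.float32).tiny (~1.2e-38) is denormal.
denormal = numpy.full(n, 1e-40, dtype=numpy.float32)
flushed = denormal.copy()
flushed[numpy.abs(flushed) < numpy.finfo(numpy.float32).tiny] = 0

for name, a in [("normal", normal), ("denormal", denormal), ("flushed", flushed)]:
    t = timeit.timeit(lambda: a * numpy.float32(1.5), number=200)
    print("%-8s %.3f s" % (name, t))

If denormals are indeed the cause here, flushing m and v to zero once they drop below the smallest normal float32 value would be one possible workaround.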

1 reaction
niboshi commented, Jun 14, 2017

I am reopening the issue. It should be closed only after the fix is merged into the code.

Read more comments on GitHub >

Top Results From Across the Web

Why the training slow down with time if training continuously ...
Why the training slow down with time if training continuously? And Gpu utilization begins to jitter dramatically?

[1906.06669] One Epoch Is All You Need - arXiv
Under one epoch training, no overfitting occurs, and regularization method does nothing but slows down the training.

Why Epochs take longer as learning proceeds?
As the number of epochs increase the error goes down and the neural network has less to learn from the given data. The...

How to Solve Data Loading Bottlenecks in Your Deep ...
Basic operations such as cropping, flipping, and normalizing do not affect training time significantly; therefore, most of the time, you can ...

Distributed training of deep learning models on Azure
When you train deep learning models, an often overlooked aspect is where to store the training data. If the storage is too slow...
