
Make chainer faster (for small networks)

See original GitHub issue

For very small networks, most of the time is spent on internal Chainer overhead that could be optimized.

When I time some of my small CPU networks, the majority of the time is actually spent in the __call__ method of functions. This has become worse since the introduction of local_function_hooks and since functions were allowed to be called on non-Variables.

I have a couple of suggestions for making chainer faster in these settings.

  • cythonize variable.py, function.py, flag.py and cuda.py, and make Variable and Function Cython extension types.
  • avoid calling int(os.environ.get('CHAINER_TYPE_CHECK', '1')) != 0 every time a new function instance is created (a sketch of caching this value follows below the list).
  • go back to only allowing functions to be called on Variables (the outputs are always Variables anyway).
  • avoid creating an OrderedDict (and updating it) if the function doesn't have any function hooks.
  • use the CPython C-API functions PyTuple_New, PyTuple_SET_ITEM and Py_INCREF with for-loops in Cython for creating new tuples, instead of building list comprehensions and making tuples of these (in Function.__call__ and in Variable.backward).
    • Checking the device, the flags and the ranks of the inputs and building a new in_data tuple in the Function.__call__ method can actually be done in a single for-loop with Cython. Currently, 8 loops are performed in order to do this (cuda.get_device(*in_data) counts as 2 and flag.aggregate_flags([x.volatile for x in inputs]) counts as 3). Likewise, the self.outputs and ret tuples can be built in a single loop with Cython (a plain-Python sketch of the single-pass idea follows below the list).
  • Make volatile flags integers instead of classes.
  • By separating functions from learnables into Functions and Links we paid a performance penalty, because the gradients of the learnables are now newly created for every function call (very expensive with CuPy) and only accumulated later, instead of being accumulated on the fly in the Function.backward methods. It would be cool if we could avoid this for learnables somehow (this is a big issue for RNNs on the GPU).
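
Here is a minimal sketch of the caching idea from the second bullet: read the environment variable once at import time instead of re-parsing it on every Function construction. The names are illustrative, not Chainer's actual internals.

    import os

    # Read the switch once, at module import time (value taken from the
    # CHAINER_TYPE_CHECK variable mentioned above).
    _TYPE_CHECK_ENABLED = int(os.environ.get('CHAINER_TYPE_CHECK', '1')) != 0


    class Function(object):
        def __init__(self):
            # Reuse the module-level constant instead of hitting os.environ
            # for every new function instance.
            self.type_check_enable = _TYPE_CHECK_ENABLED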
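
And a plain-Python sketch of the single-pass bookkeeping described in the Function.__call__ sub-bullet. The actual proposal is to write this loop in Cython with PyTuple_New/PyTuple_SET_ITEM; the volatile-flag aggregation is reduced to a boolean here purely for illustration.

    def gather_inputs(inputs):
        """Collect in_data, the aggregated volatile flag and the maximum rank
        in one pass instead of several separate comprehensions and helper calls.

        Illustrative only: Chainer's real flag aggregation handles ON/OFF/AUTO
        flags, and the device check is omitted entirely.
        """
        in_data = [None] * len(inputs)
        volatile = False
        rank = 0
        for i, x in enumerate(inputs):
            in_data[i] = x.data                       # extract the underlying array
            volatile = volatile or bool(x.volatile)   # simplified flag aggregation
            if x.rank > rank:                         # track the maximum rank
                rank = x.rank
        return tuple(in_data), volatile, rank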

I think that backward compatibility could be sacrificed for speed; Chainer is still young. If Chainer is going to be able to compete with other frameworks such as TensorFlow, Theano and Torch, we need more speed in my opinion.

By the way, if you cythonize variable.py you need to change del gxs to gxs = None in the Variable.backward method (this is also better practice in my opinion).
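
For concreteness, the substitution looks like this (the surrounding backward code is omitted):

    # Inside Variable.backward, after the gradients in gxs have been consumed:
    # del gxs        # fine in CPython, but Cython does not support del on locals
    gxs = None       # same effect: the reference to the gradient tuple is dropped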

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

3 reactions
bordingj commented, Oct 17, 2016

In the next couple of weeks I will try to add these changes and do some testing. I will also add the possibility to call Function.forward with output_buffer arrays and Function.backward with grad_inputs_buffer arrays. Using preallocated buffers should speed up GPU computations a lot. A buffer should of course only be written to once per forward/backward call.
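
A rough sketch of the preallocated-buffer idea, in plain NumPy for brevity; forward_into and the buffer names are invented for this sketch and are not an actual Chainer API.

    import numpy as np

    def forward_into(x, W, out):
        """Write the result into a caller-provided buffer instead of
        allocating a fresh output array on every call."""
        np.dot(x, W, out=out)
        return out

    x = np.random.rand(32, 128).astype(np.float32)
    W = np.random.rand(128, 256).astype(np.float32)
    out = np.empty((32, 256), dtype=np.float32)   # allocated once, reused each iteration

    for _ in range(100):
        forward_into(x, W, out)                   # no per-call output allocation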

2 reactions
jekbradbury commented, Nov 19, 2016

@honnibal:

  1. Make sure to turn off type checking (the core devs haven't yet merged a fix that would make type checking much faster; until then it should only be on for debugging). See the first snippet after this list.
  2. A CPU profile will be misleading, because time will be spent in whatever calls happen to block on GPU data from asynchronously spawned kernels that is not yet available. That is, if you create some CuPy arrays, call dot on them, and then print() the first element of the result, a CPU profile will say your script spent 99% of its time in something like ndarray.__getitem__. (See the second snippet after this list.)
  3. This is only true for “big networks” like an ordinary LSTM of decent batch and hidden size. This thread is about “small networks” where each individual GPU kernel is so fast that the Python runtime can’t keep up and spawn the next one quickly enough to keep the GPU utilization at 100%. While there are still optimizations possible at full GPU utilization, they won’t look like the options proposed in this thread, which are targeted at situations and networks that can’t currently achieve 100% GPU utilization with Chainer.
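
For the first point, type checking can be switched off via the CHAINER_TYPE_CHECK environment variable that the issue above already refers to; a minimal sketch (setting it before importing chainer, to be safe, since some versions read the value at import time):

    import os

    # Turn off Chainer's type checking; set the variable before importing chainer
    # to be safe, since the value may be read when the module initializes.
    os.environ['CHAINER_TYPE_CHECK'] = '0'

    import chainer  # noqa: E402  (deliberately imported after setting the variable)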
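
And a small illustration of the second point: the kernel launch returns almost immediately, and the wait happens wherever the result is first needed, which is what skews a CPU profile; synchronizing explicitly gives honest timings. This assumes CuPy and a GPU are available.

    import time
    import cupy as cp

    a = cp.random.rand(1024, 1024).astype(cp.float32)
    b = cp.random.rand(1024, 1024).astype(cp.float32)

    start = time.perf_counter()
    c = cp.dot(a, b)                 # kernel is queued asynchronously; returns right away
    queued = time.perf_counter()

    cp.cuda.Device().synchronize()   # wait until the GPU has actually finished
    done = time.perf_counter()

    print('queue the kernel : %.6f s' % (queued - start))
    print('wait for the GPU : %.6f s' % (done - queued))

    # Without the explicit synchronize, the wait would instead happen inside the
    # first call that touches the data, e.g. float(c[0, 0]) or print(c[0, 0]),
    # which is why a naive CPU profile blames something like ndarray.__getitem__.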