
Make chainer faster (for small networks)

See original GitHub issue

For very small networks, most of the time is spent on internal Chainer overhead that could be optimized.

When I time some of my small CPU networks, the majority of the time is actually spent in the __call__ method of functions. This has become worse since the introduction of local_function_hooks and since functions were allowed to be called on non-Variables.

I have a couple of suggestions for making chainer faster in these settings.

  • cythonize variable.py, function.py, flag.py and cuda.py, and make Variable and Function Cython extension types.
  • avoid calling int(os.environ.get('CHAINER_TYPE_CHECK', '1')) != 0 every time a new function instance is created (a sketch of caching this value follows below the list).
  • go back to only allowing functions to be called on Variables (the outputs are always Variables anyway).
  • avoid creating an OrderedDict (and updating it) if the function doesn't have any function hooks.
  • use the CPython C-API functions PyTuple_New, PyTuple_SET_ITEM and Py_INCREF with for-loops in Cython for creating new tuples, instead of building list comprehensions and making tuples of these (in Function.__call__ and in Variable.backward).
    • Checking the device, the flags and the ranks of the inputs and building a new in_data tuple in the Function.__call__ method can actually be done in a single for-loop with Cython. Currently, 8 loops are performed in order to do this (cuda.get_device(*in_data) counts as 2 and flag.aggregate_flags([x.volatile for x in inputs]) counts as 3). Likewise, the self.outputs and ret tuples can be built in a single loop with Cython (a plain-Python sketch of the single-pass idea follows below the list).
  • Make volatile flags integers instead of classes.
  • By separating functions from learnables into Functions and Links we paid a performance penalty, because the gradients of the learnables are now newly created for every function call (very expensive with CuPy) and only accumulated later, instead of being accumulated on the fly in the Function.backward methods. It would be cool if we could avoid this for learnables somehow (this is a big issue for RNNs on the GPU).
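
Here is a minimal sketch of the caching idea from the second bullet: read the environment variable once at import time instead of re-parsing it on every Function construction. The names are illustrative, not Chainer's actual internals.

    import os

    # Read the switch once, at module import time (value taken from the
    # CHAINER_TYPE_CHECK variable mentioned above).
    _TYPE_CHECK_ENABLED = int(os.environ.get('CHAINER_TYPE_CHECK', '1')) != 0


    class Function(object):
        def __init__(self):
            # Reuse the module-level constant instead of hitting os.environ
            # for every new function instance.
            self.type_check_enable = _TYPE_CHECK_ENABLED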
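
And a plain-Python sketch of the single-pass bookkeeping described in the Function.__call__ sub-bullet. The actual proposal is to write this loop in Cython with PyTuple_New/PyTuple_SET_ITEM; the volatile-flag aggregation is reduced to a boolean here purely for illustration.

    def gather_inputs(inputs):
        """Collect in_data, the aggregated volatile flag and the maximum rank
        in one pass instead of several separate comprehensions and helper calls.

        Illustrative only: Chainer's real flag aggregation handles ON/OFF/AUTO
        flags, and the device check is omitted entirely.
        """
        in_data = [None] * len(inputs)
        volatile = False
        rank = 0
        for i, x in enumerate(inputs):
            in_data[i] = x.data                       # extract the underlying array
            volatile = volatile or bool(x.volatile)   # simplified flag aggregation
            if x.rank > rank:                         # track the maximum rank
                rank = x.rank
        return tuple(in_data), volatile, rank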

I think that backward compatibility could be sacrificed for speed; Chainer is still young. If Chainer is going to be able to compete with other frameworks such as TensorFlow, Theano and Torch, we need more speed in my opinion.

By the way, if you cythonize variable.py you need to change del gxs to gxs = None in the Variable.backward method (this is also better practice in my opinion).
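
For concreteness, the substitution looks like this (the surrounding backward code is omitted):

    # Inside Variable.backward, after the gradients in gxs have been consumed:
    # del gxs        # fine in CPython, but Cython does not support del on locals
    gxs = None       # same effect: the reference to the gradient tuple is dropped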

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

3 reactions
bordingj commented, Oct 17, 2016

In the next couple of weeks I will try to add these changes and do some testing. I will also add the possibility to call Function.forward with output_buffer arrays and Function.backward with grad_inputs_buffer arrays. Using preallocated buffers should speed up GPU computations a lot. A buffer should of course only be written to once per forward/backward call.
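
A rough sketch of the preallocated-buffer idea, in plain NumPy for brevity; forward_into and the buffer names are invented for this sketch and are not an actual Chainer API.

    import numpy as np

    def forward_into(x, W, out):
        """Write the result into a caller-provided buffer instead of
        allocating a fresh output array on every call."""
        np.dot(x, W, out=out)
        return out

    x = np.random.rand(32, 128).astype(np.float32)
    W = np.random.rand(128, 256).astype(np.float32)
    out = np.empty((32, 256), dtype=np.float32)   # allocated once, reused each iteration

    for _ in range(100):
        forward_into(x, W, out)                   # no per-call output allocation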

2 reactions
jekbradbury commented, Nov 19, 2016

@honnibal:

  1. Make sure to turn off type checking (the core devs haven't yet merged a fix that would make type checking much faster; until then it should only be on for debugging). See the first snippet after this list.
  2. A CPU profile will be misleading, because time will be spent in whatever calls happen to block on GPU data from asynchronously spawned kernels that is not yet available. That is, if you create some CuPy arrays, call dot on them, and then print() the first element of the result, a CPU profile will say your script spent 99% of its time in something like ndarray.__getitem__. (See the second snippet after this list.)
  3. This is only true for “big networks” like an ordinary LSTM of decent batch and hidden size. This thread is about “small networks” where each individual GPU kernel is so fast that the Python runtime can’t keep up and spawn the next one quickly enough to keep the GPU utilization at 100%. While there are still optimizations possible at full GPU utilization, they won’t look like the options proposed in this thread, which are targeted at situations and networks that can’t currently achieve 100% GPU utilization with Chainer.
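
For the first point, type checking can be switched off via the CHAINER_TYPE_CHECK environment variable that the issue above already refers to; a minimal sketch (setting it before importing chainer, to be safe, since some versions read the value at import time):

    import os

    # Turn off Chainer's type checking; set the variable before importing chainer
    # to be safe, since the value may be read when the module initializes.
    os.environ['CHAINER_TYPE_CHECK'] = '0'

    import chainer  # noqa: E402  (deliberately imported after setting the variable)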
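
And a small illustration of the second point: the kernel launch returns almost immediately, and the wait happens wherever the result is first needed, which is what skews a CPU profile; synchronizing explicitly gives honest timings. This assumes CuPy and a GPU are available.

    import time
    import cupy as cp

    a = cp.random.rand(1024, 1024).astype(cp.float32)
    b = cp.random.rand(1024, 1024).astype(cp.float32)

    start = time.perf_counter()
    c = cp.dot(a, b)                 # kernel is queued asynchronously; returns right away
    queued = time.perf_counter()

    cp.cuda.Device().synchronize()   # wait until the GPU has actually finished
    done = time.perf_counter()

    print('queue the kernel : %.6f s' % (queued - start))
    print('wait for the GPU : %.6f s' % (done - queued))

    # Without the explicit synchronize, the wait would instead happen inside the
    # first call that touches the data, e.g. float(c[0, 0]) or print(c[0, 0]),
    # which is why a naive CPU profile blames something like ndarray.__getitem__.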