Make chainer faster (for small networks)
For very small networks, most of the time is spent on internal chainer overhead, which could be optimized.
When I time some of my small CPU networks, the majority of the time is actually spent locally in the `__call__` method of functions. This has become worse after the introduction of `local_function_hooks` and after allowing functions to be called on non-Variables.
I have a couple of suggestions for making chainer faster in these settings.
- Cythonize `variable.py`, `function.py`, `flag.py` and `cuda.py`, and make `Variable` and `Function` Cython extension types.
- Avoid calling `int(os.environ.get('CHAINER_TYPE_CHECK', '1')) != 0` every time a new function instance is created.
- Go back to only allowing functions to be called on Variables (the outputs are always Variables anyway).
- Avoid creating an `OrderedDict` (and updating it) if the function doesn't have any function hooks.
- Use the CPython functions `PyTuple_New`, `PyTuple_SET_ITEM` and `Py_INCREF` with for loops in Cython to create new tuples, instead of building list comprehensions and making tuples out of them (in `Variable.__call__` and in `Variable.backward`).
- Checking the device, the flags and the ranks of the inputs, and building a new `in_data` tuple in the `Variable.__call__` method can actually be done in a single for loop with Cython. Currently, 8 loops are performed to do this (`cuda.get_device(*in_data)` counts as 2 and `flag.aggregate_flags([x.volatile for x in inputs])` counts as 3). Likewise, the `self.outputs` and `ret` tuples can be built in a single loop with Cython.
- Make volatile flags integers instead of classes.
- By separating functions from learnables in Functions and Links we paid a performance penalty: the gradients of the learnables are now created anew for every function call (very expensive with cupy), and then accumulated later, instead of being accumulated on the fly in the `Function.backward` methods. It would be cool if we could avoid this for learnables somehow (this is a big issue for RNNs on the GPU).
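The environment-variable suggestion above could be implemented by reading `CHAINER_TYPE_CHECK` once at import time. A minimal sketch (the module-level constant `TYPE_CHECK` and the attribute `type_check_enable` are illustrative names, not necessarily Chainer's actual ones):

```python
import os

# Read the environment variable once, when the module is imported,
# instead of on every Function instantiation.
TYPE_CHECK = int(os.environ.get('CHAINER_TYPE_CHECK', '1')) != 0


class Function:
    def __init__(self):
        # Reuse the cached module-level constant: no os.environ lookup
        # and no int() parsing per instance.
        self.type_check_enable = TYPE_CHECK
```

The trade-off is that changing the environment variable after the module has been imported no longer has any effect, which is usually acceptable for a debug/type-check switch.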
I think that backward-compatibility could be sacrificed for speed. Chainer is still young. If chainer is going to be able to compete with other frameworks such as Tensorflow, Theano and Torch we need more speed in my opinion.
Btw, if you cythonize `variable.py` you need to change `del gxs` to `gxs = None` in the `Variable.backward` method (this is also better practice in my opinion).
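To illustrate the point: older Cython versions reject `del` on local variables, so the reference has to be dropped by rebinding the name instead. A minimal sketch (illustrative code, not Chainer's actual `backward` method):

```python
def backward_sketch(grads):
    """Stand-in for a backward loop that releases intermediate results."""
    total = 0.0
    for g in grads:
        gxs = [x * 2.0 for x in g]  # stand-in for a backward computation
        total += sum(gxs)
        # `del gxs` is not supported for locals in (older) Cython modules;
        # rebinding the name to None drops the reference just as well.
        gxs = None
    return total
```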
Issue Analytics
- Created 7 years ago
- Comments: 10 (4 by maintainers)
Top GitHub Comments
In the next couple of weeks I will try to add these changes and do some testing. I will also add the possibility to call `Function.forward` with `output_buffer` arrays and `Function.backward` with `grad_inputs_buffer` arrays. Using preallocated buffers should speed up GPU computations a lot. A buffer should of course only be written to once per forward/backward call.
@honnibal:
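The proposed buffer-passing API could look roughly like this. A minimal sketch under stated assumptions: `forward` and `output_buffer` here are hypothetical, and NumPy stands in for CuPy to keep the example runnable on CPU:

```python
import numpy as np

def forward(x, output_buffer=None):
    """Double the input, writing into a caller-provided buffer if given."""
    if output_buffer is None:
        # Fall back to a fresh allocation when no buffer is supplied.
        output_buffer = np.empty_like(x)
    # Write the result in place; no temporary array is created.
    np.multiply(x, 2.0, out=output_buffer)
    return output_buffer

# Allocate once, reuse across calls (e.g. across RNN time steps).
buf = np.empty(3, dtype=np.float64)
y = forward(np.array([1.0, 2.0, 3.0]), output_buffer=buf)
# y is the same object as buf; this call performed no new allocation.
```

On the GPU the saving is larger, since CuPy allocations can be far more expensive than the kernel itself for small arrays.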