Fuse array elementwise operations at graph build time
Linear array expressions in dask array incur overhead that could be optimized away, as noted in #2034 and #2497:
- Each operation (e.g. `__add__`) adds another set of tasks to the graph. This adds overhead both in time (the cost of building the graph) and space (a larger dictionary representing the graph); the growth is illustrated in the sketch just below.
- The larger graph increases the overhead of our optimization passes for fusing tasks. It'd be better to generate a better graph up front rather than relying on optimization passes to detect and fuse elementwise operations.
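For a concrete sense of that growth, here is a minimal illustration (shapes and chunk sizes are arbitrary, and exact task counts can vary across dask versions):

```python
import dask.array as da

x = da.ones(1000, chunks=100)  # 10 blocks -> 10 tasks just to create x
y = (x + 1) * 2 - 3            # each operator adds another 10-task layer

print(len(x.dask))  # 10
print(len(y.dask))  # 40: three extra layers, one per operator
```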
Here we compute the same elementwise operation in two ways:

- First using operators directly on dask arrays. This generates individual tasks for each operator, which are then fused together during our optimization passes.
- Second using `map_blocks` with a function representing the whole elementwise operation. This generates a single task per block (a much smaller graph) and requires no optimization.
```python
In [1]: import numpy as np, dask.array as da

In [2]: x = np.random.normal(size=int(2e8))

In [3]: y = np.random.normal(size=int(2e8))

In [4]: dx = da.from_array(x, chunks=int(1e6))

In [5]: dy = da.from_array(y, chunks=int(1e6))

In [6]: def f(x, y):
   ...:     return (0.5 - x)**2 + 0.8 * (y - x**2)**2
   ...:

In [7]: %%time
   ...: o = f(dx, dy).max()
   ...: print("%d tasks" % len(o.dask))
   ...: _ = o.compute()
   ...:
2070 tasks
CPU times: user 3.34 s, sys: 1.47 s, total: 4.81 s
Wall time: 1.76 s

In [8]: %%time
   ...: o = da.map_blocks(f, dx, dy).max()
   ...: print("%d tasks" % len(o.dask))
   ...: _ = o.compute()
   ...:
870 tasks
CPU times: user 3.06 s, sys: 1.36 s, total: 4.42 s
Wall time: 1.53 s
```
From the above you can see that using `map_blocks` with a single function results in a smaller graph and faster execution. For larger arrays and array expressions the benefits are even larger.
It’d be nice to be able to write code like the first example (using operators) and have the graph be equivalent to the second (single mapped function). In #1946 a solution using task fusion was proposed. This is more general, but wouldn’t reduce the size of the generated graph until fusion time, which still incurs the cost of generating and optimizing the graph.
Instead we propose avoiding generating the large graphs in the first place by adding a simple expression system to `dask.array`. This would only cover the linear operations generated by `da.atop` (which backs a good number of functions/operators). `atop` would be changed to return a special object that encodes the meaning of the `atop` operation but doesn't generate the graph yet. Further calls to `atop` using that output would continue to build up the expression until an incompatible method was called. The expression structure could be stored either in a custom `MutableMapping` (as suggested in #1763) or in a thin subclass of `da.Array`.
Example
```python
x = some_dask_array()

# No graph generated yet, just expression encoding `x + 1`
x2 = x + 1
x3 = x2 + x
x4 = da.sin(x3)

# `sum` isn't a linear expression, so the graph for `x4` is generated
# and (potentially) cached in `x4`. This graph maps the function
# `sin((x + 1) + x)` across all blocks.
x5 = x4.sum()
```
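A minimal sketch of what such an expression system might look like. All names here (`Expr`, `deferred`, `materialize`) are hypothetical, not dask API; a real implementation would live inside `atop`/`da.Array`:

```python
import operator
import dask.array as da

class Expr:
    """An unevaluated elementwise expression over dask arrays."""
    def __init__(self, op, *args):
        self.op, self.args = op, args

    # Linear/elementwise operators keep building the expression tree
    # instead of emitting tasks.
    def __add__(self, other):
        return Expr(operator.add, self, other)

    def __mul__(self, other):
        return Expr(operator.mul, self, other)

    # A non-elementwise method forces graph generation.
    def sum(self):
        return materialize(self).sum()

def deferred(arr):
    """Wrap a dask array so operators build expressions, not graphs."""
    return Expr(lambda b: b, arr)

def materialize(expr):
    """Generate the graph: one fused task per block via map_blocks."""
    leaves = []

    def collect(e):
        if isinstance(e, Expr):
            for a in e.args:
                collect(a)
        elif isinstance(e, da.Array) and not any(a is e for a in leaves):
            leaves.append(e)

    collect(expr)

    def fused(*blocks):
        # Evaluate the expression tree on the concrete blocks.
        def ev(e):
            if isinstance(e, Expr):
                return e.op(*map(ev, e.args))
            if isinstance(e, da.Array):
                return blocks[[a is e for a in leaves].index(True)]
            return e  # plain scalar constant

        return ev(expr)

    return da.map_blocks(fused, *leaves)
```

Here `deferred(x) + 1` builds only an `Expr` tree; the single fused `map_blocks` call happens when `sum` (a non-linear operation) forces materialization.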
One further benefit of the simple expression system is that we'd have enough information to optionally optimize the evaluation of tasks, either by interpreting the expressions to use `out=` keywords to the numpy methods (reducing memory overhead), or by calling `nb.jit` on the expressions. If done, this should be either explicit or configurable, but it would yield further performance improvements.
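As an illustration of the `out=` idea, here is a hand-written sketch of what an interpreter could emit for `f` above, keeping the number of temporaries fixed rather than allocating one per operator (`f_inplace` is a hypothetical name). The `nb.jit` path is shown in the session below.

```python
import numpy as np

def f_inplace(x, y):
    # out = (0.5 - x)**2, reusing one scratch buffer
    out = np.subtract(0.5, x)
    np.power(out, 2, out=out)
    # tmp = 0.8 * (y - x**2)**2, reusing a second buffer
    tmp = np.power(x, 2)
    np.subtract(y, tmp, out=tmp)
    np.power(tmp, 2, out=tmp)
    np.multiply(tmp, 0.8, out=tmp)
    # accumulate into out and return
    np.add(out, tmp, out=out)
    return out
```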
```python
In [9]: import numba as nb

In [10]: f2 = nb.jit(f, nopython=True, nogil=True)

In [11]: f2.compile('f8,f8')  # precompile
Out[11]: <function __main__._Closure.f>

In [12]: %%time
    ...: o = da.map_blocks(f2, dx, dy).max()
    ...: print("%d tasks" % len(o.dask))
    ...: _ = o.compute()
    ...:
870 tasks
CPU times: user 2.1 s, sys: 285 ms, total: 2.39 s
Wall time: 613 ms
```
As a further extension, this could potentially solve #2431, as `__getitem__` on the thin subclass of `da.Array` (`da.AtopArray`?) could be overridden to push slices back before the elementwise operations. This would be more robust than relying on an optimization pass.
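Continuing the hypothetical `Expr` sketch above, that pushdown could look roughly like this (`pushdown_getitem` is an illustrative name; the rewrite is only valid when no broadcasting is involved, since elementwise operations then commute with slicing):

```python
def pushdown_getitem(expr, index):
    """Rewrite expr[index] as the same expression over sliced leaves."""
    if isinstance(expr, Expr):
        # Slice the inputs rather than the computed result
        return Expr(expr.op, *(pushdown_getitem(a, index) for a in expr.args))
    if isinstance(expr, da.Array):
        return expr[index]
    return expr  # scalar constants are unaffected by slicing
```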
Top GitHub Comments
This was done in #3998
Another thing that could be really interesting along these lines would be (optionally) fusing the operations themselves with something like `numexpr`, though this could probably be dealt with in a second stage.
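For example, a sketch of that idea using today's `map_blocks` (`f_ne` is a hypothetical name; `dx` and `dy` as defined earlier):

```python
import numexpr as ne
import dask.array as da

def f_ne(x, y):
    # numexpr compiles the whole expression and evaluates it in one
    # pass per block, without materializing intermediate temporaries
    return ne.evaluate("(0.5 - x)**2 + 0.8 * (y - x**2)**2")

o = da.map_blocks(f_ne, dx, dy).max()
```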