Optimize tensordot with rechunk
Dask.array tensordot operations can be made significantly faster by doing an initial rechunk pass, making all axes over which we intend to contract single-chunked.
For example, if we have the following block-chunked arrays:
```python
import dask.array as da

x = da.random.random((500, 500, 500), chunks=(50, 50, 50))
y = da.random.random((500, 100), chunks=(50, 50))
```
And we want to contract over the 1st axis of `x` and the 0th axis of `y`, respectively:

```python
da.tensordot(x, y, axes=[1, 0]).visualize()
```
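To see why the rechunk pass helps, consider the naive graph: with these chunks each output block must sum ten partial products along the contracted axis. A quick check, as a sketch using the arrays defined above:

```python
result = da.tensordot(x, y, axes=[1, 0])
print(result.numblocks)  # (10, 10, 2): 200 output blocks, each of which
                         # sums 10 partial products along the contracted axis
```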
Then we may want to preemptively rechunk so that those axes have only one chunk as follows:
```python
# Rechunk the contracted axes (axis 1 of x, axis 0 of y) into single chunks,
# leaving the other axes unchanged (the dict keys are axis numbers):
x2 = x.rechunk({1: 500}).persist()
y2 = y.rechunk({0: 500}).persist()
```
We may also want to shrink the chunks along the other axes while we expand the contracted ones, both to avoid producing input chunks that are too large and to keep the resulting tensordot output chunks from becoming too large. In this case, though, the chunks are small enough that even a 10x increase in size is acceptable, so we leave the other dimensions as-is:
```python
x2 = x.rechunk((50, 500, 50)).persist()
y2 = y.rechunk((500, 50)).persist()
```
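To make the 10x figure concrete (assuming float64 data, 8 bytes per element): the original chunks of `x` hold 50 * 50 * 50 elements, about 1 MB each, while the rechunked ones hold 50 * 500 * 50 elements, about 10 MB each. We can verify this on the arrays themselves:

```python
print(x.chunksize, x2.chunksize)              # (50, 50, 50) (50, 500, 50)
mb = 50 * 500 * 50 * x2.dtype.itemsize / 1e6
print(mb, "MB per chunk of x2")               # 10.0 MB per chunk of x2
```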
This does incur some communication costs up front, but it will generally save us more communication down the line.
So I think the question here is the following:
Given the chunks of both arrays and the `axes=` argument, how should we rechunk these arrays prior to the normal tensordot call? The rechunking should both expand the chunks along the contracted axes to their full extent and possibly shrink the chunks along the other dimensions, based on the expected nbytes of the output of the tensordot call.
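One possible shape for an answer, as a minimal sketch: a hypothetical helper (not part of dask) that fully expands the contracted axes and then repeatedly halves the widest free axis until each input chunk fits a byte budget. It only budgets the input chunks, not the expected output nbytes discussed above:

```python
import math
import dask.array as da

def rechunk_for_tensordot(x, y, axes, target_bytes=100e6):
    """Illustrative heuristic: single-chunk the contracted axes,
    then shrink free axes until one chunk fits under target_bytes."""
    x_axes, y_axes = axes
    x_axes = (x_axes,) if isinstance(x_axes, int) else tuple(x_axes)
    y_axes = (y_axes,) if isinstance(y_axes, int) else tuple(y_axes)

    def plan(arr, contracted):
        # Expand contracted axes to single chunks; keep current sizes elsewhere.
        chunks = {i: (arr.shape[i] if i in contracted else max(c))
                  for i, c in enumerate(arr.chunks)}
        # Halve the widest free axis until a single chunk fits the budget.
        while math.prod(chunks.values()) * arr.dtype.itemsize > target_bytes:
            free = [i for i in chunks if i not in contracted and chunks[i] > 1]
            if not free:
                break
            widest = max(free, key=chunks.get)
            chunks[widest] = max(1, chunks[widest] // 2)
        return arr.rechunk(chunks)

    return plan(x, x_axes), plan(y, y_axes)

# Usage with the example arrays from above:
x2, y2 = rechunk_for_tensordot(x, y, axes=[1, 0])
result = da.tensordot(x2, y2, axes=[1, 0])
```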
Snagged this trick from this talk: https://youtu.be/dcT6c-PrloE?t=1584
cc @jcrist and @shoyer who might find this interesting. cc @pitrou who did the rechunk logic and might be able to recommend something.
Top GitHub Comments
You can write/implement any `einsum` as a batch of independent GEMMs, so it should be able to benefit from the same fast kernels that `tensordot` does. I also agree that `tensordot` is still worth having for those cases where you don't need the full flexibility of an `einsum`. Under the covers it could simply call `einsum` though.

I think `einsum` is an operation where dask could really shine as a distributed scheduler, because it is an example of a powerful tool where the distribution and scheduling can make a big difference in performance.

Thank you for sharing these diagrams @GenevieveBuckley. @tomwhite, do these match your operational experience?
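To make the point about expressing `tensordot` via `einsum` concrete, a minimal sketch using the example arrays from above (dask provides `da.einsum`, which follows the `np.einsum` subscript notation):

```python
import dask.array as da

x = da.random.random((500, 500, 500), chunks=(50, 50, 50))
y = da.random.random((500, 100), chunks=(50, 50))

# Contracting axis 1 of x with axis 0 of y, written both ways:
a = da.tensordot(x, y, axes=[1, 0])
b = da.einsum("ijk,jl->ikl", x, y)
assert a.shape == b.shape == (500, 500, 100)
```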