
[RELAY][PASS] Make Fusor to handle split gracefully


Hi folks,

A common pattern in LSTM/GRU-style cells is a structure like (for simplicity):

        import tvm
        from tvm import relay

        rnn_dim = 10
        X = relay.var("X", shape=(1, rnn_dim))
        W = relay.var("y", shape=(3 * rnn_dim, rnn_dim))  # weights for all 3 gates, stacked along axis 0
        matmul = relay.nn.dense(X, W)                     # shape (1, 3 * rnn_dim)
        splitted = relay.split(matmul, indices_or_sections=3, axis=1)
        out = relay.sigmoid(splitted[0]) + relay.tanh(splitted[1]) * relay.exp(splitted[2])

Normally when implementing this in Relay, we’d expect graph_fuse to fuse this entire sequence (matmul + split + sigmoid/tanh/exp/add/mul) into a single function, as that’s an entirely reasonable expectation and generates the highest-performance code. That is, we expect:

fn (%X: Tensor[(1, 10), float32],
    %y: Tensor[(30, 10), float32])
    -> Tensor[(1, 10), float32] {
  %0 = nn.dense(%X, %y, units=None)
  %1 = split(%0, indices_or_sections=int64(3), axis=1)
  %2 = %1.0
  %3 = sigmoid(%2)
  %4 = %1.1
  %5 = tanh(%4)
  %6 = %1.2
  %7 = exp(%6)
  %8 = multiply(%5, %7)
  %9 = add(%3, %8)
  %9
}

Instead, Relay generates something like:

fn (%X: Tensor[(1, 10), float32],
    %y: Tensor[(30, 10), float32])
    -> Tensor[(1, 10), float32] {
  %0 = fn(%p0: Tensor[(1, 10), float32],
          %p1: Tensor[(30, 10), float32])
          -> Tensor[(1, 30), float32] {
    %1 = nn.dense(%p0, %p1, units=None) # ty=Tensor[(1, 30), float32]
    %1
  }
  %2 = %0(%X, %y) # ty=Tensor[(1, 30), float32]
  %3 = fn(%p01: Tensor[(1, 30), float32])
          -> Tuple[Tensor[(1, 10), float32], Tensor[(1, 10), float32], Tensor[(1, 10), float32]] {
    %4 = split(%p01, indices_or_sections=int64(3), axis=1) # ty=Tuple[Tensor[(1, 10), float32], Tensor[(1, 10), float32], Tensor[(1, 10), float32]]
    %4
  }
  %5 = %3(%2) # ty=Tuple[Tensor[(1, 10), float32], Tensor[(1, 10), float32], Tensor[(1, 10), float32]]
  %6 = %5.0
  %7 = %5.1
  %8 = %5.2
  %9 = fn(%p02: Tensor[(1, 10), float32],
          %p11: Tensor[(1, 10), float32],
          %p2: Tensor[(1, 10), float32])
          -> Tensor[(1, 10), float32] {
    %10 = sigmoid(%p02) # ty=Tensor[(1, 10), float32]
    %11 = tanh(%p11) # ty=Tensor[(1, 10), float32]
    %12 = exp(%p2) # ty=Tensor[(1, 10), float32]
    %13 = multiply(%11, %12) # ty=Tensor[(1, 10), float32]
    %14 = add(%10, %13) # ty=Tensor[(1, 10), float32]
    %14
  }
  %15 = %9(%6, %7, %8) # ty=Tensor[(1, 10), float32]
  %15
}
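
The output above can be reproduced with something along these lines. This is a minimal sketch, assuming a TVM build where the fusion pass is exposed as relay.transform.FuseOps and run through the standard pass infrastructure:

        import tvm
        from tvm import relay

        # Same graph as in the snippet at the top of the issue.
        rnn_dim = 10
        X = relay.var("X", shape=(1, rnn_dim))
        W = relay.var("y", shape=(3 * rnn_dim, rnn_dim))
        matmul = relay.nn.dense(X, W)
        splitted = relay.split(matmul, indices_or_sections=3, axis=1)
        out = relay.sigmoid(splitted[0]) + relay.tanh(splitted[1]) * relay.exp(splitted[2])

        # Wrap the expression in a function and module, run type inference
        # followed by operator fusion, and print the fused IR.
        func = relay.Function(relay.analysis.free_vars(out), out)
        mod = tvm.IRModule.from_expr(func)
        seq = tvm.transform.Sequential(
            [relay.transform.InferType(), relay.transform.FuseOps(fuse_opt_level=2)]
        )
        with tvm.transform.PassContext(opt_level=3):
            mod = seq(mod)
        print(mod)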

Of course it would be possible to implement a “GateComputation” op or similar, which is internally just (split + pointwise functions), but it would be more elegant to avoid that if possible.

I’m not fluent in the Relay GraphFuser code, but I was hoping someone (@jroesch?) knows off the top of their head what needs to be modified inside the fuser, so that I or someone else can do the implementation work.

cc @jroesch, @yidawang, @tqchen

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

1 reaction
jroesch commented, Mar 26, 2019

@masahi thanks for taking point on this 👍

1 reaction
ajtulloch commented, Mar 26, 2019

FWIW, I realized this also introduces some interesting new ideas for scheduling. Normally, in our existing frameworks, we do things like batching the entire gate computation into a single X * k * D GEMM (k = 3 for GRU, 4 for LSTM, etc.).

This has the problem that, once you can fuse with split, computing the gate values requires reading D-strided locations in the GEMM output, so in practice you end up fully realizing that intermediate computation. It may be more natural to instead reorder the single GEMM into vector-width-blocked variants (i.e. instead of a (kD, D) matrix you’d reorder it into a ((D // V) x k x V, D) matrix) and then realize it at (k, V)-sized locations. Most existing frameworks I know of don’t take advantage of this, but conceptually it has the potential to be quite useful.
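
To make the reordering concrete, here is a minimal NumPy sketch (an illustration under the assumptions of k gates, hidden dimension D, and a vector width V that divides D): a (k * D, D) gate weight matrix is interleaved so that each (k, V) chunk of the GEMM output holds all k gate pre-activations for one V-wide block of hidden units.

        import numpy as np

        k, D, V = 3, 8, 4  # 3 gates (GRU-style), hidden dim 8, vector width 4
        W = np.arange(k * D * D, dtype=np.float32).reshape(k * D, D)
        x = np.random.rand(D).astype(np.float32)

        # Standard layout: gate g occupies rows [g*D, (g+1)*D) of W, so its
        # pre-activations sit at D-strided locations in the output.
        y_ref = W @ x  # shape (k*D,)

        # Blocked layout: view W as (k, D//V, V, D), swap the gate and block
        # axes, then flatten back to ((D//V) * k * V, D).
        W_blocked = W.reshape(k, D // V, V, D).transpose(1, 0, 2, 3).reshape(-1, D)
        y_blocked = W_blocked @ x  # shape ((D//V) * k * V,)

        # Each consecutive (k, V) chunk of y_blocked now holds all k gates for
        # one V-wide slice of the hidden state, so the gate nonlinearities can
        # be applied block by block without reading D-strided locations.
        chunk0 = y_blocked[: k * V].reshape(k, V)
        assert np.allclose(chunk0[0], y_ref[0:V])                # gate 0, block 0
        assert np.allclose(chunk0[1], y_ref[D:D + V])            # gate 1, block 0
        assert np.allclose(chunk0[2], y_ref[2 * D:2 * D + V])    # gate 2, block 0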
