[RFC] Symbolic shape runtime
See original GitHub issue

Problem: In real-world workloads, not everything is static. We may want to run inference on images of different sizes or with different batch sizes. In some workloads, we may need to concatenate a different number of embeddings for each instance.
Design: Introduce `upper_bound`
The idea for memory allocation is to use different views of one buffer for different inputs. For example, if the inference input shape is [n, 3, 224, 224] and we hint that n's upper bound is 5, then the first memory allocation is based on [5, 3, 224, 224], but we may create views of [1, 3, 224, 224] or [3, 3, 224, 224] at runtime. In this way we trade some memory for dynamism.
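As a rough illustration of this idea (hypothetical names; a `bytearray` stands in for the runtime's storage pool), the pool is allocated once at the upper-bound shape and zero-copy views are handed out per input:

```python
# Sketch: allocate the storage pool once at the upper-bound shape,
# then hand out zero-copy views for each concrete input size.
ITEMSIZE = 4                 # float32 bytes per element
UPPER_N = 5                  # hint: n's upper bound is 5
ROW = 3 * 224 * 224          # elements per batch entry

# One allocation, sized for the worst case [5, 3, 224, 224].
pool = bytearray(UPPER_N * ROW * ITEMSIZE)

def view_for(n):
    """Zero-copy view covering an [n, 3, 224, 224] input, n <= UPPER_N."""
    assert n <= UPPER_N, "n exceeds the declared upper bound"
    return memoryview(pool)[: n * ROW * ITEMSIZE]

small = view_for(1)   # logically [1, 3, 224, 224]
large = view_for(3)   # logically [3, 3, 224, 224]; same backing buffer
```

No new allocation happens per request; both views alias the same upper-bound buffer, which is exactly the memory-for-dynamism trade described above.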
Implementation:
- During memory planning, given a map of `{var: upper_bound}`, the memory planning process creates two columns, `[storage_id, max_bytes]`, for runtime setup.
- The shape in graph JSON is no longer an int, but an infix string expression of variables and constants.
- When setting up the runtime, the storage pool is built from the `[storage_id, max_bytes]` columns. The infix shape expressions are converted to postfix expressions for fast evaluation. When a shape variable changes, the runtime re-evaluates all expressions containing that variable, then updates each view if it is not already in the cache.
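The steps above can be sketched as follows (illustrative names, not the actual TVM implementation): a shape expression such as `"n * 3 + 1"` is converted to postfix once via a minimal shunting-yard pass, then cheaply re-evaluated whenever a shape variable changes.

```python
# Sketch: convert an infix shape expression to postfix once, then
# re-evaluate it quickly whenever a shape variable is rebound.
OPS = {"+": 1, "-": 1, "*": 2, "/": 2}

def to_postfix(tokens):
    """Shunting-yard conversion of a whitespace-tokenized infix expression."""
    out, stack = [], []
    for t in tokens:
        if t in OPS:
            while stack and stack[-1] in OPS and OPS[stack[-1]] >= OPS[t]:
                out.append(stack.pop())
            stack.append(t)
        elif t == "(":
            stack.append(t)
        elif t == ")":
            while stack[-1] != "(":
                out.append(stack.pop())
            stack.pop()
        else:
            out.append(t)
    out.extend(reversed(stack))
    return out

def evaluate(postfix, env):
    """Evaluate a postfix expression; env binds variable names to ints."""
    stack = []
    for t in postfix:
        if t in OPS:
            b, a = stack.pop(), stack.pop()
            stack.append({"+": a + b, "-": a - b,
                          "*": a * b, "/": a // b}[t])
        else:
            stack.append(env[t] if t in env else int(t))
    return stack[0]

pf = to_postfix("n * 3 + 1".split())
assert evaluate(pf, {"n": 2}) == 7
assert evaluate(pf, {"n": 5}) == 16
```

The conversion cost is paid once at setup; per-change evaluation is a single linear scan over the postfix tokens.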
Issue Analytics
- State:
- Created 5 years ago
- Reactions: 1
- Comments: 17 (17 by maintainers)
Top GitHub Comments
Let’s decompose this problem.
In order to support dynamic shapes, we have to fix two problems:
1. Memory management.
2. Kernel support.
Memory management determines whether a program can run at all, and only then affects the efficiency of the runtime. Kernel support, however, only determines efficiency.
If you agree with this, we can further decompose what blocks us: memory management.
There are two ways to handle dynamic memory: allocate statically and create many views on that static memory, or allocate dynamically at runtime; call them AOT and JIT. Ideally, given a sufficiently powerful allocator, the two approaches are equivalent.
In this RFC we take the AOT approach. To know how much memory to allocate for each storage pool node, we again have two options: 1. give a hint, then get a number; 2. pre-compute an expression for each pool node's size. To get things working quickly, this PR uses a `max_bytes` hint to get a fixed size, then creates views. Note that if we could obtain an expression for each storage node, a JIT allocator would become much easier as well.
At this point, we have a solution for memory management. Let's come back to the efficiency problem: kernels.
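Option 1 above, sizing each pool node from the upper-bound hints, might look like this (a hedged sketch with hypothetical names, not the actual planner code):

```python
from functools import reduce

DTYPE_BYTES = {"float32": 4, "int64": 8}

def max_bytes(shape_expr, upper_bounds, dtype="float32"):
    """Worst-case byte size of a tensor whose shape may contain symbolic
    dims. shape_expr is a list of ints or variable names; each variable
    is replaced by its declared upper bound before multiplying."""
    dims = [d if isinstance(d, int) else upper_bounds[d] for d in shape_expr]
    return reduce(lambda a, b: a * b, dims, 1) * DTYPE_BYTES[dtype]

# With the hint n <= 5, the pool entry for [n, 3, 224, 224] is sized as:
assert max_bytes(["n", 3, 224, 224], {"n": 5}) == 5 * 3 * 224 * 224 * 4
```

The `[storage_id, max_bytes]` column mentioned earlier would then record one such worst-case size per storage node.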
Assuming we do tensorization + vectorization + loop partitioning, we will have a good default kernel (most hand-written code today follows this pattern).
Now assume we have a different kernel for each workload. If at compile time we can build a map to the compiled functions, we can dispatch among them when creating views (see `view_cache_`).

----------------------------This is a separation line------------------------
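The role of `view_cache_` can be sketched like this (hypothetical Python stand-in for the runtime's cache; the real implementation lives in C++): a view is keyed by storage id and concrete shape, and only created on a cache miss.

```python
# Sketch of a view cache: when a shape variable changes, look up the
# concrete shape first and only create a new view on a miss.
view_cache = {}
creations = 0   # counts actual view creations, for illustration

def get_or_create_view(storage_id, shape, pool):
    global creations
    key = (storage_id, shape)
    if key not in view_cache:
        creations += 1
        nbytes = 4  # float32
        for d in shape:
            nbytes *= d
        view_cache[key] = memoryview(pool[storage_id])[:nbytes]
    return view_cache[key]

pool = {0: bytearray(5 * 3 * 224 * 224 * 4)}
a = get_or_create_view(0, (2, 3, 224, 224), pool)
b = get_or_create_view(0, (2, 3, 224, 224), pool)  # cache hit, no new view
assert a is b and creations == 1
```

A per-shape compiled-kernel map could be keyed the same way, which is why dispatch fits naturally at view-creation time.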
End-to-end changes:
Python API:
In graph JSON, three attributes are added:
In the runtime, if `symbolic_shape` is true, memory allocation is sized according to the `max_bytes` field. When shape variables change, new views are created. This change works well for dynamic batching, as long as looping over the batch axis in the kernel is not significantly slower.
Superseded by the Relay VM solution; closing for now.