[RFC] Relay Dynamic Runtime

Putting the VM in TVM: The Relay Virtual Machine

We have previously introduced Relay, a new program representation, which is able to represent and optimize a greater breadth of machine learning programs. Unfortunately, by supporting a more expressive set of programs we introduced several new execution challenges. So far we have not provided a production-ready solution for executing full Relay programs, i.e. those containing control flow, abstraction (both data and functional), and dynamism.

We introduced a “debug” interpreter which performs naive AST traversal to execute the program. This approach is conceptually simple but requires traversal of the program for each evaluation. The program is stored as a tree which makes heavy use of indirection and leads to inefficient execution. There are still open dynamism issues such as dynamic scheduling and allocation, fully dynamic tensor shapes, and control flow. The interpreter offers simple strategies for these, but none is a compelling, optimized solution.

The second execution mechanism is the existing graph runtime. In order to target Relay programs to it, we translate a small subset of them to the old graph format and execute them on the runtime. This provides a solid execution experience, but only for a very limited subset of Relay programs.

Finally we have developed an experimental ahead-of-time compiler, which transforms a Relay program into a shared library which implements it. This final approach provides compelling performance but is hard to extend with new approaches to handling dynamic programs.

This RFC proposes a new virtual machine for Relay programs. The virtual machine is designed to strike a balance between performance and flexibility when deploying and executing Relay programs, without giving up the benefits of TVM.

Virtual machine (VM) design is a well-studied area in programming languages and systems, and there have been various virtual machine designs for both full-fledged and embedded programming languages. Previous language VM designs have been heavily tailored to the execution profile of traditional programs. Traditional programs manipulate small scalar values and consist of a large number of low-level instructions. The sheer quantity of instructions to compute requires instruction execution and dispatch to be extremely efficient.

In the context of machine learning we manipulate primarily tensor values, using a (relatively) low number of high-level instructions. An ML program’s cost centers are expensive operator invocations, such as GEMM or convolution, over large inputs. Due to the execution profile exhibited by ML programs, the micro-optimizations present in scalar VMs are dramatically less important. A model’s runtime will be dominated by executing expensive operators on large inputs.

TVM has provided strong support for vision models, but we want to grow to support a wider variety of models. The graph runtime is able to utilize the fully static nature of the input graphs to perform aggressive optimization such as fully static allocation and optimal memory reuse. When we introduce models which make use of control flow, recursion, dynamic shapes, and dynamic allocation, we must change how execution works.

The rest of this design document focuses on explaining a VM which addresses these challenges and explores the design decisions which remain.

Proposed Design

I have been experimenting with different designs and discussing how to solve this problem with members of the community for the past few months.

Our belief is that the most important design aspects will be optimizing for cheap “allocation” of objects (by trying to avoid real allocation), reuse of static fragments, and the ability to handle dynamic (i.e. jagged) tensors.

Instruction Set

The critical design choice of a VM is the instruction set and its representation. The current representation of the instructions is a tagged union containing the op-code and the data payload. An important design decision is the level of abstraction of the instructions and how they take their data, i.e. RISC vs. CISC and fixed-width vs. variable-length instruction encoding. The current version is closer to CISC, with complex instructions like AllocTensor, and is variable length due to the inclusion of the shape as part of the instruction. The current instruction set is very high level and corresponds roughly to high-level operations in Relay.
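As a rough illustration of this encoding, here is a minimal sketch of what the tagged union might look like; the real definition lives in vm.h, and the field names below are assumptions made only for this sketch.

// A sketch of the tagged-union instruction encoding described above; see
// vm.h for the actual definition. Field names are illustrative only.
enum class Opcode {
  Push, Pop, Ret, InvokePacked, AllocTensor, AllocDatatype, AllocClosure,
  GetField, If, Goto, Invoke, InvokeClosure, LoadConst
};

struct Instruction {
  Opcode op;               // the tag: which instruction this is
  union {                  // the payload, interpreted according to `op`
    size_t stack_index;    // Push
    size_t pop_count;      // Pop
    size_t pc_offset;      // Goto
    size_t const_index;    // LoadConst
    // Multi-field payloads (InvokePacked, If, AllocTensor, ...) are elided;
    // AllocTensor's shape is what makes the encoding variable length.
  };
};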

Push

Arguments: size_t stack_index

Reads the value at base pointer + stack_index and pushes it onto the stack.

Pop

Arguments: size_t pop_count

Pops pop_count number of entries off the stack starting from the end.

Ret

Arguments: None

Returns from the current function call, popping the last frame off the frame stack and restoring the VM state that was recorded at the last call site. The last value on the stack is interpreted as the return value.

InvokePacked

Arguments: size_t packed_index size_t arity size_t output_size

Invoke the packed function denoted by packed_index. The arity and output size are used to inform the VM how many inputs and outputs to expect.
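For intuition, a minimal sketch of the effect of InvokePacked, assuming arity is 2 with a single pre-allocated output on top of the stack; ToNDArray is a hypothetical helper that unwraps a tensor VMObject, and packed_index is read from the instruction.

// Illustrative only: hand the top `arity` stack entries to the fused kernel.
const tvm::runtime::PackedFunc& kernel = packed_funcs[packed_index];
tvm::runtime::NDArray input  = ToNDArray(stack[stack.size() - 2]);
tvm::runtime::NDArray output = ToNDArray(stack[stack.size() - 1]);  // allocated by AllocTensor
kernel(input, output);  // the kernel writes its result into `output`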

AllocTensor

Arguments: const std::vector<int64_t>& shape, DLDataType dtype

Allocate a tensor value of the appropriate shape and dtype.

AllocDatatype

Arguments: size_t tag, size_t num_fields

Allocate a data type with the tag tag using the top num_fields entries on the stack as its fields.

AllocClosure

Arguments:

  • size_t func_index
  • size_t num_freevar

Allocate a closure with the VMFunction at func_index as its code, and the top num_freevar entries on the stack as its free variables.

GetField

Arguments:

  • size_t object_offset
  • size_t field_index

Reads the object at object_offset on the stack and pushes its field at index field_index onto the stack.

If

Arguments:

  • size_t true_branch
  • size_t false_branch

Check whether the top element on the stack is true or false. If true, jump relative by true_branch; otherwise, jump relative by false_branch.

Goto

Arguments:

  • size_t pc_offset

Relative unconditional jump by pc_offset.

Invoke

Arguments:

  • size_t func_index

Invokes the function at func_index, consuming the number of arguments specified in the VMFunction’s arity field, and places the return value on the stack as the top element.

InvokeClosure

Arguments: None

Expects the top value on the stack to be a closure. Invokes the closure, consuming the number of arguments declared in the closure’s VMFunction, and places the return value on the stack.

LoadConst

Arguments:

  • size_t const_index

Load the constant at const_index from the constant pool.
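To show how these instructions compose, here is a purely illustrative bytecode listing (hypothetical mnemonics and offsets, not actual compiler output) for a Relay conditional whose branches each call one fused kernel:

0: LoadConst 0          ; push the predicate tensor
1: If 1 5               ; if true jump +1 (true branch), else jump +5 (false branch)
2: Push 0               ; true branch: push the argument at base pointer + 0
3: AllocTensor ...      ; allocate the output tensor
4: InvokePacked 0 2 1   ; call fused kernel 0 (arity 2, one output)
5: Goto 4               ; skip over the false branch
6: Push 0               ; false branch: push the argument at base pointer + 0
7: AllocTensor ...      ; allocate the output tensor
8: InvokePacked 1 2 1   ; call fused kernel 1 (arity 2, one output)
9: Ret                  ; return the value on top of the stack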

Object Representation

We use a simple object representation that uses shared pointers and tagging. There is a huge space of object representations to trade off here, but we believe micro-optimizing this code has little to no effect on end-to-end performance.

struct VMObjectCell {
  VMObjectTag tag;
  ...
};

struct VMObject {
  std::shared_ptr<VMObjectCell> ptr;
  ... 
};

See vm.h for more details.

Currently we support three types of objects: tensors, data types, and closures.

VMObject VMTensor(const tvm::runtime::NDArray& data);
VMObject VMDatatype(size_t tag, const std::vector<VMObject>& fields);
VMObject VMClosure(size_t func_index, std::vector<VMObject> free_vars);
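As a usage sketch (illustrative only; exact NDArray construction details vary by TVM version), wrapping a TVM tensor and building a small ADT value looks like:

// Wrap an NDArray into a VM object, then tag two values as the fields of an ADT.
tvm::runtime::NDArray data =
    tvm::runtime::NDArray::Empty({1, 224, 224, 3}, {kDLFloat, 32, 1}, {kDLCPU, 0});
VMObject tensor = VMTensor(data);
VMObject pair = VMDatatype(/*tag=*/0, {tensor, tensor});  // e.g. a 2-tuple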

Stack and State

The Relay VM consists of two important stacks: the value stack, which acts as the normal call stack, as used in C/C++/Java/etc., and the frame stack, which contains information about how to resume the previous call.

A stack machine is straightforward to implement. Register-based virtual machines are more efficient in the scalar world, but we believe that in the tensor world the complexity-to-performance-gain tradeoff is not worth it.

We keep track of the set of Relay functions we have called, a pointer into the current function’s bytecode, an offset into that bytecode known as the program counter, and an offset into the value stack, known as the base pointer, which tells us where the current stack frame begins.

struct VirtualMachine {
    ...
    std::vector<VMFrame> frames;
    std::vector<VMObject> stack;
    ... 
    // Current function.
    size_t func_index;
    // Pointer into the current function's instructions.
    const Instruction* code;
    // Current program counter relative to the code pointer.
    size_t pc;
    // The current base pointer.
    size_t bp;
    ... 
};
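Each entry of frames is a snapshot of this state taken at a call site. A sketch of what such a frame might hold (see vm.h for the real definition; field names here are assumptions):

// Enough saved state for Ret to resume the caller.
struct VMFrame {
  size_t func_index;        // the caller's function
  size_t pc;                // where to resume in the caller's bytecode
  const Instruction* code;  // the caller's instruction stream
  size_t bp;                // the caller's base pointer into the value stack
};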

Dispatch Loop

A very critical piece of a VM is the dispatch loop. Usually this dominates the execution time of a virtual machine, but experimentally we have found the performance of the loop not to be of much importance. We have implemented a simple switch/goto dispatch loop which dispatches based on the instruction op-code.

This loop is implemented by VirtualMachine::Run().

It is my belief that this code is not as important to end-to-end performance as allocation and memory reuse.
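For concreteness, a minimal sketch of the switch-based dispatch style described above (not the actual VirtualMachine::Run(); PopFrame and the payload field names are illustrative):

// Dispatch on the op-code of the current instruction until the program returns.
while (true) {
  const Instruction& instr = code[pc];
  switch (instr.op) {
    case Opcode::LoadConst:
      stack.push_back(constants[instr.const_index]);
      pc++;
      break;
    case Opcode::Goto:
      pc += instr.pc_offset;
      break;
    case Opcode::Ret:
      PopFrame();  // restore code, pc, and bp from the frame stack
      break;
    // ... remaining op-codes elided
  }
}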

VM Compiler

An important part of this infrastructure is a compiler from Relay’s full IR into a sequence of bytecode. The VM compiler transforms a tvm::relay::Module into a tvm::relay::vm::VirtualMachine. The virtual machine contains a set of compiled functions, each a tvm::relay::vm::Function, which holds metadata about the function as well as its compiled bytecode. For the full definition of the data structures, see vm.h.

Optimizations

There are quite a few optimizations required by the VM compiler.

We have implemented them in the old pass style, but plan to port them to the new pass manager (#2546) before merging.

  • A-Normal Form
  • Lambda Lift (see src/relay/vm/lambda_lift.cc)
  • Inline Primitives (see src/relay/vm/inline_primitives.cc)
  • Inliner (see src/relay/pass/inliner.cc)
  • Tail Call Optimization (see …)
  • Constant Pool Layout (see …)
  • ADT Tag Allocation (see …)
  • Liveness Analysis (see …)

Serialization

A final and yet-to-be-implemented part of the VM design is serialization. The accompanying PR will introduce the bytecode, its serialization, as well as VM-level serialization. The idea is that a VM can be efficiently stored to disk and resumed at a later time. This would also allow us to efficiently schedule many models onto a single machine in order to obtain good utilization.

Unresolved Questions

How do we handle dynamic shapes?

I have another prototype extension to Relay which adds initial support for compiling and executing programs containing fully dynamic shapes. I will post an RFC and prototype PR on this subject soon.

How can we modify the VM to support JIT compilation of certain code paths?

In the code generation space there are still many tradeoffs to be analyzed and the VM is designed to be very flexible so we can modify it for future experiments.

How do we support heterogeneous execution?

Heterogeneous execution should work out of the box, assuming we have annotated the appropriate device copies. In order to do this properly we need to run the device annotation and copying passes. We foresee nothing too complex in this work.


Top GitHub Comments

jroesch commented, Mar 16, 2019

My personal opinion is that we need to abandon modeling VMs as high-level versions of an ISA, and their implementation as micro-architecture. We are in a fundamentally different world for executing Relay programs. The VM is only the control plane; the data plane is the computation contained in highly optimized kernels produced by TVM.

We no longer compute over scalars, so the lessons learned from building scalar VMs and ISAs don’t really apply.

The program in question is a high-level data-flow program where an execution order has been selected. One simple implementation of this instruction set changes each function call to place a future on the stack, which will eventually be completed by executing the function in parallel.

This design enables asynchronous and out of order execution while being completely opaque to the VM’s compiler and end-users. If we want to enable higher level forms of speculation and parallelism we can borrow ideas from parallel functional languages to change the program before scheduling on the VM (see https://simonmar.github.io/bib/papers/monad-par.pdf).

Simon has quite a few different designs in place and has used some of them to build a highly concurrent system now deployed at Facebook, the core of the library is here (https://github.com/facebook/Haxl).

On the topic of recursion: iteration and recursion are isomorphic. There is no fundamental difference; any tail-recursive function can be converted to a loop and vice versa. In this case we perform tail-call optimization, which removes the recursive call, leaving only a goto instruction.
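As a concrete (non-Relay) illustration of that equivalence, a minimal sketch:

// Illustration only: a tail-recursive sum and the branch/goto form that
// tail-call optimization effectively produces.
int sum_to(int i, int acc) {
  if (i == 0) return acc;
  return sum_to(i - 1, acc + i);   // tail call: nothing happens after it
}

int sum_to_tco(int i, int acc) {
start:
  if (i == 0) return acc;
  acc = acc + i;
  i = i - 1;
  goto start;                      // the recursive call became a jump
}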

If we wanted to perform instruction level scheduling this would be no worse than the code generated by a loop, but maintains the high level properties enjoyed by Relay today.

One feature that may be more challenging is speculative execution. Speculative execution at the instruction level isn’t particularly interesting in my opinion. The only meaningful speculation is at control boundaries such as loops or conditionals, as TensorFlow does (https://www.tensorflow.org/api_docs/python/tf/while_loop, see parallel iterations).

If we want to provide non-strict semantics we should just transform the source program and use a simple mechanism like the one exposed in the Par paper described above.

A final argument is that the current graph runtime is effectively a stack, with each node having pointers into it; the only difference is that it cannot grow or shrink.

The most important concern in my opinion is memory optimization. The current prototype has a new allocator API which we intend to leverage. We can perform dynamic allocation with a specialized allocation cache which will avoid repeatedly allocating memory from the system allocator.

The next step is to perform an optimization where we group allocations into blocks, and then request a dynamic block of the correct size. In this case code like loops which release, and re-request an identical size block each iteration will end up reusing the same block.
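A minimal sketch of that idea, assuming a size-keyed free list (the names here are illustrative, not the proposed allocator API):

#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Blocks released by one loop iteration are handed back to the next request
// of the same size instead of going through the system allocator.
class CachingAllocator {
 public:
  void* Alloc(size_t nbytes) {
    auto& blocks = free_blocks_[nbytes];
    if (!blocks.empty()) {
      void* block = blocks.back();   // reuse an identically sized block
      blocks.pop_back();
      return block;
    }
    return std::malloc(nbytes);      // fall back to the system allocator
  }
  void Release(void* block, size_t nbytes) {
    free_blocks_[nbytes].push_back(block);  // keep it around for reuse
  }
 private:
  std::unordered_map<size_t, std::vector<void*>> free_blocks_;
};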

A final optimization is applying a static analysis, like the current memory planner, to recover static allocation sizes. Note that we can no longer do this in all cases due to dynamic behavior (loops), and when we introduce dynamic tensors the shapes will not be known until runtime.

jroesch commented, Mar 25, 2019

The implementation of the hardware is not important to the high-level Relay program because all Tensor-to-Tensor functions are black boxes. They may be implemented any way you want: in C++, in TVM, or as a hardware accelerator primitive. In order to map a subset of the program down to hardware you will have to unroll it, which is required for most fixed-function hardware. You can then replace the unrolled program as a new operation and rewrite the program to use this instead.

Hardware that does not support arbitrary and dynamic program sizes cannot execute all models of interest; such models fundamentally don’t fit into Halide/TVM-style DSLs. The deep learning community has focused on optimizing a small subset of models with very regular behavior, but the next wave of models invalidates assumptions such as the statically known dimensions or static control flow required by polyhedral optimizers. The point of the Relay VM is to coordinate at a higher level where you need iteration, dynamic allocation, and communication.

I have thought further about a register-based VM and see no strong argument for why registers are better than stacks. Most of the research on dynamic VMs focuses on this distinction in order to reduce memory movement and dispatch overhead while executing the application. Packed functions will dominate execution time, so optimizing for dispatch is an incredibly premature optimization.

The other argument for register-based VMs is instruction-level parallelism. Again, instructions don’t matter much here; meaningful parallelism happens at data dependencies between operators, and inside the operators themselves (e.g. parallel matrix multiplication).

The point of the parallel monad paper is not to use their technique for the source language, but to use the execution model to get parallelism between operator invocations. We can view the future graph as the data dependency graph and do graph reduction over it.

For example, if I depend on a sequence of function calls, it is valid to evaluate them in parallel while evaluating a future computation that may depend on the results. The amount of synchronization needed here is very small, and the real opportunity for parallelism is inside operators. We don’t need to worry about where the results are stored; we essentially give a result a register name when we push a future into stack position n.

In a sense we already have infinite registers, because we can address any stack position. In this case we can easily address a future result by referencing position n. The only difference is the location where operations look for their results. We need a call stack for functions, and function calls are the primary operation based on observations of current workloads.

Furthermore, the current approach makes the VMCompiler far simpler and easier to extend.

I personally value simplicity, and we have zero evidence that the current approach is slow; in fact, we have evidence to the contrary. The initial prototype is already faster than MXNet’s executor, which is used in production at AWS.
