[RFC] Register Relay VM design

Register VM Design

The current Relay VM RFC (#2810) proposes a stack-based VM design that uses push and pop to maintain a unified stack. Though the design is simple to implement, it makes tasks such as dataflow analysis cumbersome and enforces a particular execution order.

I propose a register-based VM design as an alternative. The main difference is that a register-based VM uses registers, rather than the stack, to designate operands and results. Registers in the VM are virtual registers, and each one is a reference to a VMObject. We assume there are infinitely many registers, so the runtime doesn’t need to worry about register spilling. Registers are in SSA form, so each register index is unique within a VMFunction.

The calling convention in the register VM is also simple. Each function has its own local register file. The first k registers in the callee’s register file hold the arguments passed in by the caller. The VMObject in the return register is allocated by the callee rather than the caller, which avoids an unnecessary memory copy. The caller then assigns a reference to the VMObject in the return register to one of its own registers (see the Invoke instruction for details).

I summarize some pros and cons for register VM compared to stack VM.

Pros:

Data dependency analysis is straightforward because the registers themselves reveal the data dependencies. In comparison, the stack VM requires you to simulate the program and recover the stack indices in order to get the data dependencies. As a result, it’ll be easier to implement a dataflow executor or async executor.
If/else scope is easier to handle. The stack VM needs to move objects to the right place in the stack when it enters the if or else branch. The register VM only needs a new Phi instruction to select the result from either branch.
Tail recursion optimization should be easier, since we only need to move the registers to the correct slots in the register file.
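To illustrate the first point, here is a minimal sketch of recovering producer-to-consumer edges from register instructions in a single pass. The `Instr` struct and `DataDeps` function are hypothetical and are not part of the RFC or of TVM. The sketch assumes SSA form, so each register has at most one defining instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified instruction: one destination register and a
// list of source registers. This is not the RFC's actual encoding.
struct Instr {
  int dst;                // register written, or -1 if none
  std::vector<int> srcs;  // registers read
};

// Returns (producer, consumer) pairs of instruction indices.
std::vector<std::pair<std::size_t, std::size_t>> DataDeps(
    const std::vector<Instr>& code) {
  std::unordered_map<int, std::size_t> def_site;  // register -> defining instr
  std::vector<std::pair<std::size_t, std::size_t>> edges;
  for (std::size_t i = 0; i < code.size(); ++i) {
    for (int r : code[i].srcs) {
      auto it = def_site.find(r);
      // Registers with no definition in `code` are treated as arguments.
      if (it != def_site.end()) edges.emplace_back(it->second, i);
    }
    if (code[i].dst >= 0) {
      // SSA assumption: a register is defined at most once.
      assert(def_site.count(code[i].dst) == 0);
      def_site[code[i].dst] = i;
    }
  }
  return edges;
}
```

No stack simulation is needed here: the operand registers name their producers directly.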

Cons:

There is some memory overhead from keeping empty slots for registers that are never used.
Registers need to be killed at the end of their life cycle. This can be done either by annotating the life cycle of each register or by inserting kill instructions into the program during life cycle analysis. Note that a life cycle analysis pass is needed by both the stack VM and the register VM.
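As a rough sketch of the annotation approach, the function below records the last instruction that touches each register, which is where a kill could be placed. `Instr` and `LastUse` are hypothetical names. The sketch assumes instructions are visited in program order with no backward jumps; code with loops would need a proper liveness analysis.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical, simplified instruction (not the RFC's encoding).
struct Instr {
  int dst;                // register written, or -1 if none
  std::vector<int> srcs;  // registers read
};

// Maps each register to the index of the last instruction that reads or
// writes it. A kill for that register could be inserted after that index.
std::map<int, std::size_t> LastUse(const std::vector<Instr>& code) {
  std::map<int, std::size_t> last;
  for (std::size_t i = 0; i < code.size(); ++i) {
    if (code[i].dst >= 0) last[code[i].dst] = i;
    for (int r : code[i].srcs) {
      assert(r >= 0);  // source operands must name a real register
      last[r] = i;
    }
  }
  return last;
}
```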

Instructions

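We modify the current stack VM instructions to use registers as operands and results. Registers are designated by $k, where k is the register index. *regs indicates a list of registers. The comment after each instruction shows the same operation in assignment form, writing register $k as %k.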

AllocTensor

AllocTensor $1, ndim, *shape, dtype  ; %1 = AllocTensor(ndim, *shape, dtype)

Allocates a tensor object and stores it in register $1.

AllocDatatype

AllocDatatype $1, tag, num_fields, *regs  ; %1 = AllocDatatype(tag, num_fields, *regs)

Allocates a datatype object and stores it in register $1. *regs is a list of registers containing the fields of the datatype.

AllocClosure

AllocClosure $1, func_index, num_freevars, *regs  ; %1 = AllocClosure(func_index, num_freevars, *regs)

Allocates a closure object for the function at func_index and stores it in register $1. *regs lists the num_freevars registers that hold the free variables.

LoadConst

LoadConst $1, const_index  ; %1 = LoadConst(const_index)

Loads the constant at const_index from the constant pool into register $1.

Mov

Mov $2, $1  ; %2 = Mov(%1)

Creates a reference to the VMObject in register $1 and stores it in $2.

Note: This instruction does not copy any data. An actual data copy should be done by a PackedFunction.
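The reference semantics of Mov can be sketched as follows, assuming a VMObject behaves like a reference-counted handle. The `Tensor` type, the `VMObject` alias, and the `Mov` helper are stand-ins for illustration and are not TVM's actual definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in payload type for illustration only.
struct Tensor {
  std::vector<float> data;
};

// Assumption: a VMObject is a reference-counted handle to its payload.
using VMObject = std::shared_ptr<Tensor>;

// Hypothetical Mov: copies the handle, never the tensor data.
void Mov(std::vector<VMObject>& regs, std::size_t dst, std::size_t src) {
  assert(dst < regs.size() && src < regs.size());
  regs[dst] = regs[src];
}
```

After `Mov(regs, 2, 1)`, both registers point at the same tensor, so a write through one is visible through the other.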

Phi

Phi $3, $1, $2  ; %3 = Phi(%1, %2)

Takes the VMObject in either $1 or $2 and stores it in $3.

Note: This instruction requires the VMObjects in registers $1 and $2 to have the same type, and only one of them should be valid at runtime.
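A minimal sketch of that selection, under the assumption that an invalid register is represented by a null handle. The `VMObject` alias and the `Phi` helper are hypothetical, and the RFC does not say how validity is tracked.

```cpp
#include <cassert>
#include <memory>

// Assumption: a VMObject is a nullable, reference-counted handle, and a
// register that was never written on the taken path holds nullptr.
using VMObject = std::shared_ptr<int>;

// Hypothetical Phi: exactly one input is expected to be valid at runtime.
VMObject Phi(const VMObject& a, const VMObject& b) {
  assert((a != nullptr) != (b != nullptr));
  return a ? a : b;
}
```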

Ret

Ret $1

Returns the register $1.

GetField

GetField $2, $1, field_index  ; %2 = GetField(%1, field_index)

Gets the field at field_index of the object in $1 and stores it in $2.

If

If $1, true_offset, false_offset

Checks whether $1 is true or false. If true, jumps by the relative offset true_offset; otherwise jumps by the relative offset false_offset.

Goto

Goto pc_offset

Relative unconditional jump by pc_offset.
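The relative jumps used by If and Goto could be sketched as below. The helper names are hypothetical, and the sketch assumes offsets are measured from the index of the jumping instruction itself, which the RFC does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: applies a signed relative offset to the program
// counter and checks that the target stays inside the function's code.
std::size_t Jump(std::size_t pc, std::int64_t offset, std::size_t code_size) {
  std::int64_t target = static_cast<std::int64_t>(pc) + offset;
  assert(target >= 0 && static_cast<std::size_t>(target) < code_size);
  return static_cast<std::size_t>(target);
}

// Hypothetical If: picks one of the two offsets based on the condition.
std::size_t ExecIf(bool cond, std::size_t pc, std::int64_t true_offset,
                   std::int64_t false_offset, std::size_t code_size) {
  return Jump(pc, cond ? true_offset : false_offset, code_size);
}
```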

InvokePacked

InvokePacked packed_index, arity, output_size, *regs

Invokes the packed function denoted by packed_index.

Note: The number of registers in *regs should be arity + output_size, where the first arity registers are the arguments and the rest are the output registers.
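The operand layout in the note can be sketched as a split of *regs into inputs and outputs. `PackedArgs` and `SplitPackedRegs` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical view of InvokePacked's register operands.
struct PackedArgs {
  std::vector<int> inputs;   // register indices read by the packed function
  std::vector<int> outputs;  // register indices written by the packed function
};

// Splits *regs into the first `arity` argument registers and the
// remaining `output_size` output registers.
PackedArgs SplitPackedRegs(const std::vector<int>& regs, std::size_t arity,
                           std::size_t output_size) {
  assert(regs.size() == arity + output_size);
  auto mid = regs.begin() + static_cast<std::ptrdiff_t>(arity);
  return PackedArgs{std::vector<int>(regs.begin(), mid),
                    std::vector<int>(mid, regs.end())};
}
```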

Invoke

Invoke $1, func_index, arity, *regs  ; %1 = Invoke(func_index, arity, *regs)

Invokes the VM function at func_index.

Note: The number of registers in *regs should be arity. Register $1 receives a reference to the VMObject in the callee’s return register.

InvokeClosure

InvokeClosure $2, $1, arity, *regs  ; %2 = InvokeClosure(%1, arity, *regs)

Invokes the closure in register $1 and saves the result into register $2.

Note: Register $1 must hold a VMClosureObject.

Stack and State

To convert to a register-based VM, we also need to adjust the VM’s data structures. We assume there are infinitely many registers available, and each function has its own register file. Each time an Invoke instruction is executed, the runtime creates a new VMFrame. It pre-allocates the maximum number of registers used by the callee function (this number can be derived during VMCompile) and assigns the arguments to the first args slots of the register file.

struct VMFrame {
  // Current function
  size_t func_index;
  // Pre-allocate max number of registers used in the functions
  std::vector<VMObject> registers;
  // Number of arguments
  size_t args;
  // Pointer into the current function's instructions.
  const Instruction* code;
  // Current program counter
  size_t pc;
};
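Frame creation on Invoke might look roughly like the sketch below. It uses a trimmed-down `VMFrame` without the `code` pointer, a stand-in `VMObject` handle type, and a hypothetical `MakeFrame` helper, none of which are taken from the actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Assumption: a VMObject is a nullable, reference-counted handle.
using VMObject = std::shared_ptr<int>;

// Trimmed-down frame for illustration; the `code` pointer is omitted.
struct VMFrame {
  std::size_t func_index;
  std::vector<VMObject> registers;
  std::size_t args;
  std::size_t pc;
};

// Hypothetical helper: pre-allocates the callee's register file and
// places the argument handles in the first args.size() slots.
VMFrame MakeFrame(std::size_t func_index, std::size_t num_registers,
                  const std::vector<VMObject>& args) {
  assert(args.size() <= num_registers);
  VMFrame frame{func_index, std::vector<VMObject>(num_registers),
                args.size(), 0};
  for (std::size_t i = 0; i < args.size(); ++i) {
    frame.registers[i] = args[i];  // copies the handle, not the data
  }
  return frame;
}
```

The remaining slots start out empty and are filled as the callee executes.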


Top GitHub Comments

5 reactions

wweic commented, Apr 2, 2019

Summary

@tqchen @icemelon9 @jroesch @zhiics @yongwww we discussed this in person and reached the following consensus:

Remove Phi instruction. Instead extend If to write the result to a new register.
Reuse the existing value stack as the register file. Have an anchor in the function frame to point to the start of each function’s register region.
Try to do liveness analysis on the Relay AST, to reuse the pass manager infrastructure and avoid introducing extra interfaces.
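The second point, one shared value stack with a per-frame anchor, could be sketched like this. `RegisterStack` and `Frame` are hypothetical, and `int` stands in for VMObject to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical frame record: where this function's registers start in
// the shared stack, and how many it owns.
struct Frame {
  std::size_t base;
  std::size_t num_registers;
};

class RegisterStack {
 public:
  // Enter a function: reserve its register region on top of the stack.
  void Push(std::size_t num_registers) {
    frames_.push_back({values_.size(), num_registers});
    values_.resize(values_.size() + num_registers);
  }
  // Leave a function: drop its register region.
  void Pop() {
    assert(!frames_.empty());
    values_.resize(frames_.back().base);
    frames_.pop_back();
  }
  // Register $k of the current function is stack slot base + k.
  int& Reg(std::size_t k) {
    assert(!frames_.empty() && k < frames_.back().num_registers);
    return values_[frames_.back().base + k];
  }
  std::size_t Size() const { return values_.size(); }

 private:
  std::vector<int> values_;  // int stands in for VMObject
  std::vector<Frame> frames_;
};
```

With this layout a call grows the shared stack instead of allocating a fresh vector for every frame.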

Let me know if I missed anything or got something wrong. I’ll take the liveness analysis on opcodes out of my branch and polish the remaining pieces (register VM + linear scan + interfaces). Since we don’t have liveness analysis on the Relay AST yet, I’ll simply generate a live interval for each register that spans the full opcode range, so the register allocator assigns a unique slot to each register.

4 reactions

wweic commented, Apr 1, 2019

I explored @icemelon9’s register VM design in my branch: https://github.com/wweic/tvm/commits/relay-rts. I’d like to share some data points.

We need to add special registers to the VM for function arguments and the return value. The return register is necessary because when the callee returns, its register file is gone, so the return value must be persisted outside the register file.

Opcode generation in the register VM is easier: the compiler doesn’t need to track stack offsets throughout, it just remembers the variable-to-register mapping. Then we need register allocation; I tried liveness analysis + linear scan. It’s pretty straightforward since the opcodes are relatively simple. I’m not sure I understand @tqchen’s concern about handling control flow in register allocation. Note that Haichen’s register VM uses a local register file per function, so we do register allocation per function and shouldn’t need to worry about interprocedural register allocation. As for branches inside a function, the live interval of a register can be the range from the earliest opcode to the last opcode at which the register is live. I think linear scan would work fine under this setting.
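For readers unfamiliar with linear scan, here is a minimal sketch over live intervals of the kind described above. It is not the code from the branch. `Interval` and `LinearScan` are hypothetical names, and because the VM assumes unlimited slots, the sketch never spills.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical live interval: the register is live over opcode indices
// [start, end], inclusive.
struct Interval {
  int reg;
  std::size_t start;
  std::size_t end;
};

// Assigns each virtual register a slot, reusing the slots of intervals
// that ended before the current one starts. Returns register -> slot.
std::map<int, std::size_t> LinearScan(std::vector<Interval> intervals) {
  std::sort(intervals.begin(), intervals.end(),
            [](const Interval& a, const Interval& b) {
              return a.start < b.start;
            });
  using Active = std::pair<std::size_t, std::size_t>;  // (end, slot)
  std::priority_queue<Active, std::vector<Active>, std::greater<Active>> active;
  std::vector<std::size_t> free_slots;
  std::size_t next_slot = 0;
  std::map<int, std::size_t> slot_of;
  for (const Interval& iv : intervals) {
    assert(iv.start <= iv.end);
    // Expire intervals that ended strictly before this one starts.
    while (!active.empty() && active.top().first < iv.start) {
      free_slots.push_back(active.top().second);
      active.pop();
    }
    std::size_t slot;
    if (!free_slots.empty()) {
      slot = free_slots.back();
      free_slots.pop_back();
    } else {
      slot = next_slot++;  // unlimited slots, so no spilling
    }
    slot_of[iv.reg] = slot;
    active.push({iv.end, slot});
  }
  return slot_of;
}
```

If every interval spans the whole function, no slot is ever freed and each register ends up with its own slot.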

In the register VM we need to allocate a register file (vector<Object>) per call, which is equivalent to growing the stack in the stack VM. I’m not sure there will be much of a performance difference.

Overall, the register VM is slightly easier to work with, and debugging is painless. But I’m not sure this is convincing enough to favor the register VM. Also, the register VM’s opcodes are larger, which might be a concern for embedded devices.

The major selling point of the register VM is that dataflow is easy to discover. But that raises the question: is the register VM actually better at leveraging the dataflow information? I think it requires more thought. Fundamentally, both VMs have a call stack and will have similar problems when integrating with async execution engines.
