[RFC] Register Relay VM design
See original GitHub issueRegister VM Design
Current Relay VM RFC (#2810) proposes stack-based VM design which uses push and pop to maintain a unified stack. Though the design itself is simple to implement, it is cumbersome in tasks such as dataflow analysis and enforces certain orders in the execution.
I propose a register-based VM design as an alternative design. The main difference is that register-based VM uses registers to designate the operands and results instead of using the stack. Registers in the VM are virtual registers and each is a reference to a VMObject
. We assume there are infinite registers so runtime doesn’t need to worry about register spilling. Registers are in SSA form where its index is unique in each VMFunction.
Calling convention in register VM is also simple. Each function has its own local register file. The first k
registers in the register file of callee function are arguments passed in by caller function, and the VMObject in return register is allocated by callee function instead of caller function to avoid unnecessary memory copy. Caller function then assigns the reference to VMObject pointed by return register into its own register (see detail in Invoke
instruction).
I summarize some pros and cons for register VM compared to stack VM.
Pros:
- Data dependency analysis is self-explaining as register reveals the data dependency. In comparison, stack VM requires you to simulate the program and recover the stack index in order to get data dependency. As a result, it’ll be easier to implement a dataflow executor or async executor.
- If/else scope is easier to handle. Previously stack VM needs to move the objects to the right place in the stack given it enters if or else branch. Now register VM only needs a new
Phi
instruction to select the result from either branch. - Tail recursion optimization should be easier since we only need to move the register to the correct slot in the register file.
Cons:
- There is some memory overhead to keep the empty slots for registers that are never used.
- Need to kill the register after its life cycle. This can be done by either annotating the life cycle of each register or inserting the kill instructions in the program during the life cycle analysis. Note that life cycle analysis pass is needed by both stack VM and register VM.
Instructions
We modify the current stack VM instructions to use registers as operands and results.
Registers are designated by $k
where k
is the register index. *reg
indicates a list of registers.
AllocTensor
AllocTenosr $1, ndim, *shape, dtype ; %1 = AllocTensor(ndim, *shape, dtype)
Allocate a tensor object and stores to register $1
.
AllocDatatype
AllocDatatype $1, tag, num_fields, *regs ; %1 = AllocDatatype(tag, num_fields, *regs)
Allocate a datatype object and stores to register $1
. *regs
is a list of registers containing fields in the datatype.
AllocClosure
AllocClosure $1, func_index, num_freevars, *regs ; %1 = AllocClosure(func_index, num_freevars, *regs)
Allocate a closure object, where there are num_freevars
in *regs
.
LoadConst
LoadConst $1, const_index ; %1 = LoadConst(const_index)
Load the constant at const_index
from the constant pool.
Mov
Mov $2, $1 ; %2 = Mov(%1)
Create a reference to VMObject in register $1
and stores it in $2
.
Note: No data copy happens in this instruction. The real data copy should be a PackedFunction.
Phi
Phi $3, $1, $2 ; %3 = Phi(%1, %2)
Takes the VMObject
either in $1
or $2
and stores in $3
.
Note: This instruction requires VMObject in register $1
and $2
having the same type, and only one of them should be valid during the runtime.
Ret
Ret $1
Returns the register $1
.
GetField
GetField $2, $1, field_index ; %2 = GetField(%1, field_index)
Get the field at field_index
in $1
.
If
If $1, true_offset, false_offset
Check if $1
is true or false. If true relative jump by true_branch, else relative jump by false_branch.
Goto
Goto pc_offset
Relative unconditional jump by pc_offset
.
InvokePacked
InvokePacked packed_index, arity, output_size, *regs
Invoke the packed function denoted by packed_index
Note: Number of registers in *regs
should be arity + output_size
where first arity
registers are arguments and rest are output registers.
Invoke
Invoke $1, func_index, arity, *regs ; %1 = Invoke(func_index, arity, *regs)
Invoke VM function at func_index
Note: Number of registers in *regs
should be arity
. Register $1
will be a reference to the VMObject in the return register.
InvokeClosure
InvokeClosure $2, $1, arity, *regs ; %2 = InvokeClosure(%1, arity, *regs)
Invoke closure function in register $2
and save the result into register $2
.
Note: Register $2
must be a VMClosureObject
.
Stack and State
In order to convert to register-based VM, we also need to adjust the data structure in the VM. We assume there are infinite registers available, and each function has its own register file. Each time an Invoke
instruction is executed, the runtime creates a new VMFrame
. We pre-allocate max number of registers used in the callee function (this number can be derived during VMCompile
) and assigns the arguments to the first args
slots in the registers.
struct VMFrame {
// Current function
size_t func_index;
// Pre-allocate max number of registers used in the functions
std::vector<VMObject> registers;
// Number of arguments
size_t args;
// Pointer into the current function's instructions.
const Instruction* code;
// Current program counter
size_t pc;
};
Issue Analytics
- State:
- Created 4 years ago
- Reactions:3
- Comments:9 (8 by maintainers)
Top GitHub Comments
Summary
@tqchen @icemelon9 @jroesch @zhiics @yongwww we discuss in person. Reached the following consensus:
Phi
instruction. Instead extendIf
to write the result to a new register.Let me know if I miss anything or said something wrong. I’ll take out liveness analysis on opcodes from my branch and polish remaining stuff(register VM + linear scan + interfaces). Since we don’t have liveness analysis on Relay AST now, I’ll simply generate live interval for each register with the full opcodes range, so register allocator can assign unique slot for each register.
I explored @icemelon9 's register vm design in my branch: https://github.com/wweic/tvm/commits/relay-rts. Would like to share some data points.
We need to add special registers in VM for function arguments and return value. Return register is necessary because when callee function returns, its register file has gone. So return value must be persisted outside of register file.
Opcode generation in register VM is easier, compiler doesn’t need to maintain stack offsets all along, just memorize variable to register mapping. Then we need register allocation, I tried liveness analysis + linear scan. It’s pretty straightforward since opcode is relatively simple. I’m not sure I understand @tqchen 's concern about handing control flow in register allocation. Note that Haichen’s register VM uses local register file per function. So we are doing register allocation per function, which shouldn’t need to worry about interprocedural register allocations. As for branches inside a function,
live interval
of register can be the range from the earliest opcode to the last opcode that the register is live. I think linear scan would work fine under this setting.In register VM we need to allocate a register file(
vector<Object>
) per call, equal to growing the stack in stack VM. I’m not sure if there will be much performance difference.Overall register VM is slightly easier to work with, debugging is painless. But I’m not sure if this is convincing enough to favor register VM. Also register VM’s opcode size is larger, it might be a concern for embedded devices.
The major selling point of register VM is easy to discover dataflow. But it begs the questions that is register VM better at leveraging the dataflow information? I think it requires more thought. Fundamentally both VMs have call stack and will have similar problems when integrating with async execution engines.