Fast CPU emulator JIT function lookup
Feature Request
What feature are you suggesting?
Overview:
The CPU emulator works by recompiling Arm code into x86 code that can run on the target platform. Evidently, the code needs to be able to quickly find the respective x86 function for a given Arm function in the in-memory JIT cache. Currently this lookup uses a dictionary on the Translator class. This is reasonably fast, but not as fast as it could be. Furthermore, since the dictionary is a managed object, only accessible through managed functions, it cannot be accessed directly from the native x86 JIT generated code.
Getting the x86 function pointer for a given Arm function thus requires calling NativeInterface.GetIndirectFunctionAddress or NativeInterface.GetFunctionAddressWithHint. Doing so is actually very slow, because it requires native <-> managed transitions (a cost paid every time native code is called from managed, and vice versa). On top of that, we pay the cost of several more calls and a dictionary lookup, which is not that bad on its own, but can become an issue if done often.
In order to minimise the cost of doing this, the translator will attempt to cache those function pointers in a small jump table, a small region of memory reserved for pointers. It works like so:
- For direct calls (when the target address is constant, as with the BL Arm instruction), it reserves one entry on the table that will hold the pointer, but is initially 0. The JIT generated code checks if the value there is 0; if so, it calls GetFunctionAddress, stores the pointer there, and calls it; otherwise it just calls the loaded pointer directly. This avoids the managed call cost from the 2nd run onwards.
- For indirect calls (when the target address is unknown and comes from a register, as with the BLR Arm instruction), it reserves one entry on the dynamic table, which holds an address pair. The first address is the Arm function address, while the second is the host function address. Since it only reserves one entry, it can only possibly cache a single address, so if multiple function addresses are called from the same call site, it will still be doing a lot of managed calls to GetIndirectFunctionAddress, which will hurt performance.
Here, I propose a solution that not only greatly simplifies all this logic, but also improves performance by addressing its shortcomings.
Smaller Details:
The proposed change replaces the dictionary that is currently used for function lookup with a multi-level table. Hash tables are O(1) in the average case and O(n) in the worst case (if the hash function is extremely poor and all elements end up shoved into the same bucket), while the multi-level table approach has guaranteed O(1) complexity. However, the biggest benefit here is that the multi-level table can be easily accessed from native, JIT generated code, without requiring any loops.
Proposed design:
The Arm MMU can only address at most 48 bits of virtual memory. For this reason, I propose a multi-level table that is very similar to the page tables that the MMU itself supports. We should have 4 levels, with each level having 9 bits, and the level corresponding to the least significant bits having whatever remains (48 - 9 * 3 = 21 bits). In order to reduce memory usage, we can take advantage of the fact that Arm instructions are always 4-byte aligned. Thus, the last level can be indexed with the address right shifted by 2, as we know the 2 least significant bits will always be 0, leaving 19 index bits. Note that, for Arm32, Thumb code might be 2-byte aligned, so that should be taken into account as well for 32-bit games.
Furthermore, the current JIT cache is limited to 2GB. We can further reduce the size of the table by storing a 32-bit offset rather than a 64-bit function pointer. It would require an extra instruction on the JIT generated code to add the JIT cache base address to the offset, but allows halving the amount of memory used by the table, which is a very good compromise in my opinion. This also has the downside of limiting the JIT cache size to around 4GB, but that should be plenty (if we need that much memory for JIT generated code, then we're already screwed anyway).
Below I present a more visual representation of how the function address would be split, and how it becomes each index into the multi-level table:
|47       39|38       30|29       21|20        2|
+-----------+-----------+-----------+-----------+
|  Level 3  |  Level 2  |  Level 1  |  Level 0  |
+-----------+-----------+-----------+-----------+
Pseudo-code of how one would get a host function pointer from an Arm guest function address with such a table:
level0 = (armFuncAddress >> 2) & 0x7FFFF; // Bits 0 and 1 are assumed to be always 0, as instructions are 4-byte aligned.
level1 = (armFuncAddress >> 21) & 0x1FF;
level2 = (armFuncAddress >> 30) & 0x1FF;
level3 = (armFuncAddress >> 39) & 0x1FF;
uint**** multiLevelTable; // Should be properly allocated and populated.
hostFuncPointer = jitCacheBasePointer + multiLevelTable[level3][level2][level1][level0];
One might also wish to use a different number of levels, or change the number of bits per level (for example, from 9 to 10), in order to improve memory usage; this was just an example. In the worst case, an optimal table should occupy around the same amount of memory as the uncompressed size of the code sections of all executables (the worst case being that absolutely all the code in the executable was executed, which is very unlikely to happen in a single play session). This estimate counts the memory occupied by the 32-bit offsets for the entire executable, plus the "overhead" of the higher levels of the table (which contain pointers to the lower levels).
A potential concern is that direct calls (when the target address is constant) would be slower with this approach; however, it can easily be made as fast as the current approach by simply reserving the entry on the multi-level table. Basically, for a constant address N, make sure that all level tables that a lookup for N would access are allocated, and access the level 0 entry pointer directly from the JIT generated code.
Nature of Request:
This not only simplifies the code, but will also improve performance. See the list below for all the benefits.
Why would this feature be useful?
Advantages:
- Greatly simplified code
- The jump table and dynamic table are no longer needed, as the pointers (or offsets) can be stored on the multi-level table directly.
- The PPTC saving and loading process is simpler and faster, as it no longer needs to deal with the aforementioned tables.
- JIT Cache eviction is a lot simpler to implement, as one does not need to worry about updating jump/dynamic tables. However, it is still necessary to ensure that the code is not currently running for safe removal, which is complicated.
- Improves performance of indirect calls with multiple target addresses.
- LowCq functions can also benefit from fast calls (direct or indirect) essentially for free.
Downsides:
- Memory usage might be a little higher than with the current approach; confirming exact numbers requires further measurement.
Issue Analytics
- State:
- Created 2 years ago
- Reactions: 11
- Comments: 6 (6 by maintainers)

That's what I think as well. And it would have the benefit of working with exefs mods and game updates. In fact we discussed this internally a few months ago; I will leave what I said here since it can be useful (the discussion was mostly about replacing PPTC with true AOT, which I think would work better, but has its own issues).
Here is a rough prototype: FICTURE7@a211f09010e2512f75a269b2939d4bc6b0217e28. It leverages .NET 5.0's pinned object heap to allocate the tables (that way we do not have to reserve, allocate and commit memory ourselves).
Some stuff to look into:
- The tables are not initialized with the offset to the stub (like we currently do), because the translation may call or tailcall into it, and so there is a slow path for every translation switch. Haven't looked into it too much, but I guess the stub may decide to call or tailcall? Not sure if this is feasible or if it will play nice with the stack unwinder. Might not even be a big deal, since we need the slow path to initialize the tables anyway, and this slow path should be hit only once. (If left as is, it will produce a lot of LoadFromContext code because of the current SSA limitations.)
- Currently does not do proper call counting (because it relied on the stubs & GetFunctionAddress* to do call counting before), so all code remains in LCQ. Once a function is compiled it is placed in the AddressTable<T> and no call counting happens. I guess we could do that by inserting a call to do the call counting at the beginning of all LCQ functions? That would cancel out the gains to LCQ, since it has to make a managed call for every execution (currently at least 100 times if the incoming branch is hinted for HCQ). Or we could use another, more compact/dense table to store the call counts and do the ShouldRejit() check in the translation itself, then call a tiering up function once it passes the threshold.
- Currently does not work with PPTC.
Currently does not do that: