
Fast CPU emulator JIT function lookup

Feature Request

What feature are you suggesting?

Overview:

The CPU emulator works by recompiling Arm code into x86 code that can run on the host platform. Evidently, the code needs to be able to quickly find the x86 function corresponding to a given Arm function in the in-memory JIT cache. Currently this uses a dictionary on the Translator class to find those functions. This is reasonably fast, but not as fast as it could be. Furthermore, since the dictionary is a managed object, only accessible through managed functions, it can't be accessed directly from the native x86 JIT generated code.
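
Conceptually, the managed side of that lookup is something like the following (a simplified sketch, not the exact Translator code; TranslatorSketch, TranslatedFunction and Translate are illustrative stand-ins):

using System.Collections.Concurrent;

class TranslatorSketch
{
    private readonly ConcurrentDictionary<ulong, TranslatedFunction> _functions = new();

    // Fast enough from managed code, but JIT generated x86 code cannot reach
    // this dictionary without a native <-> managed transition.
    public TranslatedFunction GetOrTranslate(ulong guestAddress)
    {
        return _functions.GetOrAdd(guestAddress, Translate);
    }

    private TranslatedFunction Translate(ulong guestAddress)
    {
        // Recompile the Arm function at guestAddress into x86 code here.
        throw new System.NotImplementedException();
    }
}

class TranslatedFunction { }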

Getting the x86 function pointer for a given Arm function thus requires calling NativeInterface.GetIndirectFunctionAddress or NativeInterface.GetFunctionAddressWithHint. Doing so is actually very slow, because it requires native <-> managed transitions (a cost paid every time managed code is called from native, and vice versa). On top of that, we pay the cost of a bunch more calls and a dictionary lookup, which is not actually that bad, but can become an issue if this happens often.

In order to minimise the cost of doing this, the translator attempts to cache those function pointers in a small jump table, a small region of memory reserved for pointers. It works like so (see the sketch after this list):

  • For direct calls (when the target address is constant, as when the BL Arm instruction is used), it reserves one entry on the table that will hold the pointer, initially 0. The JIT generated code checks if the value there is 0; if so, it calls GetFunctionAddress, stores the pointer there, and calls it; otherwise it just calls the loaded pointer directly. This avoids the managed call cost from the 2nd run onwards.
  • For indirect calls (when the target address is unknown and comes from a register, as when the BLR Arm instruction is used), it also reserves one entry, on the dynamic table, which holds an address pair. The first address is the Arm function address, while the second one is the host function address. Since it only reserves one entry, it can only cache a single address, so if multiple function addresses are called from the same call site, it will still be doing a lot of managed calls to GetIndirectFunctionAddress, which hurts performance.
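
The equivalent logic of the direct call fast path, written out in C# (illustrative names; the real thing is emitted x86 code, and the NativeInterface stub here only stands in for the managed lookup described above):

static class NativeInterface
{
    // Stub standing in for the real managed lookup.
    public static ulong GetFunctionAddress(ulong guestAddress) => 0;
}

static unsafe class JumpTableSketch
{
    public static void CallDirect(ulong* jumpTableEntry, ulong targetGuestAddress)
    {
        ulong hostFunc = *jumpTableEntry;

        if (hostFunc == 0)
        {
            // First call from this site: pay the managed transition once,
            // then cache the resolved pointer in the jump table entry.
            hostFunc = NativeInterface.GetFunctionAddress(targetGuestAddress);
            *jumpTableEntry = hostFunc;
        }

        // From the 2nd call onwards this is a plain indirect call.
        ((delegate* unmanaged<void>)hostFunc)();
    }
}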

Here, I propose a solution that not only greatly simplifies all this logic, but also improves performance by addressing its shortcomings.

Smaller Details:

The proposed change replaces the dictionary that is currently used for function lookup with a multi-level table. Hash tables are O(1) in the best case and O(n) in the worst case (if the hash function is extremely poor and all elements end up shoved into the same bucket), whereas the multi-level table approach has guaranteed O(1) complexity. However, the biggest benefit is that the multi-level table can be easily accessed from native, JIT generated code, without requiring any loops.

Proposed design:

The Arm MMU can only address at most 48 bits of virtual memory. For this reason, I propose a multi-level table that is very similar to the page tables that the MMU itself supports. We should have 4 levels, with each of the top three levels having 9 bits, and the level corresponding to the least significant bits having whatever remains (48 - 9 * 3 = 21 bits). In order to reduce memory usage, we can take advantage of the fact that Arm instructions are always 4-byte aligned. Thus, the last level can be indexed with the address right shifted by 2, as we know the 2 least significant bits will always be 0. Note that, for Arm32, Thumb code might be 2-byte aligned, so that should be taken into account as well for 32-bit games.

Furthermore, the current JIT cache is limited to 2GB. We can further reduce the size of the table by storing a 32-bit offset rather than a 64-bit function pointer. It would require an extra instruction in the JIT generated code to add the JIT cache base address to the offset, but it halves the amount of memory used by the table, which is a very good compromise in my opinion. This also has the downside of limiting the JIT cache size to around 4GB, but that should be plenty (if we need that much memory for JIT generated code, then we're already screwed anyway).
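
As a rough worked example of what this saves (assuming the 9/9/9/19-bit index split described above; exact numbers would depend on the final split):

const long level0Entries = 1L << 19;          // 524,288 entries per level 0 table
const long level0Span    = 1L << 21;          // each level 0 table covers 2 MiB of guest address space
const long sizeWith64Bit = level0Entries * 8; // 4 MiB per level 0 table with 64-bit pointers
const long sizeWith32Bit = level0Entries * 4; // 2 MiB per level 0 table with 32-bit offsets

So with 32-bit offsets, a fully populated leaf table costs about as much as the guest code it covers (one 4-byte offset per 4-byte instruction), versus twice that with full pointers.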

Below I present a more visual representation of how the function address would be split, and how each field becomes an index into the multi-level table:

|47         |38         |29         |20      2|
+---------------------------------------------+
|  Level 3  |  Level 2  |  Level 1  | Level 0 |
+---------------------------------------------+

Pseudo-code for getting a host function pointer from an Arm guest function address with such a table:

level0 = (armFuncAddress >> 2) & 0x7FFFF; // Bits 0 and 1 are assumed to always be 0, as instructions are 4-byte aligned.
level1 = (armFuncAddress >> 21) & 0x1FF;
level2 = (armFuncAddress >> 30) & 0x1FF;
level3 = (armFuncAddress >> 39) & 0x1FF;

uint**** multiLevelTable; // Should be properly allocated and populated.
hostFuncPointer = jitCacheBasePointer + multiLevelTable[level3][level2][level1][level0];
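
As a minimal sketch of the write side (C#, with managed jagged arrays; MultiLevelTable and Set are assumed names, not the actual implementation), the slow path could lazily allocate each level on first touch and install the offset once a function is translated:

class MultiLevelTable
{
    private readonly uint[][][][] _level3 = new uint[1 << 9][][][];

    public void Set(ulong guestAddress, uint jitCacheOffset)
    {
        int i3 = (int)((guestAddress >> 39) & 0x1FF);
        int i2 = (int)((guestAddress >> 30) & 0x1FF);
        int i1 = (int)((guestAddress >> 21) & 0x1FF);
        int i0 = (int)((guestAddress >> 2) & 0x7FFFF);

        // Lazily allocate each level on first touch, so untouched regions
        // of the 48-bit address space cost nothing.
        var l2 = _level3[i3] ??= new uint[1 << 9][][];
        var l1 = l2[i2] ??= new uint[1 << 9][];
        var l0 = l1[i1] ??= new uint[1 << 19];

        l0[i0] = jitCacheOffset;
    }
}

A real implementation would back the tables with native or pinned memory instead, so the JIT generated code can walk them directly; the lookup then mirrors the pseudo-code above.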

One might also wish to use a different number of levels, or change the number of bits per level (for example, from 9 to 10), in order to improve memory usage; this was just an example. In the worst case, an optimal table should occupy around the same amount of memory as the uncompressed size of the code sections of all executables (the worst case being that absolutely all the code in the executable was executed, which is very unlikely to happen in a single play session). This was calculated from the amount of memory occupied by the 32-bit offsets for the entire executable (one 4-byte offset per 4-byte instruction), plus the “overhead” of the higher levels of the table (which contain pointers to the lower levels).

A potential concern is that direct calls (when the target address is constant) would be slower with this approach; however, it can easily be made as fast as the current approach by simply reserving the entry on the multi-level table. Basically, for a constant address N, make sure that all the level tables a lookup for N would access are allocated, and access the level 0 entry pointer directly from the JIT generated code.
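
A sketch of how that reservation might look, building on the table sketch above (assumes .NET 5 pinned arrays so the entry address is stable; ReservingTable and ReserveEntry are illustrative names):

using System;
using System.Runtime.CompilerServices;

class ReservingTable
{
    private readonly uint[][][][] _level3 = new uint[1 << 9][][][];

    // For a constant target address, allocate every level a lookup would
    // touch and hand the JIT a stable pointer to the level 0 entry, so the
    // emitted code can load the offset with a single memory read.
    public unsafe uint* ReserveEntry(ulong guestAddress)
    {
        var l2 = _level3[(int)((guestAddress >> 39) & 0x1FF)] ??= new uint[1 << 9][][];
        var l1 = l2[(int)((guestAddress >> 30) & 0x1FF)] ??= new uint[1 << 9][];

        // Leaf arrays go on the pinned object heap, so their addresses never move.
        var l0 = l1[(int)((guestAddress >> 21) & 0x1FF)] ??= GC.AllocateArray<uint>(1 << 19, pinned: true);

        return (uint*)Unsafe.AsPointer(ref l0[(int)((guestAddress >> 2) & 0x7FFFF)]);
    }
}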

Nature of Request:

This not only simplifies the code, but will also improve performance. See the list below for all the benefits.

Why would this feature be useful?

Advantages:

  • Greatly simplified code
    • The jump table and dynamic table are no longer needed, as the pointers (or offsets) can be stored on the multi-level table directly.
    • The PPTC saving and loading process is simpler and faster, as it no longer needs to deal with the aforementioned tables.
    • JIT Cache eviction is a lot simpler to implement, as one does not need to worry about updating jump/dynamic tables. However, it is still necessary to ensure that the code is not currently running before it can safely be removed, which is complicated.
  • Improves performance of indirect calls with multiple target addresses.
  • LowCq function calls also benefit from fast calls (direct or indirect) essentially for free.

Downsides:

  • Memory usage might be a little higher than with the current approach; confirmation and exact numbers require further measurement.


Top GitHub Comments

gdkchan commented, Apr 4, 2021

> ngl I do find the PPTC machinery to be a bit awkward in general, and I think it should have been an on-disk cache which can be queried on demand (i.e. when GetOrTranslate misses).

That’s what I think as well. And it would have the benefit of working with exefs mods and game updates. In fact, we discussed this internally a few months ago; I will leave what I said here since it can be useful (the discussion was mostly about replacing PPTC with true AOT, which I think would work better, but has its own issues).

would be so nice to get rid of those pools (with a proper solution, of course). I’m not working on AOT right now, I ended up starting something else (completely unrelated). I mostly wanted to test how long it would take to recompile the entirety of a game (all NSOs), and also identify potential issues with AOT. The main problems are:

  • High memory usage limits the number of threads that can be used for AOT compilation (mainly on systems with low amounts of RAM), but it may also hit performance on any system.
  • Bad function detection due to calls that can’t return means that there will be gaps and it will still JIT compile some functions.

I think the last one is the biggest problem, as the benefits of AOT are diminished if you still have constant JIT compilation.

There are also other improvements that could be implemented with either AOT or PPTC:

  • Single file cache for all games. Benefits: Reduced disk space due to shared code, simpler invalidation in case of update as a single file is deleted.
  • Look up functions by hash. Allows AOT/PPTC to work with modded games, NROs, and homebrew. Also allows re-using common functions on other games, without needing to re-compile and use more disk space. No new cache needed for game updates either. Possible downsides: Taking a bit longer to load all cached functions compared to what we have now.
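
As an illustration of the hash-keyed lookup idea (the hashing scheme, types, and class here are assumptions, not a concrete design):

using System;
using System.Collections.Generic;
using System.Security.Cryptography;

// Key cached translations by a hash of the guest function's code bytes
// instead of by address, so mods and updates that move but do not change
// a function still hit the cache.
class HashKeyedCache
{
    private readonly Dictionary<string, byte[]> _cache = new();

    private static string KeyFor(ReadOnlySpan<byte> guestCode)
    {
        Span<byte> hash = stackalloc byte[32];
        SHA256.HashData(guestCode, hash);
        return Convert.ToHexString(hash);
    }

    public bool TryGet(ReadOnlySpan<byte> guestCode, out byte[] hostCode)
        => _cache.TryGetValue(KeyFor(guestCode), out hostCode);

    public void Add(ReadOnlySpan<byte> guestCode, byte[] hostCode)
        => _cache[KeyFor(guestCode)] = hostCode;
}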

And, as mentioned before, making Ptc non-static is essential to make multi-process work (otherwise, we can only enable PPTC for a single guest process).

FICTURE7 commented, Apr 4, 2021

Here is a rough prototype: FICTURE7@a211f09010e2512f75a269b2939d4bc6b0217e28. It leverages .NET 5.0’s pinned object heap to allocate the tables (that way we do not have to reserve, allocate and commit memory ourselves).
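
For reference, allocating a table on the pinned object heap is a one-liner in .NET 5 (a minimal illustration, not the prototype’s code):

using System;

// The GC never moves arrays allocated with pinned: true, so their addresses
// can be baked into JIT generated code without a GCHandle pin.
uint[] level0 = GC.AllocateArray<uint>(1 << 19, pinned: true);

unsafe
{
    fixed (uint* ptr = level0) // Effectively free: the array is already immovable.
    {
        Console.WriteLine($"Stable table address: 0x{(ulong)ptr:X}");
    }
}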

Some stuff to look into:

  • The tables are not initialized with the offset to the stub (like we currently do), because the translation may call or tailcall into it, and so there is a slow path for every translation switch. Haven’t looked into it too much, but I guess the stub may decide to call or tailcall? Not sure if this is feasible or if it will play nice with the stack unwinder.

    Might not even be a big deal since we need the slow path to initialize the tables anyways, and this slow path should be hit only once. (If left as is, it will produce a lot of LoadFromContext code because of the current SSA limitations.)

  • Currently does not do proper call counting (because it relied on the stubs & GetFunctionAddress* to do call counting before), so all code remains in LCQ. Once a function is compiled it is placed in the AddressTable<T> and no call counting happens.

    I guess we could do that by inserting a call to do the call counting at the beginning of all LCQ functions? That would cancel out the gains to LCQ since it has to make a managed call for every execution (currently at least 100 times if the incoming branch is hinted for HCQ). Or we could use another, more compact/dense table to store the call counts and do the ShouldRejit() check in the translation itself, then call a tiering-up function once it passes the threshold (a rough sketch of this idea follows the list).

  • Currently does not work with PPTC.

  • Currently does not do that:

> A potential concern is that direct calls (when the target address is constant) would be slower with this approach; however, it can easily be made as fast as the current approach by simply reserving the entry on the multi-level table. Basically, for a constant address N, make sure that all the level tables a lookup for N would access are allocated, and access the level 0 entry pointer directly from the JIT generated code.

  • Concurrency and stuff.
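
A rough sketch of the compact call-count table idea from the call counting bullet above (class name and threshold are assumptions):

using System.Threading;

// Dense per-function call counters checked inline by translated code, with
// a managed tier-up call only once the threshold is crossed.
class CallCounters
{
    private const int Threshold = 100; // e.g. tier up from LCQ to HCQ after 100 calls

    private readonly int[] _counts;

    public CallCounters(int functionCount) => _counts = new int[functionCount];

    // Conceptually inlined into each LCQ function's prologue; returns true
    // exactly once, when the counter crosses the threshold.
    public bool ShouldRejit(int functionIndex)
        => Interlocked.Increment(ref _counts[functionIndex]) == Threshold;
}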
