Ideas For a High-Performance System Design
Hey there,
I’ve recently done a bit of work on physics-related tasks, and in the process I gave some thought to how Farseer’s performance could be improved.
Performance Observations
First, some assumptions on certain performance aspects that hopefully are uncontroversial enough to agree on as a precondition:
- The fastest way to process data is to iterate over an array of structs.
- It tends to be faster to bulk-process batches of data in a tight loop than to process one item at a time.
- Parallelization has a synchronization overhead, except where no synchronization is necessary.
- Executing user code has a guard overhead, except where no context-specific guard is necessary.
- There is a non-zero overhead to abstractions like `List<T>` vs. `T[]` when it comes to bulk-processing large amounts of data.
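To illustrate the first two points, here is a minimal, hypothetical sketch (none of these names are Farseer types): bulk-processing a plain struct array in a tight loop keeps the data contiguous in memory and avoids per-item indirection.

```csharp
using System;

// Hypothetical particle data; names are illustrative, not Farseer API.
struct Particle
{
    public float X, Y;
    public float VelX, VelY;
}

static class Demo
{
    // A tight loop over an array of structs: contiguous memory,
    // no per-item virtual dispatch, friendly to the CPU prefetcher.
    static void Integrate(Particle[] particles, float dt)
    {
        for (int i = 0; i < particles.Length; i++)
        {
            particles[i].X += particles[i].VelX * dt;
            particles[i].Y += particles[i].VelY * dt;
        }
    }

    static void Main()
    {
        var particles = new Particle[2];
        particles[0] = new Particle { VelX = 1f };
        Integrate(particles, 0.5f);
        Console.WriteLine(particles[0].X); // 0.5
    }
}
```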
Potential Performance Sinks
Next, here is a list of Farseer system design decisions that potentially drag performance below the theoretical maximum:
- Shape, Joint and Body are all classes stored in lists or similar data structures.
- Users can subscribe to events that are invoked from the middle of the physics simulation.
- User events are invoked callback style, item-by-item.
- Parallelizing physics simulation internally (i.e. not by putting Farseer into its own thread) is complicated because every object may alter the state of every other.
Design Decision Draft for Maximizing Efficiency
Finally, I’ll just throw in some rather radical ideas on how to restructure the overall system to address the above issues:
- Make Body a struct and store all bodies in a World-global (potentially sparse) array.
- Make Joint a struct (generalized) and store all joints in a World-global (potentially sparse) array.
- Make Shape a struct (discriminated union) and store all shapes in a Body-local (dense) array.
- Double-buffer all of the above data within World, so a physics step / update can strictly read-only from buffer A (current frame) and write-only to buffer B (next frame). This enables internal parallelization without synchronization using `Parallel.For` and similar.
- Grant users direct access to all of the above World-global arrays so they can perform efficient bulk operations.
- References between Body, Joint and Shape take the form of `int`-based handles that represent the access index in their world.
- Instead of event callbacks, gather all occurrences (structs!) of “subscribed” events in an array over an update step and deliver them in bulk once the update step is finished. This enables users to do efficient batch-processing and frees Velcro from the need for state guards and their overhead. See issue #8.
- User code that absolutely has to execute within the physics frame (such as collision filters, but be really strict here) is required to be a pure function that receives all required data read-only via parameters and returns a value.
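A rough sketch of how the struct storage, `int` handles and double-buffered `Parallel.For` step could fit together; every type and member name here is hypothetical, and the "solver" is reduced to trivial integration for brevity:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical sketch, not actual Velcro types: bodies as structs in a
// world-global array, referenced by int handles (array indices).
struct Body
{
    public float X, VelX; // 1D for brevity
}

class World
{
    Body[] _current; // read-only during a step (frame N)
    Body[] _next;    // write-only during a step (frame N+1)

    public World(int capacity)
    {
        _current = new Body[capacity];
        _next = new Body[capacity];
    }

    // An int handle is simply the index into the world-global array.
    public ref Body this[int handle] => ref _current[handle];

    // Each iteration reads only _current and writes only its own slot
    // in _next, so Parallel.For needs no locks.
    public void Step(float dt)
    {
        Body[] cur = _current, next = _next;
        Parallel.For(0, cur.Length, i =>
        {
            Body b = cur[i];
            b.X += b.VelX * dt; // stand-in for the real solver
            next[i] = b;
        });
        // Swap buffers: frame N+1 becomes the new current frame.
        (_current, _next) = (_next, _current);
    }
}

static class Demo
{
    static void Main()
    {
        var world = new World(4);
        int handle = 0; // handle into the world-global body array
        world[handle].VelX = 2f;
        world.Step(0.5f);
        Console.WriteLine(world[handle].X); // 1
    }
}
```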
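The event-batching point could look something like the following sketch (again, all names are made up for illustration): occurrences are recorded as plain structs during the step and delivered in bulk afterwards, so no user code ever runs from inside the solver.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: a collision is recorded as a plain struct holding
// the int handles of the two bodies involved.
struct CollisionEvent
{
    public int BodyA, BodyB;
}

class World
{
    readonly List<CollisionEvent> _events = new List<CollisionEvent>();

    // Called internally while solving; no user callback is invoked here,
    // so the solver needs no re-entrancy guards.
    void RecordCollision(int a, int b) =>
        _events.Add(new CollisionEvent { BodyA = a, BodyB = b });

    public void Step()
    {
        _events.Clear();
        // ... run the actual simulation, recording events as they occur ...
        RecordCollision(0, 1); // stand-in for a detected contact
    }

    // After the step, users batch-process the whole frame's events.
    public IReadOnlyList<CollisionEvent> Events => _events;
}

static class Demo
{
    static void Main()
    {
        var world = new World();
        world.Step();
        foreach (var e in world.Events)
            Console.WriteLine($"{e.BodyA} hit {e.BodyB}"); // 0 hit 1
    }
}
```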
As a side effect of improved efficiency, lower overhead and internal parallel execution, the above changes could also make serialization (see issue #6) a lot easier. They would also make pooling (see issue #5) obsolete.
On the downside, this would require users to adopt a new API and potentially be more careful, since fewer safety nets would be in place. There would be a bigger need for good documentation so users are primed for certain aspects of the system. This might also generate a large number of issues that need to be resolved in order to adapt to the new design, but if any of these changes should make it into Velcro, now would likely be a better opportunity to tackle them than later.
I’m totally aware that the above are quite radical suggestions and that I don’t have any idea about most of the internal structures and design decisions of Farseer, so please read it as a collection of ideas by an interested observer 😃
Issue Analytics
- Created 6 years ago
- Comments: 26 (10 by maintainers)
I’ve had the multi-threading discussion more times than I care to count 😃
First of all, it is certainly possible to multi-thread the engine in specific scenarios, but years of experiments have taught us that a general solution hurts the common case. The things that have made a difference were cache coherency and modern CPU features such as prefetching and pipelining. Keep small structures in the CPU cache and you are suddenly running at 2x speed. If I add just a bit of complexity, such as keeping track of locks and having state objects for threads, then performance drops.
The most common case for this engine is not large games with incredibly advanced physics - it is small to medium simulations with 10-500 objects, and if I add multithreading to speed up scenarios where you have 5000 objects, then we slow down the 10-500 objects case by 2x. Not a great trade-off.
In the 5000 objects case, you can certainly find a hotspot like ray-casts and multi-thread it in order to juggle more objects. We just have to remember that more threads do not equal a faster simulation; they equal more simultaneous objects. In my tests, I had a ~20% increase in throughput by multi-threading with 4 cores; a terrible waste, with the remaining resources gone to locking, state copying etc.
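Parallelizing a read-only hotspot like ray-casts could look roughly like this sketch (`CastRay` is a stand-in, not an actual engine method): the world state is only read and each ray writes its own result slot, so no synchronization is required.

```csharp
using System;
using System.Threading.Tasks;

static class Demo
{
    // Stand-in for a real ray-cast query against read-only world state.
    static float CastRay(float originX) => originX * 2f;

    static void Main()
    {
        float[] origins = { 1f, 2f, 3f, 4f };
        float[] hits = new float[origins.Length];

        // Each iteration reads shared state and writes only hits[i],
        // so Parallel.For needs no locks here.
        Parallel.For(0, origins.Length, i => hits[i] = CastRay(origins[i]));

        Console.WriteLine(hits[2]); // 6
    }
}
```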
However, if you create a world and put that into a thread by itself, you suddenly have a 100% increase in throughput, and that scales linearly! So in larger games that actually need threading, it is much easier to copy state manually between the worlds at the macro level than to try to squeeze anything out at the micro level.
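The "one world per thread" approach described above can be sketched as follows (all names hypothetical, with `Step` reduced to a counter): each world runs on its own task with no shared state, and any cross-world state copying happens at the macro level, outside the hot loop.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a physics world; Step is the real solver
// in the actual engine, reduced to a counter here.
class World
{
    public int StepCount;
    public void Step(float dt) => StepCount++;
}

static class Demo
{
    static void Main()
    {
        var worlds = new[] { new World(), new World() };

        // No locks needed: each task touches only its own world.
        Task[] tasks = Array.ConvertAll(worlds, w =>
            Task.Run(() =>
            {
                for (int i = 0; i < 60; i++)
                    w.Step(1f / 60f);
            }));
        Task.WaitAll(tasks);

        // Copy state between worlds here, once per batch of steps.
        Console.WriteLine(worlds[0].StepCount + worlds[1].StepCount); // 120
    }
}
```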
With regards to islands - I did try to multi-thread (MT) them, but keeping track of state across islands again degraded performance for the common case while increasing performance for large simulations. I ended up with a version with A LOT of compiler conditionals to get around it, and I was about to make an “MT edition” and a single-threaded edition, but I figured most people would choose the MT one and then disregard the engine due to its bad performance in PoCs.
That being said, I’m very interested in what people can come up with. I’d love a multi-threaded version of the engine 😃
There’s the tiered JIT issue in CoreCLR. To me, it seems like the biggest blocker for more JIT optimizations is the trade-off between compiling fast and executing fast. A tiered JIT has the potential to work around this in a way. But in any case, that’s kind of future / hypothetical stuff. The other part is extending C# to allow writing more efficient code in the first place.
Anyway - really glad you think some of the above ideas are worth pursuing. I’m currently a bit short on time, but I’ll take an occasional look at this issue (and project) to see how things develop, or whether there’s an opportunity for me to chime in with an occasional comment. And from my side, as a long-term Farseer user, feel free to break as much as you want with Velcro as long as it’s for the better. 👍