After all the work carried out during the last months and culminated in the previous update, I took some time to experiment with some different things. I wanted to leave the rendering side for a bit and take a look at how to optimize other parts of the engine.
- Looking online for example code (like SSE Quaternion Multiplication or SSE 4x4 Matrix Inversion)
- Implementing functions for both SSE and NEON
- Unit testing my code for correctness and parity against non SIMD one
- Dealing with compilers details (always_inline, Vector Extensions, alignas)
- Analyzing generated assembly on godbolt.org
- Measuring performance on both PC and Android
At the end the performances were too close to what the compiler alone could achieve with auto-vectorization and instead of pursuing more optimizations I put the task on hold. It was still a fun and learning experience and it allows me to show you my SIMD benchmark utility.
It is powered by ImGUI, it can load and save data from test runs using Lua and then compare them.
After archiving the
simd branch on my repository I jumped on another big project: trying to come up with an ECS implementation for the nCine.
It would flatten the hierarchy of game elements and allow for a faster and possibly multi-threaded approach at updating them.
I have already put together some early tests but before going on I took some days to have a look at some profiling tools. Embedding custom timers and graphs into the engine can be fun, as can be writing ad hoc benchmarks, but for this task I needed something more powerful.
I stumbled upon Tracy, a very capable frame profiler that shows CPU and GPU zones, has Lua instrumentation, annotates allocations and locks and displays call stacks. Instrumentation of user code is easy, capturing data has a very low overhead and works locally or remotely on Linux, Windows, macOS and Android.
The profiler that collects and visualize the data has a super rich ImGui interface with line plots, histograms, colored sections and the ability to compare two captures side by side.
With the integration in place I can now be sure that I will be able to assess any performance change related with big refactorings or optimizations.