Devlog #1: Speed

Hello everyone👋

Hopefully this is the right place, because I want to talk about the technical work that has been going on behind the scenes of V-Optimizer, especially V-Core, over the last few months. Since this is a hobby project, I do not work on it full time and therefore have no fixed update schedule for the program or for these devlogs, so they will appear irregularly.

What has changed in the last few months?

Since my goal is to use V-Core for real-time applications like games, I wrote a GDExtension for Godot 4 that exposes the native C++ API of V-Core one-to-one to GDScript. This makes the full feature set of V-Core available in Godot without any C++ knowledge. To test the wrapper, I wrote an importer in GDScript that imports all supported voxel formats (MagicaVoxel, Goxel, Kenney Shape, Qubicle) as a scene.

That's all well and good, but one thing currently makes V-Core unsuitable for games: its performance, or more precisely the speed of mesh generation.

The problem has been identified. How can it be solved?

The first thing I did was to modify the `cli` program so that it meshes each input file at least ten times, measures the time each run takes, and calculates the average. With that in place, I got the following times on my machine (AMD FX-6300, 6 × 3.5 GHz; yes, I know it's old):

.\cli.exe .\teapot.vox -o *.glb -m greedy 
Time taken: 185 ms
Time taken: 204 ms
Time taken: 189 ms
Time taken: 275 ms
Time taken: 188 ms
Time taken: 192 ms
Time taken: 188 ms
Time taken: 192 ms
Time taken: 194 ms
Time taken: 187 ms
Time taken: 201 ms 
Average 219.5 ms
----------------

So on average it took 219.5 ms to mesh the whole `teapot` model, which ships with MagicaVoxel by default. All of this already runs in parallel on 6 threads. That is not bad for a debug build and for offline meshing (generating the mesh beforehand and importing the file into the game engine of your choice), but for a game it's pretty damn slow.
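The timing loop described above could look roughly like the following. This is a minimal sketch, not the actual `cli` code: the names `time_runs` and `average_ms` are hypothetical, and the work being timed is passed in as a callable. One detail worth checking against the log above: it lists eleven timings that sum to 2195 ms, so the reported 219.5 ms corresponds to dividing by ten, whereas dividing by the number of samples (as this sketch does) would give roughly 199.5 ms.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

// Mean of the recorded samples. Divides by the number of samples actually
// taken, so the result stays correct if the run count changes.
double average_ms(const std::vector<double>& samples) {
    assert(!samples.empty());
    const double total = std::accumulate(samples.begin(), samples.end(), 0.0);
    return total / static_cast<double>(samples.size());
}

// Calls `work` `runs` times, prints each wall-clock duration, and returns
// all durations in milliseconds. `work` stands in for "mesh one input file".
template <typename Work>
std::vector<double> time_runs(Work&& work, int runs) {
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        // steady_clock is monotonic, so it is not affected by clock adjustments.
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        std::printf("Time taken: %.0f ms\n", ms);
        samples.push_back(ms);
    }
    return samples;
}
```

A caller would pass the meshing step as a lambda, for example `average_ms(time_runs([&] { /* mesh the file */ }, 10))`.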

The first optimization was to replace C++'s `std::unordered_map` (the standard library's hash map) with unordered_dense by martinus. This implementation uses open addressing and keeps its entries densely packed in contiguous memory, so lookups are fast. The performance boost was decent, and the map copes well with indexing a large number of vertices.
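To illustrate what "indexing vertices" with a hash map means, here is a small sketch that is not taken from V-Core. The `Vertex`, `VertexHash` and `VertexIndexer` types are hypothetical, and the vertex carries only an integer position. The sketch uses `std::unordered_map` so that it compiles without third-party headers; `ankerl::unordered_dense::map` is intended as a largely API-compatible replacement, so swapping the map type should be a small change.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <unordered_map>
#include <vector>

// Hypothetical vertex: only a position on the integer voxel grid.
struct Vertex {
    std::int32_t x, y, z;
    bool operator==(const Vertex& o) const {
        return x == o.x && y == o.y && z == o.z;
    }
};

// Combines the three coordinate hashes into one value.
struct VertexHash {
    std::size_t operator()(const Vertex& v) const {
        std::size_t h = std::hash<std::int32_t>{}(v.x);
        h ^= std::hash<std::int32_t>{}(v.y) + 0x9e3779b9u + (h << 6) + (h >> 2);
        h ^= std::hash<std::int32_t>{}(v.z) + 0x9e3779b9u + (h << 6) + (h >> 2);
        return h;
    }
};

// Assigns each distinct vertex one index so shared corners are stored once.
class VertexIndexer {
public:
    std::uint32_t index_of(const Vertex& v) {
        assert(vertices_.size() < std::numeric_limits<std::uint32_t>::max());
        const auto next = static_cast<std::uint32_t>(vertices_.size());
        // emplace leaves the map untouched if the vertex is already known.
        const auto result = lookup_.emplace(v, next);
        if (result.second) {
            vertices_.push_back(v);  // first time this vertex is seen
        }
        return result.first->second;
    }

    const std::vector<Vertex>& vertices() const { return vertices_; }

private:
    std::unordered_map<Vertex, std::uint32_t, VertexHash> lookup_;
    std::vector<Vertex> vertices_;
};
```

Every quad the mesher emits would call `index_of` for its four corners, which is why the speed of the map lookup matters for the overall meshing time.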

The next thing I tried in order to gain performance was to adopt the concept of the binary greedy meshing algorithm. The concept is relatively simple: you create a bit mask for each of your chunks and use bit shifting to determine where a face needs to be rendered. For a better and more detailed explanation I recommend this video by tantan and/or this one by Davis Morley. With the help of the bit mask I was also able to increase the chunk size from 16 to 30, which further increased the meshing speed.
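The bit-shifting idea can be shown on a single column of voxels. The sketch below is an illustration under assumptions, not V-Core's code: `ColumnFaces`, `with_voxel` and `visible_faces` are hypothetical names, and one 32-bit integer holds one column along one axis, with bit `i` set when voxel `i` is solid. If one bit at each end of such a column were reserved for the neighbouring chunk's voxels, 30 usable voxels would remain, which would fit the chunk size of 30; that reading is an assumption, the post does not state it.

```cpp
#include <cassert>
#include <cstdint>

// Faces of one voxel column that need to be rendered.
struct ColumnFaces {
    std::uint32_t positive;  // bit i set: voxel i has a visible face toward i + 1
    std::uint32_t negative;  // bit i set: voxel i has a visible face toward i - 1
};

// Marks voxel `i` of the column as solid.
std::uint32_t with_voxel(std::uint32_t column, unsigned i) {
    assert(i < 32);
    return column | (std::uint32_t{1} << i);
}

// A face is visible where a solid voxel borders an empty one. Shifting the
// mask by one bit lines every voxel up with its neighbour, so a single
// AND-NOT answers the question for the whole column at once.
ColumnFaces visible_faces(std::uint32_t solid) {
    ColumnFaces faces;
    faces.positive = solid & ~(solid >> 1);  // neighbour i + 1 is empty
    faces.negative = solid & ~(solid << 1);  // neighbour i - 1 is empty
    return faces;
}
```

For a column with voxels 1, 2, 4, 5 and 6 set, this marks positive faces on voxels 2 and 6 and negative faces on voxels 1 and 4; the interior faces between adjacent solid voxels are never generated.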

How much performance did I gain after all these optimizations?

.\cli.exe .\teapot.vox -o *.glb -m greedy 
Time taken: 242 ms 
Time taken: 241 ms 
Time taken: 246 ms 
Time taken: 244 ms 
Time taken: 238 ms 
Time taken: 243 ms 
Time taken: 244 ms 
Time taken: 242 ms 
Time taken: 244 ms 
Time taken: 280 ms 
Time taken: 282 ms 
Average 274.6 ms
----------------

Well, a little disappointing, isn't it? That's a slowdown of 55.1 ms, the opposite of speed. So was all that work for nothing?

No, it was just a silly mistake. During development I had disabled the multithreaded meshing of the chunks to simplify debugging, so the numbers above show the single-core performance of the greedy mesher (on my machine).
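A switch between single-threaded and multithreaded chunk meshing might be structured like this. It is only a sketch under assumptions: `Chunk`, `Mesh`, `mesh_chunk` and `mesh_all` are hypothetical stand-ins, the placeholder mesher does no real work, and launching one `std::async` task per chunk is the simplest possible scheme rather than what V-Core necessarily does (a fixed pool of worker threads would be the more typical choice).

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

struct Chunk { int voxel_count; };  // hypothetical chunk data
struct Mesh  { int quad_count; };   // hypothetical mesh result

// Placeholder for the real per-chunk mesher.
Mesh mesh_chunk(const Chunk& chunk) {
    return Mesh{chunk.voxel_count * 6};
}

// Meshes all chunks, either one after another (easier to debug) or
// concurrently. Results are returned in the same order as the input.
std::vector<Mesh> mesh_all(const std::vector<Chunk>& chunks, bool parallel) {
    std::vector<Mesh> meshes;
    meshes.reserve(chunks.size());

    if (!parallel) {
        for (const Chunk& chunk : chunks) {
            meshes.push_back(mesh_chunk(chunk));
        }
        return meshes;
    }

    // Chunks are independent of each other, so each one can be its own task.
    std::vector<std::future<Mesh>> pending;
    pending.reserve(chunks.size());
    for (const Chunk& chunk : chunks) {
        pending.push_back(std::async(std::launch::async,
                                     [&chunk] { return mesh_chunk(chunk); }));
    }
    for (std::future<Mesh>& task : pending) {
        meshes.push_back(task.get());  // blocks until this chunk is done
    }
    assert(meshes.size() == chunks.size());
    return meshes;
}
```

Keeping the flag as an explicit parameter, rather than commenting code out, makes it harder to forget that the single-threaded path is active when benchmarking.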

After I re-enabled multithreading, I got the following result:

.\cli.exe .\teapot.vox -o *.glb -m greedy 
Time taken: 127 ms 
Time taken: 126 ms 
Time taken: 127 ms 
Time taken: 127 ms 
Time taken: 137 ms 
Time taken: 129 ms 
Time taken: 132 ms 
Time taken: 129 ms 
Time taken: 129 ms 
Time taken: 130 ms 
Time taken: 131 ms 
Average 142.4 ms
----------------

This is much more pleasing to see: a speedup of 77.1 ms over the original 219.5 ms (about 35% less time), and that in a debug build. There are a few more things to optimize; for example, the index generation for the vertices can be made faster still.

Anyone interested in the project can view it on GitHub. Everything mentioned here can be found on the `performance` branch.

That's all for now. Bye👋