Metaballs 2
Sunday, August 25, 2019

Executable
Source code
Metaballs2.zip (1.6 MB)

Required:
VK_KHR_8bit_storage
Recommended:
VK_NV_mesh_shader
This is a demo exploring the use of the Mesh Shader pipeline, a new compute-like pipeline for rasterization that allows flexible processing and amplification of meshes. This demo is all amplification: there's no input mesh at all, and the only data is a small constant buffer with ball positions and radii. The output geometry is procedurally generated using Marching Cubes. A task shader culls cubes that aren't intersected by the isosurface, and a mesh shader generates the geometry for the intersected cubes. Given that most cubes are either empty space or fully inside the isosurface, this culling gives more than an order of magnitude improvement in performance over using a mesh shader alone.
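To make the culling step concrete, here is a minimal CPU-side sketch in Python of how a cube can be classified as empty, fully inside, or intersected. It is not the demo's shader code. The field function (a sum of r²/d² terms), the iso level of 1.0, the corner ordering, and the names `field`, `cube_case` and `surviving_cubes` are all assumptions made for illustration.

```python
def field(p, balls):
    """Assumed metaball field: sum of r^2 / d^2 over all balls (hypothetical choice)."""
    total = 0.0
    for (cx, cy, cz, r) in balls:
        d2 = (p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2
        total += (r * r) / max(d2, 1e-12)  # clamp to avoid dividing by zero at a ball center
    return total


def cube_case(origin, size, balls, iso=1.0):
    """Return the 8-bit Marching Cubes case: bit i is set if corner i is inside the surface."""
    index = 0
    for i in range(8):
        corner = (origin[0] + size * (i & 1),
                  origin[1] + size * ((i >> 1) & 1),
                  origin[2] + size * ((i >> 2) & 1))
        if field(corner, balls) >= iso:
            index |= 1 << i
    return index


def surviving_cubes(grid_min, grid_extent, n, balls, iso=1.0):
    """Keep only cubes whose corners straddle the isosurface (case is neither 0 nor 255)."""
    size = grid_extent / n
    kept = []
    for z in range(n):
        for y in range(n):
            for x in range(n):
                origin = (grid_min[0] + x * size,
                          grid_min[1] + y * size,
                          grid_min[2] + z * size)
                case = cube_case(origin, size, balls, iso)
                if case != 0 and case != 255:
                    kept.append((x, y, z, case))
    return kept
```

In this sketch, only the cubes returned by `surviving_cubes` would go on to generate triangles. Cases 0 and 255 produce no geometry in Marching Cubes, so skipping them early is what saves the work.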

While a number of optimizations have been employed and performance is a couple of orders of magnitude above the first naive implementation, it's probably possible to squeeze more performance out of it. In particular, I suspect processing several cubes per mesh shader invocation could improve performance. While all lanes get used for the field function evaluation, the number of triangles per invocation is small and the output phase of the shader goes narrow. It may also be beneficial to go wider for the task shader, although the current lane utilization is near optimal.
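A rough back-of-the-envelope model of why the output phase "goes narrow" is sketched below in Python. The 32-lane width is an assumption, not something taken from the demo, and the function name is hypothetical. The limits of 12 vertices and 5 triangles per cube come from the standard Marching Cubes tables.

```python
import math

LANES = 32               # assumed workgroup width (hypothetical, not taken from the demo)
MAX_VERTS_PER_CUBE = 12  # at most one vertex per cube edge
MAX_TRIS_PER_CUBE = 5    # largest triangle count in the standard Marching Cubes tables


def output_utilization(items, lanes=LANES):
    """Fraction of lanes doing useful work when each lane writes one output item per pass."""
    if items == 0:
        return 0.0
    passes = math.ceil(items / lanes)
    return items / (passes * lanes)


# One cube per invocation: even the worst case leaves most lanes idle.
one_cube_tris = output_utilization(MAX_TRIS_PER_CUBE)       # 5 / 32
# Four cubes per invocation: the same phase fills more of the workgroup.
four_cube_tris = output_utilization(4 * MAX_TRIS_PER_CUBE)  # 20 / 32
```

Under these assumptions, a single cube can fill at most 5 of 32 lanes when writing triangles, and typical cubes emit fewer than the maximum. Batching cubes raises that fraction, at the cost of more bookkeeping per invocation.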

This demo will run on any Vulkan-capable GPU with the 8bit_storage extension, using the compute shader fallback path. On GPUs supporting mesh shaders (currently only the NVIDIA RTX series, i.e. Turing) you can toggle between the mesh shader and compute implementations. Performance-wise both paths are roughly equal, with some differences depending on settings. The F1 dialog offers a number of options for comparing the implementations under different loads, including the number of balls, their size, and the grid density.

The compute implementation relies on a conservative memory allocation and could theoretically run out of memory and exhibit dropped geometry, although the allocation has been set high enough that no artifacts have been observed. The allocation needed to support the absolute theoretical maximum is a full 27 GB, which is more than the memory available on high-end GPUs as of this writing in 2019. In practice, the memory needed is far smaller than that.
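The trade-off can be sketched in Python as follows. This is an illustration under stated assumptions, not the demo's actual allocator. `worst_case_bytes` assumes non-indexed triangles and takes a hypothetical cube count and vertex size, so it is not meant to reproduce the 27 GB figure. `BumpAllocator` stands in for a GPU-side atomic counter over a fixed-size buffer.

```python
MAX_TRIS_PER_CUBE = 5  # largest triangle count in the standard Marching Cubes tables


def worst_case_bytes(cube_count, bytes_per_vertex):
    """Upper bound if every cube emitted the maximum number of non-indexed triangles."""
    return cube_count * MAX_TRIS_PER_CUBE * 3 * bytes_per_vertex


class BumpAllocator:
    """Fixed-capacity linear allocator; a CPU stand-in for an atomic counter on the GPU."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cursor = 0
        self.dropped = 0

    def allocate(self, count):
        """Return the start offset for `count` items, or None if the buffer is exhausted."""
        offset = self.cursor
        self.cursor += count  # like an atomic add, this advances even when the request overflows
        if self.cursor > self.capacity:
            self.dropped += count  # this geometry would simply not be drawn
            return None
        return offset
```

A conservative allocation picks a capacity well below the worst case and accepts that `allocate` may return None under extreme load. That outcome corresponds to the dropped geometry mentioned above.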