Antialiasing Raytracing

I wanted to try and see what it looked like to only have a single primary ray for each pixel but to change the offset of the intersection within that pixel randomly every frame. The results have to be seen in motion to properly evaluate, but below is an example of a single frame:

A gif also doesn’t properly show the effect because the framerate is too low, but it can give a reasonable idea:

If you want to see it in action you could download the program. The controls are:

  • Arrow keys to move the block
    • (I to move it away from you and K to move it towards you)
  • WASD to move the camera

The program relies on several somewhat new features, but if you have a GPU that supports ray tracing and you have a newish driver for it then the program will probably work. Although the program does error checking there is no mechanism for reporting errors; this means that if you try to run it but it immediately exits out then it may be because your GPU or driver doesn’t support the required features.

Samples and Filtering

In a previous post where I showed ray tracing the rays were intersecting the exact center of every pixel and that resulted in the exact same kind of aliasing that happens with standard rasterization (and for the same reason). This can be fixed by taking more samples (i.e. tracing more rays) for each pixel and then filtering the results but it feels wasteful somehow to trace more primary rays when the secondary rays are where I really want to be making the GPU work. I wondered what it would look like to only trace a single ray per-pixel but to adjust the sample position differently every frame, kind of similar to how dithering works.

You can see a single frame capture of what this looks like below:

This is displaying linear color values as if they were sRGB gamma-encoded, which is why it looks darker (it is a capture from PIX of an internal render target). Ignore the difference in color luminances, and instead focus on the edges of the cubes (and the edges of the mirror, too, although the effect is harder to see there). Every pixel will only show an intersection with a single object, and if the area around that pixel contains an object’s edge then by changing where the intersection happens in that area sometimes a pixel will show one object and sometimes another object. (Side note: Just to be clear, this effect happens at every pixel and not just edges of objects. If you look closely you might be able to tell that the color gradients also have noise in them. The edges of objects, however, is where the results of these jittered ray intersections are really noticeable.)

In a single frame this makes the edges look bumpy. In a sequence of frames (like when the program is being run) the effect is more noisy; the edges still feel kind of bumpy but the bumpiness changes every frame. I had hoped that it might look pretty good in motion but unfortunately it is pretty distracting and doesn’t feel great.

Even with only a single intersection per pixel, however, it is possible to use samples from neighboring pixels in order to filter each pixel. That means that you could have up to nine samples for each pixel, even though only a single one of those nine samples would actually be inside of the pixel’s area. Let me repeat the preceding screenshot but then also show the version that is using filtering with the neighboring eight pixels so that you can compare the two versions next to each other:

The same bumpiness contours are still there, but now they have been smoothed out. This definitely looks better, but it also makes the edges feel blurry (which makes sense because that is exactly what is happening). The real question is how does it look in motion?

Unfortunately this also didn’t look as good as I had hoped. It definitely does function like antialiasing, which is great, and you no longer see discreet lines of pixels appearing or disappearing as the cubes get closer or further from the camera. At the same time, the noise is more obvious than I had hoped. I think it is the kind of visual effect that some people won’t mind and will stop to notice but that other people will really find objectionable and constantly distracting. Again, if you’re interested, I encourage you to download the program yourself and try it to see what you think!

Is it Usable?

For now I am going to keep it. I like the idea of doing antialiasing “properly” and I think it will be interesting to see how this performs with smaller details (by moving the cubes very far from the camera I can get some sense for this, but I’d like to see it with actual detailed objects). There is also an obvious path for improvement by adding more samples per pixel, although computational expense may not allow that.

I haven’t looked into denoisers beyond a very cursory understanding and so it’s not clear to me whether denoising would make this problem better or whether using this technique would actually inhibit denoisers being able to work well.

I personally kind of like a bit of noise because it makes things feel more analog and real to me, but I realize that this is an individual quirk of mine that is far from universal (as evidenced by the loud complaints I read online when games have added film grain). Although the effect ends up being more distracting than I had hoped (I can’t claim that I would intentionally make it look like this if I had a choice) I still think it could potentially work as kind of an aesthetic with other noise. This will have to be evaluated further, of course, as I add more ray tracing effects with noise and create more complicated scenes with more complicated geometry.

Real-Time Ray Tracing on the GPU

It has been a long-standing goal of mine to program something using the hardware ray tracing capabilities that are now available and I finally have something to show:

I was introduced to ray tracing when I was a student at the University of Utah (my pages for class assignments were up for years and it was fun to go look at the images I had rendered; sadly, they eventually got removed and so I can’t link to them). At the time there was a lot of work being done there developing real-time ray tracers (done in software but not with the help of GPUs) which was really cool, but after I graduated I moved on to the games industry and although I never lost my love of ray tracing I don’t know that I ever did any further personal projects using it because I became focused on techniques being used in games.

When the RTX cards were introduced it was really exciting but I still had never made the time to do any programming with GPU ray tracing until now. It has been both fun and also gratifying to finally get something working.

Reflections

I first got ray tracing working a few weeks ago but I haven’t written a post until now. The problem was that the simple scene I had didn’t look any different from how it did when it was rasterized. Even though I was excited because I knew that the image was actually being generated with rays there wasn’t really any way to show that. The classic (stereotypical?) thing to do with ray tracing is reflections, though, and so I finally implemented that and here it is:

There are two colorful cubes and then a mirror-like wall behind them and a mirror-like floor below them:

I initially just had the wall mirror but I realized that it still showed something that could be done with rasterization (by rendering the scene a second time or having mirror image geometry, which is how older games showed mirrors or reflective floors) and so I added the floor mirror in order to show double reflections which really wouldn’t be seen with rasterization. Here is a screenshot where a single cube (in the lower-right corner) is shown four times because each reflective surface is also reflecting the other reflective surface:

Implementing Hardware Ray Tracing

One of the nice things about ray tracing compared to rasterization is how simple it is. The more advanced techniques used to make compelling images are harder to understand and require complexity to implement, but the basics of generating an image are quite straightforward and pretty easy to understand. I wondered if that would translate into working with the GPU, and I have good news but also bad news:

The Good News

The actual shaders that I wrote are refreshingly straightforward, and the code is essentially the same kind of thing I would write doing traditional CPU software ray tracing. I am using DirectX, and especially when using dynamic resource indexing it is really easy to write code the way that I would want to (in fact, with how I am used to working with shaders (which I have discovered is very outdated) I kept finding myself honestly amazed that the GPU could do what it was doing).

The following is shader code to get the normal at the point of intersection. I don’t know if it will look simple because of my verbose style, but I did change some of my actual code a little bit to try and shorten it for this post hehe:

// Get the renderable data for this instance
const StructuredBuffer<sRenderableData> renderableDataBuffer =
	ResourceDescriptorHeap[g_constantBuffer_perFrame.index_renderableDataBuffer];
const sRenderableData renderableData = renderableDataBuffer[InstanceID() + GeometryIndex()];
// Get the index buffer of this renderable data
const int indexBufferId = renderableData.indexBufferId;
const Buffer<int> indexBuffer = ResourceDescriptorHeap[
	NonUniformResourceIndex(g_constantBuffer_perFrame.index_indexBufferOffset + indexBufferId)];
// Get the vertex indices of this triangle
const int triangleIndex = PrimitiveIndex();
const int vertexCountPerTriangle = 3;
const int vertexIndex_triangleStart = triangleIndex * vertexCountPerTriangle;
const uint3 indices = {
	indexBuffer[vertexIndex_triangleStart + 0],
	indexBuffer[vertexIndex_triangleStart + 1],
	indexBuffer[vertexIndex_triangleStart + 2]};
// Get the vertex buffer of this renderable data
const int vertexBufferId = renderableData.vertexBufferId;
const StructuredBuffer<sVertexData> vertexData = ResourceDescriptorHeap[
	NonUniformResourceIndex(g_constantBuffer_perFrame.index_vertexBufferOffset + vertexBufferId)];
// Get the vertices of this triangle
const sVertexData vertexData_0 = vertexData[indices.x];
const sVertexData vertexData_1 = vertexData[indices.y];
const sVertexData vertexData_2 = vertexData[indices.z];
// Get the barycentric coordinates for this intersection
const float3 barycentricCoordinates = DecodeBarycentricAttribute(i_intersectionAttributes.barycentrics);
// Interpolate the vertex normals
const float3 normal_triangle_local = normalize(Interpolate_barycentricLinear(vertexData_0.normal, vertexData_1.normal, vertexData_2.normal, barycentricCoordinates));

The highlighted lines show where I am looking up data (this is also how I get the material data, although that code isn’t shown). In lines 2-3 I get the list of what I am currently calling “renderable data”, which is just the smallest unit of renderable stuff (currently a vertex buffer ID, an index buffer ID, and a material ID). In line 4 I get the specific renderable data of the intersection (using built-in HSL functions), and then I proceed to get the indices of the triangle’s vertices, and then the data of those vertices, and then I interpolate the normal of each vertex for the given intersection (I also interpolate the vertex colors the same way although that code isn’t shown). This retrieval of vertex data and then barycentric interpolation feels just like what I am used to with a ray tracer.

The following is most of the shader code used for recursively tracing a new ray to calculate the reflection (I have put ... where code is removed to try and remove distractions):

RayDesc newRayDescription;
{
	const float3 currentRayDirection_world = WorldRayDirection();
	{
		const float3 currentIntersectionPosition_world = WorldRayOrigin() + (currentRayDirection_world * RayTCurrent());
		newRayDescription.Origin = currentIntersectionPosition_world;
	}
	{
		const cAffineTransform transform_localToWorld = ObjectToWorld4x3();
		const float3 normal_triangle_world = TransformDirection(normal_triangle_local, transform_localToWorld);
		const float3 reflectedDirection_world = reflect(currentRayDirection_world, normal_triangle_world);
		newRayDescription.Direction = normalize(reflectedDirection_world);
	}
	...
}
const RaytracingAccelerationStructure rayTraceAcceleration = ResourceDescriptorHeap[
	g_constantBuffer_perFrame.index_rayTraceAccelerationOffset
		+ g_constantBuffer_perDispatch.rayTraceAccelerationId];
...
sRayTracePayload newPayload;
{
	newPayload.recursionDepth = currentRecursionDepth + 1;
}
TraceRay(rayTraceAcceleration, ..., newRayDescription, newPayload);
color_material *= newPayload.color;

A new ray is calculated, using the intersection as its origin and in this simple case a perfect reflection for its new direction, and then the TraceRay() function is called and the result used. Again, this is exactly how I would expect ray tracing to be and it was fun to write these ray tracing shaders and feel like I was putting on an old comfortable glove. (Also again, it feels like living in the future that the GPU can look up arbitrary data and then trace arbitrary rays… amazing!)

The Bad News

Unlike the HLSL shader code, however, the C++ application code was not so straightforward. I should say up front that I am not trying to imply that I think the API has anything wrong with it or is poorly designed or anything like that: The way that it is all makes sense, and there wasn’t anything where I thought something was bad and should be changed. Rather, it’s just that it felt very complex and hard to learn how it all worked (and even now when I have it running and have wrapped it all in my own code I find that it is difficult for me to keep all of the different parts in my head).

I think part of the problem for me personally is that it’s not just DirectX ray tracing (“DXR”) that is new to me, but also DirectX 12 (all of my professional experience is with DX11 and the previous console generation which was DX9-ish). I think that it’s pretty clear that learning DXR would have been much easier if I had already been familiar with DX12, and that a large part of what feels like so much complexity is just the inherent complexity of working with low-level GPU stuff, which is new to me.

Still, it seems safe to assume that many people wanting to learn GPU ray tracing would run into the same difficulties. One of the nice things about ray tracing that I mentioned earlier in this post is how simple it is, but diving into it on the GPU at the pure low level that I did is clearly not a good approach for anyone who isn’t already familiar with graphics. One thing that really frustrated me trying to learn it was that the samples and tutorials from both Microsoft and Nvidia that purported to be “hello world”-type programs still used lots of helper code or frameworks and so it was really hard to just see what functions had to be called (one big exception to this was this awesome tutorial which was exactly what I wanted and much more helpful to me for what I was trying to do than the official Microsoft and Nvidia sample code). I think I can understand why this is, though: If someone isn’t trying to write their own engine explicitly from scratch but instead just wants to trace rays then writing shaders is probably a better use of time than struggling to learn the C++ API side.

What’s Next?

Although mirror reflections are cool-ish, the real magic that ray tracing makes possible is when the reflections are spread out. I don’t have any experience with denoising and so it’s hard for me to even predict what will be possible in real time given the limited number of rays that can be traced each frame, but my dream would be to use ray tracing for nice shadows. Hopefully there will be cooler ray tracing images to show in the future!

Problems Building Shaders

I have been working with my own custom build system, and it has been working really well for me while building C++ code. As I have been adding the capability to build shaders, however, I have run into several problems, some of which I don’t have good solutions for. Most of them are curiously interrelated, but the biggest single problem could be summarized as:

The #include file dependencies aren’t known until a shader is compiled, and there isn’t a great mechanism for getting that information back to the build system

This post will discuss some of the individual problems and potential solutions.

How Dependencies are Handled Generally

I have a scheme for creating files with information about the last successful execution of a task. There is nothing clever or special about this, but here is a simple example:

{
	"version":1,
	"hash":"2ade222a1aff526d0d4df0a2f5849210",
	"paths":[
		{"path":"Engine\\Type\\Arithmetic.test.cpp","time":133783428404663533},
		{"path":"temp\\win64\\unoptimized\\intermediate\\jpmake\\Precompiled\\Precompiled_common.c843349982ad311df55b7da4372daa2d.pch","time":133804112386989788},
		{"path":"C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.41.34120\\bin\\Hostx64\\x64\\cl.exe","time":133755106284710903},
		{"path":"jpmake\\IDE\\VS2022\\temp\\x64\\Release\\output\\jpmake\\jpmake.exe","time":133804111340658384}
	]
}

This allows me to record 1) which files a specific target depends on, 2) what the last-modified times of each file was the last time that the task with the target as an output was executed, and 3) a hash value that contains any other information that should cause the task to be executed again if something changes.

This has worked well for all of the different kinds of tasks that I have added so far and it usually corresponds pretty closely what the user has manually specified. The one big exception has been C++ compilation in which case things are more complicated because of #include directives. Rather than require the user to specify individual files that are #included (and then transitive files that those files #include) the build system instead uses a command argument to have MSVC’s compiler create a file with the dependencies and then that file’s contents are added to the dependency file that my build system uses.

This is possible because I have made building C++ a specialized kind of task where I have implemented everything; this means that the Lua interface that the user (me) worries about is designed to be easy-to-use and then the build system takes care of all of the annoying details.

What about Shaders?

My mental goal has been to design the build system generally so that any kind of task can be defined, and that Lua is used for the project files so that any kind of programming can be done to define these tasks. I always knew that I would make some kind of specialized tasks that could be considered common (copying files is the best example of this), but the aspirational design goal would be to not require specialized tasks for everything but instead let the user define them.

The main motivation behind this design goal was so that “assets” could be handled the same way as code. I wanted to be able to build asset-building programs and then use them to build assets for a game, and I didn’t want to have to create specialized task types for each of these as part of the build system (the build system should be able to be used with any project, with no knowledge about my specific game engine).

Building shaders as the first kind of asset type has revealed a problem that is obvious but that I somehow didn’t fully anticipate: How to deal with implicit dependencies, specifically #include directives, when the task type is just an arbitrary command with arguments that the build system has no specialized knowledge of?

Dependency Callback

Like its C++ cl.exe compiler, Microsoft’s dxc.exe shader compiler can output a file with dependencies (e.g. the #include files). My current scheme is to allow the user to specify a callback that is called after a command execution succeeds and which returns any dependencies that can’t be known until after execution; this means that the callback function can parse and extract the data from the dependencies file that dxc.exe outputs and then report that data back to the build system. Here is currently how that can be done for a single shader:

local task_engineShaders = CreateNamedTask("EngineShaders")
local path_input = ResolvePath("$(engineDataDir)shaders.hlsl")
do
	local dxc_directory = g_dxc_directory
	local path_output_noExtension = ResolvePath("$(intermediateDir_engineData)shader_vs.")
	local path_output = path_output_noExtension .. "shd"
	local path_dependenciesFile = path_output_noExtension .. "dep.dxc"
	task_engineShaders:ExecuteCommand{
			command = (dxc_directory .. "dxc.exe"), dependencies = {(dxc_directory .. "dxcompiler.dll"), (dxc_directory .. "dxil.dll")},
			inputs = {path_input},
			outputs = {path_dependenciesFile},
			arguments = {path_input,
					"-T", "vs_6_0", "-E", "main_vs", "-fdiagnostics-format=msvc",
					"-MF", path_dependenciesFile
				},
			postExecuteReturnDependenciesCallback = ReturnShaderDependencies, postExecuteReturnDependenciesCallbackUserData = path_dependenciesFile
		}
	local path_debugSymbols = path_output_noExtension .. "pdb"
	local path_assembly = path_output_noExtension .. "shda"
	task_engineShaders:ExecuteCommand{
			command = (dxc_directory .. "dxc.exe"), dependencies = {(dxc_directory .. "dxcompiler.dll"), (dxc_directory .. "dxil.dll"), path_dependenciesFile},
			inputs = {path_input},
			outputs = {path_output},
			arguments = {path_input,
					"-T", "vs_6_0", "-E", "main_vs", "-fdiagnostics-format=msvc",
					"-Fo", path_output, "-Zi", "-Fd", path_debugSymbols, "-Fc", path_assembly,
				},
			artifacts = {path_debugSymbols},
			postExecuteReturnDependenciesCallback = ReturnShaderDependencies, postExecuteReturnDependenciesCallbackUserData = path_dependenciesFile
		}
end

The first problem that is immediately noticeable to me is that this is a terrible amount of text to just build a single shader. We will return to this problem later in this post, but note that since this is Lua I could make any convenience functions I want so that the user doesn’t have to manually type all of this for every shader. For the sake of understanding, though, the code above shows what is actually required.

The source path is specified on line 2, the compiled output path on line 6, and then the dependencies file path that DXC can output on line 7. Two different commands are submitted for potential execution, on lines 8 and 20, and then line 29 shows how the callback is specified (there is a function and also some “user data”, a payload that is passed as a function argument).

Why are there two different commands? Because DXC requires a separate invocation in order to generate the dependency file using -MF. I found this GitHub issue where this is the behavior that was specifically requested (although someone else later in the issue comments writes that they would prefer a single invocation, but I guess that didn’t happen). This behavior is annoying for me because it requires two separate sub tasks, and it is kind of tricky to figure out the dependencies between them.

Here is what the callback function looks like:

do
	local pattern_singleInput = ":(.-[^\\]\r?\n)"
	local pattern_singleDependency = "%s*(.-)%s*\\?\r?\n"
	local FindAllMatches = string.gmatch
	--
	ReturnShaderDependencies = function(i_path_dependenciesFile)
		local o_dependencies = {}
		-- Read the dependencies file
		local contents_dependenciesFile = ReadFile(i_path_dependenciesFile)
		-- Iterate through the dependencies of each input
		for dependencies_singleInput in FindAllMatches(contents_dependenciesFile, pattern_singleInput) do
			-- Iterate through each dependency of a single input
			for dependency in FindAllMatches(dependencies_singleInput, pattern_singleDependency) do
				o_dependencies[#o_dependencies + 1] = dependency
			end
		end
		return o_dependencies
	end
end

First, the good news: It was pretty easy to write code to parse the dependencies file and extract the desired information, which is exactly what I was hoping for by using Lua.

Unfortunately, there is more bad news than good.

The reason that there is a separate callback and then user data (rather than using the dependencies file path as an upvalue) is so that there can be just a single function, saving memory. I also discovered a way to detect if the function had changed using the lua_dump() function, and this seemed to work well initially. I found, however, that lua_dump() doesn’t capture upvalues (which makes sense after reading about it, because upvalues can change), which means that there is a bug in the callback I show above: If I change the patterns on line 2 or line 3 it won’t trigger the task to execute again because I don’t have a way of detecting that a change happened. In this specific case this is easy to fix by making the strings local to the function (and, in fact that’s probably better anyway regardless of the problem I’m currently discussing), but it is really discouraging to realize that there is this inherent problem; I don’t want to have to remember any special rules about what can or can’t be done, and even worse I don’t want dependencies to silently not work as expected if those arcane rules aren’t followed.

There is another fundamental problem with using a Lua function as a callback: It means that there must be some kind of synchronization between threads. The build system works as follows:

  • Phase 1: Gather all of the info (reading the Lua project file and storing all of the information needed to determine whether tasks need to be executed and to execute them)
    • During this phase all data is mutable, and everything is thus single-threaded
  • Phase 2: Iterate through each task, deciding whether it must be executed and executing it if necessary
    • During this phase all data is supposed to be immutable, and everything can thus be multi-threaded
    • (As of this post everything is still single-threaded because I haven’t implemented different threads yet, but I have designed everything to be thread safe looking towards the future when tasks can be executed on different threads simultaneously)

The problem with using Lua during Phase 2 is that it violates these important design decisions. If Phase 2 were multi-threaded like it is intended to be then there would be a bad bug if two different threads called a Lua function at the same time. I don’t think that the problem is insurmountable: There is a lua_newthread() function that can be used to have a separate execution stack, but even with that there are some issues (e.g. if I assume that all data is immutable then I could probably get away without doing any locking/unlocking in Lua, but I don’t have any way of enforcing that if I allow arbitrary functions to be used as callbacks (a user could do anything they wanted, including having some kind of persistent state that gets updated every time a Lua function is called), which again involves arcane rules that aren’t enforceable but could cause bad bugs).

What I really want is to allow the user to use as much Lua as they want for defining tasks and dependencies in Phase 1, but to never use it again once Phase 2 starts. But, even though that’s the obvious ideal situation, how is it possible to deal with situations where dependencies aren’t known until execution and they must be reported?

Possible Solutions

Unfortunately, I don’t have good solutions yet for some of these problems. I do have a few ideas, however.

Automatic Dependencies

My dream change that would help with a lot of these problems would be to have my build system detect which files are read when executing a task so that it can automatically calculate the dependencies.

Look again at one of the lines where I am defining the shader compilation task:

command = (dxc_directory .. "dxc.exe"), dependencies = {(dxc_directory .. "dxcompiler.dll"), (dxc_directory .. "dxil.dll")},

The user specifies dxc.exe, which is good because that’s the program that must run. But then it’s frustrating that dxcompiler.dll and dxil.dll also have to be specified; I only happen to know that they are required because they come bundled with dxc.exe, but how would I know that otherwise? There are certainly other Windows DLLs that are being used by these three that I don’t have listed here, and why should I have to know that? Even if I took the time to figure it out a single time is it also my responsibility to track this every time that there is a DXC update?

There is a build system named tup that takes care of figuring this out automatically, but although I have looked into this several times it doesn’t seem like there is an obvious way to do this (at least in Windows). I think the general method that would be required is 1) create a new process but with its main thread suspended, 2) hook any functions in that process that are relevant, 3) resume the main thread. The hooking step, though, seems tricky. It also seems difficult to differentiate between file accesses that should be treated as dependencies and ones that are outputs; I think maybe if there is any kind of write permission then it could be treated as not-a-dependency, but I haven’t fully thought that through.

I have looked into this automatic dependency calculation several times and every time given up because it looks like too big of a task. If I were working full time on the build system I think this is what I would do, but since my main goal currently is the game engine it has been too hard for me to justify trying to make this work. Having said that, it seems like the only way to handle these problems that would satisfy me as a user, and so maybe some day I will dive in and try to make it work.

Two Invocations of DXC.exe

Without automatic dependency calculation then it seems required to have two different invocations of dxc.exe. The solution to this from the standpoint of a nice build project file seems, at least, pretty obvious: Write a custom tool with my own command arguments which could internally execute dxc.exe twice, but that would only be run once from the perspective of the build system. Having a wrapper tool like this would also mean that I could create my own command arguments, with a related benefit that the same program and same arguments could be called for any platform (and the wrapper tool could then do the appropriate thing).

Avoiding Lua Callbacks

One idea I have had is that an execute-command task could be required to create its own dependency file. If I am already going to write some kind of shader compiler wrapper program around dxc.exe then I could also have it load and parse the dependency file and output a different dependency file in the format that my build system wants. This has some clear disadvantages because it means a user is expected to write similar wrapper tools around any kind of command that they want to execute, which is obviously a huge burden compared to just writing a Lua function. I think the advantage is that it would avoid all of the potential problems with violating confusing rules about what can or can’t be done in a Lua function in my build system, and that is attractive to avoid silent undetectable bugs, but it certainly doesn’t fit in with the ergonomics that I imagined.

I need to think about this more and hope that I come up with better ideas.

Specialized Shader Tasks

A final solution would be for me to just make a specialized task type for building shaders, just like I have for C++. Then I could write it all in C++ code, try to make it as efficient as possible, and it would accomplish the same end goal of letting me specify configuration and making it cross platform just like a separate “wrapper” program would do, but it would be part of the build system itself (rather than requiring any user of the build system to do the same thing). This avoids the fundamental problems with arbitrary commands but those problems will eventually have to be solved for other asset types; I think it might be worth doing, however, because shaders are such an important kind of asset.

The big reason I hesitate to do this (besides the time that it would take to do) is that there is a lot of custom code that can be built up around shaders. My engine is still incredibly basic and so I have just focused on compiling a hand-written shader source file, but eventually (if I have time) the shader source files will be generated somehow, and there could well be a very involved process for this and for getting reflection data and for generating accompanying C++ code. Knowing this and thinking about past systems that I have built at different jobs I am not sure if it makes sense to spend time making a system focused on dxc.exe that I might not end up using; maybe I will end up with a “wrapper” program anyway that does much more than just compiling shaders.

Update: Improving User API with Lua

I alluded to this in the original post but didn’t give an example. Since Lua is used to define how things get built it is possible to hide many of the unpleasant boilerplate details of a task (e.g. a task to build shaders) behind nice abstractions. The following is an example of something that I came up with:

local task_engineShaders = CreateNamedTask("EngineShaders")
local path_input = "$(engineDataDir)shaders.hlsl"
local path_output_vs, path_output_ps
do
	local entryPoint = "main_vs"
	local shaderType = "vertex"
	local path_intermediate_noExtension = "$(intermediateDir_engineData)shader_vs."
	path_output_vs = path_intermediate_noExtension .. "shd"
	BuildShader(path_input, entryPoint, shaderType, path_output_vs, path_intermediate_noExtension, task_engineShaders)
end
do
	local entryPoint = "main_ps"
	local shaderType = "pixel"
	local path_intermediate_noExtension = "$(intermediateDir_engineData)shader_ps."
	path_output_ps = path_intermediate_noExtension .. "shd"
	BuildShader(path_input, entryPoint, shaderType, path_output_ps, path_intermediate_noExtension, task_engineShaders)
end

This code builds two shaders now (a vertex and a pixel/fragment shader), compared to the single shader shown previously in this post, but it is still much easier to read because most of the unpleasantness is hidden in the BuildShader() function. This function isn’t anything special in the build system but is just a Lua function that I defined in the project file itself. This shows the promise of why I wanted to do it this way, because it makes it very easy to build abstractions with a fully-powered programming language.

Frame Pacing with Simulation and Input

This post expands on ideas that were introduced in the Frame Pacing in a Very Simple Scene post, and that earlier post should probably be read and understood before this one in order to have context into what the profiler screenshots mean.

To skip to the images find the Visualizing in a Profiler section.

To skip to the sample programs find the Example Programs section or download them here.

In the previous post I had used the Tracy profiler to help visualize what my rendering code was doing; specifically I was interested in scheduling when CPU code would generate commands for the GPU to execute to make sure that it was happening when I wanted it to. I raised the question of why not generate GPU commands even earlier to make sure that they were ready in time and discussed why doing so would work with predetermined content (like video, for example), but would be less ideal for an interactive experience (like a game) because it would increase latency between the user providing input and the results of that input being visible on the display.

This post will show the work I have done to implement some user input that can influence the game simulation, and how that interacts with rendering frames for the display.

How to Coordinate Display Frames with Simulation Updates

A traditional display changes the image that it displays at a fixed rate, called the “refresh rate”. As an example, the most common refresh rate is 60 Hz, where “Hz” is the symbol for Hertz, a unit of frequency, which means that a display with a fixed refresh rate of 60 Hz is changing the image that it displays 60 times every second. This means that a program has to generate a new image and have it ready for the display 60 times every second. There is a fixed deadline enabled by something called vsync, and this puts a constraint on how often to render an image and when. (For additional details about dealing with vsync and timing, see the post Syncing without Vsync. For additional details about how to schedule the generation of display frames for vsync see the post that I have already referred to, Frame Pacing in a Very Simple Scene.)

Just because the display is updating at some fixed rate, however, doesn’t mean that the rest of the game needs to update at that same fixed rate, and it also doesn’t mean that the rest of the game should update at that same fixed rate. There is a very well-known blog post that discusses some of the reasons why, Fix Your Timestep!, which is worth reading if you haven’t already. In my own mental model the way that I think about this is that I want to have an interactive experience that is, conceptually, completely decoupled from the view that I generate in order to let the user see what is going on. The game should be able, conceptually, to run on its own with no window, and the player could also pick up a controller and influence that game, and whatever happens should happen the same way regardless of what it would look like to the player if there were a window (and regardless of what refresh rate, resolution, or anything else the player’s display happens to have).

My challenge since the last post, then, has been to figure out how to actually implement this mental model.

Display Loop

I am trying to design my engine to be platform-independent but the only actual platform that I am programming for currently is Windows; this means that behind all of the platform-independent interfaces there is only a single platform-specific implementation and it is for Windows. On Windows there is a very tight coupling between a window and a D3D12 swap chain, and there is also a very tight coupling between any window and the specific thread where it was created. Because of this, my current design is to have what I am calling the “display thread”, and this thread is special because it handles the main window event queue (meaning the events that come from the Windows operating system itself). This is also the thread where I submit the GPU render commands and requests to swap (for the swap chain) and that means that there is a built-in cadence in this thread that matches the fixed refresh rate.

What I needed to figure out was how to allow there to be a separate fixed update rate for the game simulation, which could either be faster or slower than the display refresh rate, and how to coordinate between the two. The strategy that was initially obvious to me was to have a separate thread that handled the simulation updates and then to have the display thread be able to coordinate with this simulation thread. I think that this would work fine (many games have a “game thread” and a “render thread”), but there is an aspect of this design that made me hesitate to implement it this way.

My impression is that over the past several years games have been moving away from having fixed threads with specific responsibilities and instead evolving to be more flexible where “jobs” or “tasks” can be submitted to any available hardware threads. The benefit of this approach is that it scales much better to any arbitrary number of threads. I have read about engines that don’t have a specific “game thread” or “render thread”, and the idea is appealing to me on a conceptual level. Although I don’t have any job system in place yet I feel like this is the direction that I at least want to aspire to, and that meant that I had to avoid my initial idea of having a separate simulation thread with its own update cadence.

Instead, I needed to come up with a design that would allow something like the following:

  • It can be calculated which display frame needs to be rendered next, and what the time deadline is for that to be done
  • Based on that, it can be calculated which simulation updates must happen in order to generate that display frame
  • Based on that, some threads can update the simulation accordingly
  • Once that is done some threads can start to generate the display frame

This general idea gave me a way to approach implementing simulation updates at some fixed rate that coordinated with display updates at some other fixed rate, even though I only have a single thread right now and no job system. Or, said another way, if I could have different multiple simultaneous (but arbitrary and changeable) update cadences in a single thread then it might provide a natural way forward in the future to allow related tasks to run on different threads without having to assign specific fixed responsibilities to specific fixed threads.

Interpolation or Extrapolation?

There is something tricky that I don’t address in the simplified heuristic above: How can it be known which simulation updates must be done before a display frame can be generated? At a high level there are three options:

  1. Make the simulation and display tightly coupled. I already discussed why this is not what I want to do.
  2. Simulate as much as possible and then generate a display frame that guesses what will happen in the future. This involves taking known data and extrapolating what will happen next.
  3. Simulate as much as possible and then generate a display frame of what has already happened in the past. This involves taking known data and interpolating what has already happened.

For months I have been fascinated with #2, predicting what will happen in the future and then extrapolating the rendering accordingly. I am not personally into fighting games but somewhere I heard or read about “rollback netcode” and then started reading and watching as much as I could about it. Even though I have no intention of making a network multiplayer game I couldn’t shake the thought that the same technique could be used in a single-player game to virtually eliminate latency. I ultimately decided, however (and unfortunately), that this wasn’t the way to go because I wasn’t confident in my ability to deal with mispredictions in a way that wouldn’t be distracting visually. My personal priorities are for things to look good visually and less about reducing latency or twitch gaming, which means that if there is a tradeoff to be made it will usually be to increase visual quality. (With that being said, if you are interested I think that my single favorite source that I have found about these ideas is https://www.youtube.com/watch?v=W3aieHjyNvw. Maybe at some point in the future I will have enough experience to try tackling this problem again.)

That leaves strategy #3, interpolation. This means that in order to start generating a frame to display I can only work with simulation updates that have already happened in the past, and that I need to figure out when these required simulation updates have actually happened so that I can then know when it is ok to start generating a display frame.

Below is a capture I made showing visually what I mean by interpolation:

The two squares are updating at exactly the same time (once each second), and they have identical horizontal positions from the perspective of the simulation. The top square, however, is rendering with interpolation such that every frame that is generated for display places the square where the simulation would have been at some intermediate point between discrete updates. This gives the illusion of smooth motion despite the simulation only updating once each second (the GIF makes the motion look less smooth, but you can download and try one of the sample programs to see how it looks when interpolating at your display’s actual refresh rate).

Notice in the GIF how this interpolation scheme actually increases latency: The bottom square shows the up-to-date simulation, and the top square (what the user actually sees) is showing some blended state between the most up-to-date position and then the previous, even older, position. The only time that the user sees the current state of the simulation is at the moment when the simulation is the most out-of-date, immediately before it is about to be updated again. For a slow update rate like in the GIF this delay is very noticeable, but on the opposite extreme if the simulation is updating faster than the display rate then this is less of a problem. You can download and try the sample programs to find out how sensitive you are to this latency and how much of a difference you can personally notice between different simulation update rates.

The (Current) Solution

I have come up with two variables that determine the behavior of a simulation in my engine and how it gets displayed:

  • The (fixed) duration of a simulation update, in milliseconds
  • The (fixed) cutoff time before a display refresh, in milliseconds, when a display frame will start to be generated

This was a surprisingly (embarrassingly!) difficult problem for me to solve especially given how simple the solution seems, and I won’t claim that it is the best solution or that I won’t change the strategy as time goes on and I get more experience with how it works. Let me elaborate on the second point, though, since that was the tricky part (the first point is obvious and so there is not much to discuss).

I knew that I wanted the simulation to run independently of the display. Said another way, the simulation shouldn’t even know or care that there are any frames being displayed, and the framerate of the display should never have any influence on the simulation. Note that this also extends to user input: The same input from a user should have the same effect on the simulation regardless of what framerate that user’s display has.

On the other hand, it seemed like it was necessary to have the display frames have some kind of dependency on the simulation: In order to interpolate between two simulation frames there has to be some kind of offset involved where the graphics system knows how far between two frames it is. My brain kept thinking that maybe the offset needed to be determined by how long the display’s refresh period is, but I didn’t like that because it seemed to go against my goal of different players having the same experience regardless of refresh rate.

Finally I realized that it could be a (mostly) arbitrary offset from the vertical blank. What this represents conceptually is a tradeoff between latency (a smaller time is better) and computational requirements (it must be long enough that the CPU and GPU can actually generate a display frame before the vblank happens). Having it be a fixed offset, however, means that it will behave mostly the same regardless of what the display is doing. (There is still going to be some system-dependent latency: If DWM is compositing then there is an inherent extra frame of latency, and then there is some unpredictable time from when the GPU scans the frame out until it is actually visible on the display which is dependent on the user’s actual display hardware and how it is configured. What this current solution seems to do, however, is to minimize the effects of display refresh rate in the software domain that I have control over.)

Visualizing in a Profiler

In order to understand how the simulation is scheduled relative to rendering we can look at Tracy profile captures. This post won’t explain the basics of how to interpret these screenshots, but you can refer to this previously-mentioned earlier post for a more thorough explanation.

In all of the screenshots in this post my display refresh rate is set to 60 Hz which means that a generated image is displayed for 16.67 milliseconds (i.e. 1000 / 60 ms). I have set the time before the vblank to start rendering to be half of that, 8.33 ms. Note that this doesn’t mean that this time is in any way meaningful or that it is ideal! Instead, I have deliberately chosen this time to make it easier for you as a reader to envision what is happening; the cutoff for when input for simulation frames is no longer accepted is always going to be at the halfway point between display frames.

Remember: The cutoff for user input in these screenshots is arbitrarily chosen to be exactly halfway between display frames.

100 Hz Simulation

This first example has the simulation updating at 100 Hz, which means that each simulation update represents 10 ms (i.e. 1000 / 10):

As in the previous post, the DisplayFrontBuffer zones show when the display is changing, and it corresponds to the unnamed “Frames”, which each take 16.67 ms.

This shows where the simulation frames are happening, and each lasts roughly 10 ms. In the previous two screenshots I have the default display frames selected, but Tracy also allows the other frames to be selected in which case it will show those frames at the top:

Each display frame takes 16.67 ms and each simulation frame takes 10 ms, and since these are not clean multiples the way that they line up is constantly varying. The following facts can help to understand and interpret these profile screenshots:

  • The display frames shown are determined by a thread waiting for a vblank, and so this cadence tends to be quite regular and close to the ideal 16.67 ms
  • The simulation frames shown are determined by whenever a call to Windows’s Sleep() function returns, and more specifically whenever a call to Sleep() returns that is after the required cutoff for input. This means that the durations shown are much more variable, and although helpful for visualization they don’t actually show the real simulation frames
    • Sleep() is notoriously imprecise; I have configured the resolution to be 1 ms, and newer versions of Windows have improved the call to behave more closely to how one might expect (see also here), but it is still probably not the best tool for what I am using it for.
    • For the purposes of this post you just need to keep in mind that the end of each simulation updates shown in the screenshots is conservative: It is always some time after the actual simulation update has ended, but how long after is imprecise and (very) inconsistent.
  • The simulation frames shown don’t include any work done to actually update the simulation. Instead they represent the timeline of logical simulation updates; one way to think of them is the window of time when the user can enter input to influence the following simulation update (although this mental model may also be confusing because it hints at additional latency between the user’s input and that input being used; we will ignore that for this post and focus more the relationship between the simulation and the display but it will be something that I will have to revisit when I work on input).

The blue swap chain texture is currently being displayed (i.e. it is the front buffer), and so the yellow swap chain texture must be written to in order to be displayed next (i.e. it is the back buffer). The purple WaitForSimulationInputCutoff zones show how the 10 ms simulation updates line up with the 16.67 ms display updates. The cutoff that I have specified is 8.83 ms before the vblank, and so the last WaitForSimulationInputCutoff that we can accept for rendering must finish before the halfway point of that blue DisplayFrontBuffer zone. In this example that I have chosen the simulation frame ends a little bit before the halfway point, and then the work to generate a new display frame happens immediately. Here is a zoomed in version of that single display frame to help understand what is happening:

After the purple WaitForSimulationInputCutoff is done the CPU work to record and submit GPU commands to generate a display frame happens. In the particular frame shown in the screenshot there is no need to wait for anything (the GPU finished the previous commands long ago, and the yellow back buffer is available to be changed because the blue front buffer is being shown which means that recording and submission of commands can happen immediately). The GPU also does the work immediately (the little green zone above the blue DisplayFrontBuffer), and everything is ready to be displayed for the next vblank, where the yellow texture can now be swapped to become the front buffer.

The situation in this screenshot is probably the easiest one to understand: The yellow rendering work is generating a display frame about 8.33 ms before it has to be displayed. That display frame will show an interpolation between two simulation frames, and the last of those two simulation frames ended where that yellow rendering work begins.

The yellow rendering work is now at the right of the screenshot (look at Frame 278 at the top to help orient yourself), and there are two purple WaitForSimulationInputCutoff zones highlighted. The yellow rendering work at the right of the screenshot is generating a display frame that contains an interpolation of some point in time between the first WaitForSimulationInputCutoff zone and the second WaitForSimulationInputCutoff zone.

Now the yellow texture is being displayed as the front buffer and the blue texture is the back buffer that must be modified. Because of the way that two different 16.67 ms and 10 ms cadences line up, however, the last simulation frame that can be used for this display frame ends relatively earlier, which means that rendering work can begin relatively earlier (note how the blue CPU work happens earlier in the display frame in the highlighted Frame 279 compared to where the yellow CPU work happened in the preceding Frame 278 that we looked at previously). The way that I think about this is to ask “when will the next simulation frame end?” In this case it would end after the 8.33 ms cutoff (at the midpoint of yellow Frame 279), and so it can’t be used for this display frame.

The ending simulation frame is new, but the beginning simulation frame is the same one that was used as the ending simulation frame for the previous display frame. These examples hopefully give an idea of how the display frames are being generated independently of the cadence of the simulation frames, and how they are showing some “view” into the simulation by interpolating between the two most recently finished simulation frames. The code uses the arbitrary cutoff before the vblank (8.33 ms in these screenshots) in order to decide 1) what the two most recently finished simulation frames are and 2) how much to interpolate between the beginning and ending simulation frames for a given display frame.

When interpolating, how can the program determine how far between the beginning simulation frame and the ending simulation frame the display frame should be? The amount to interpolate is based on how close the display frame cutoff is to the end of the simulation frames. This can be confusing for me to think about, but the key insight is that it is dependent on the simulation update rate. The way that I think about it is that I take the cutoff in question (which, remember, happens after the ending simulation frame) and then subtract the duration of a simulation update from that. Whatever that time is must fall between the beginning and ending simulation update, and thus determines the interpolation amount.

I’m not sure whether to spend more time explaining that or whether it’s obvious to most people and possibly just confusing to me for some reason. I think that what makes it tricky for me to wrap my head around is that the interpolation is between two things that have both happened in the past (whereas if I were instead doing extrapolation with one update in the past and one in the future then it would be much more natural for how my brain tries to think about it). Rather than belaboring the point I will repeat the GIF of the two moving squares, which I think is an easier way to understand visually how the interpolation is working:

Let’s look again at the two simulation frames that each display frame uses, one after the other:

It is hard to say anything very precise about the interpolation amount because of the imprecision of the purple WaitForSimulationInputCutoff zones (because of the inherent imprecision of the Sleep() function), but we can at least make some generalizations using some intuitive reasoning:

  • The earlier that the ending simulation update happens relative to the display frame cutoff, the more the interpolation amount will be biased towards the ending simulation update
  • The closer that the ending simulation update happens relative to the display frame cutoff, the more the interpolation amount will be biased towards the beginning simulation update

With that in mind, in the first screenshot we can say that the yellow display frame work (which will be shown in Frame 279) is going to show something closer to the beginning simulation update (the first one highlighted in green); we can claim this because the ending simulation update ended so close to the 8.33 ms cutoff (halfway between the blue DisplayFrontBuffer zone), and if we subtract the 10 ms simulation update duration it is going to be pretty close to end of the beginning simulation update.

In the second screenshot, by comparison, we can say that the blue display frame work (which will be shown in Frame 280) is going to show more of the ending simulation update (the second one highlighted in green); we can claim this because the ending simulation updated ended quite some time before the 8.33 ms cutoff.

Before moving on to a different simulation update rate let’s look one more time at the original screenshot that shows many display and simulation frames:

This screenshot shows how the two different frame cadences line up at different points. It also shows how the rendering work starts being done as soon as it is possible, but that it isn’t possible until after the end of the last simulation update before the cutoff (again, remember that the cutoff is arbitrary but was chosen to be 8.33 ms so that it is easy to visualize as being at the halfway point of each display frame).

250 Hz

This next example has the simulation updating at 250 Hz, which means that each simulation update is 4 ms (i.e. 1000 / 250):

There are now more simulation updates for every display update (about 4, i.e. 16.67 ms / 4 ms), but otherwise this profile should look familiar if you were able to follow the previous discussion about 100 Hz. One interesting pattern is that the smaller the duration of simulation updates the more consistently closer to the halfway cutoff the display frame render work will be (you can imagine it as taking a limit as simulation updates get infinitesimally small, in which case the display frame render work would conceptually never be able to start early).

A high frequency of simulation updates like this would be the ideal way to run. The visual representation of the simulation would still be limited by the display’s refresh rate but the simulation itself would not be. Of course, the simulation is still constrained by how computationally expensive it is to update. In my example I’m just updating a single position using velocity and so it is no problem, but in a real game the logic and physics will take more time. You can download and try an example program to see how it feels to have the square’s position update at 250 Hz.

(I have also found that there are many more visual glitches/hitches running at this high frequency, meaning that display frames are skipped for various reasons. In my particular program this might be my fault, since the rendering work seems so simple (if I used a busy wait instead of Sleep(), for example, I might get better results), but it generally suggests that it probably isn’t possible to have such low latency with such a high refresh rate in a non-toy program. If I increase the cutoff time to give my code some buffer for unexpectedly long waits then things work much better, even at a high simulation update frequency.)

10 Hz

The next example has the simulation updating at 10 Hz, which means that each simulation update is 100 ms (i.e. 1000 / 10). Notably, this is a case of the opposite situation as the previous example: Rather than having many simulation updates for one display update this example has many display updates for one simulation update:

This example looks quite a bit different from the preceding two because the simulation update duration is so much longer than the display update duration. You can see that there are still purple WaitForSimulationInputCutoff zones, but they only occur when there is an actual need to wait, and that only happens when a simulation update is going to end during a display refresh period.

Note that the simulation update itself is quite long, and the purple WaitForSimulationInputCutoff zone only happens at the end. If a display frame must be generated but the two required simulation updates that it is going to interpolate between have already happened then there is no need to wait for another one.

On the other hand, we do see new waits that are much more obvious than they were in the previous examples. These are the reddish waits, where the most obvious one is WaitForSwap_forSwapQueueSpot. I apologize for the awkward name (I have been trying different things, and I’m not in love with this one yet), but it is waiting for the currently-displayed front buffer to be swapped so that it no longer displayed and can be modified as a back buffer. Rendering happens so quickly that a very large percentage of the display frame is spent waiting until the back buffer texture is available, after which (when the DisplayFrontBuffer color switches) some CPU work and then GPU work is done very quickly, and the waiting starts again (for additional information see the previous post that I have already referenced, although it was just called “WaitForSwap” there). In a real renderer with real non-trivial work to be done there would be less waiting, but that is dependent on how long it takes to record GPU commands (before the vblank) and how long it takes to submit and execute those commands (after the vblank).

Although it is hard to see any details it is also instructive to zoom out more and see what it looks like over the course of several simulation updates:

There should be 6 display frames for every simulation update, and this can be seen in the screenshot.

This 10 Hz rate of simulation update is too slow to be satisfying; you can download a sample program and try it to see what it feels like, but it is pretty sluggish. Still, it is good news that this works because it is definitely possible that a program might update its display faster than its simulation and so trying an extreme case like this is a good test.

60 Hz

This next example has the simulation updating at 60 Hz, which means that each simulation update is 16.67 ms (i.e. 1000 / 60). More notably, it means that the simulation is updating at exactly the same rate as the display:

The display frame cutoff is still 8.33 ms, halfway between vblanks, but since the simulation and display update at the same time the simulation update ends close to a vblank and so the display frame work is able to happen early, right after a vblank.

(Since the simulation is happening on the CPU there is probably some small real-world difference between the CPU clock and the display refresh rate which has its own display hardware clock.. If I let this program run long enough there is probably some drift that could be observed and eventually the simulation updates would probably end closer to the cutoff (and then drift again, kind of going in and out of phase). Relatedly, I could probably intentionally make it not line up when the program starts if I put some work into it; the alignment is most likely an accidental result of how and when I am starting the simulation which is probably coincident with some display frame.)

30 Hz

This next example has the simulation updating at 30 Hz, which means that each simulation update is 33.33 ms (i.e. 1000 / 30), and also that the display updates twice for every single simulation update:

This shows what we would have expected: There are two display frames for every one simulation update, and that means that there is only a purple WaitForSimulationInputCutoff zone in the display frame where a simulation update is ending before the halfway 8.33 ms cutoff.

The previous pattern suddenly changes, and there is an unusually long WaitForSimulationInputCutoff zone. It kind of looks like there was a missed display frame (which can happen, although I haven’t shown it in this post; the program can fail to render a display frame in time and then needs to recover), but if we look at the pattern of alternating blue and yellow in the Main thread this isn’t actually the case.

If you look at Simulation 152 and how it lines up with the work in the Main thread you may be able to figure out what is going on. This is an example of things (expectedly) going out of phase, and the code (correctly) dealing with it. If you compare Simulation 152 to the preceding Simulation 151 and subsequent Simulation 153 you can see that nothing unusual happened, and the simulation updates are happening approximately every 33.33 ms, as expected. What is unusual, however, is that Simulation 152 is ending right before the halfway cutoff for simulation updates, and so that yellow rendering work (for display frame 309) can’t start until Simulation 152 has finished. In the context of this screenshot it looks very scary because there is only 8.33 ms for the entire display frame to be generated, but this is the exact cutoff that I have chosen, meaning that I have told the program “don’t worry, I can do everything I need to in 8.33 ms and so use that as the cutoff for user input”. All of the other display frames in the screenshot have much more time (they have to wait for the front buffer to be swapped before they can even submit GPU commands), but this particular instance is doing exactly what I have told it to do, and is just a curious result of the different clocks of the CPU and display not being exact (which, again, is expected). (Notice that immediately after the yellow work with the tight deadline there is blue work, which puts the graphics system back on the more usual schedule where it is done ahead of time and can wait.)

Example Programs

I have made some example programs that have different simulation update rates but are otherwise identical. You can download them here.

Running one of the sample EXEs will, if it doesn’t crash 🙃, show a window that looks like this:

You can move the square using the arrow keys.

The square always moves in discrete steps, at whatever frequency the EXE filename indicates. Visually, however, the displayed frames use interpolation, as discussed in this post. (There are two alternate versions of the slowest update frequencies that don’t do any interpolation; these can be useful to run and compare with the versions that do use interpolation in order to better understand what is happening and why the slow frequencies feel kind of weird and sluggish, even though they look smooth.)

The cutoff before the vblank is set to 10 ms. This is slightly longer than all of the screenshots used in this post (that used 8.33 ms), but it is still small enough that glitches aren’t unusual at the higher simulation update frequencies. In a real program that is doing serious rendering work more time would probably be needed, but for the purposes of these demonstrations I wanted to try and minimize latency.

These programs can be fun to run and see how sensitive you personally are to latency. Try not to focus on any visual hitches (where e.g. a frame might be skipped), but instead try to concentrate on whether it feels like the square responds instantly to pressing a key or releasing it, or whether it seems like there is a delay. Some things to try:

  • Really quick key presses. I generally use the same direction and then tap a rapid staccato pattern. This is what I personally am most sensitive to where the higher frequencies feel better.
  • Holding down a key so the square moves with some constant velocity and then releasing it. I personally am not bothered as much by a delayed start (I know that some people are) as I am by a delayed stop (where the square keeps moving after I’ve let go of a key).
  • At the lowest frequencies (5 Hz and 10 Hz) move the square close to the edge of the window and then try to tap quickly to move it in and out, meaning where a grey border is visible and when it isn’t. If you do this using the versions without interpolating it becomes pretty easy to intuitively understand how the square is moving in discrete steps (because there is a fixed velocity used with the fixed simulation timesteps), and then if you try the same thing with interpolation you can observe the same thing. Even though it looks smooth the actual behavior is tied to a grid, both spatially and temporally, and that’s a large part of why it feels weird (and why sometimes one key press might feel better or worse than another key press, depending on how much that key press aligns with the temporal “grid”).

Frame Pacing in a Very Simple Scene

I have recently integrated the Tracy profiler into my engine and it has been a great help to be able to visualize how the CPU and GPU are interacting. Even though what is being rendered is as embarrassingly simple as possible there were some things I had to fix that weren’t behaving as I had intended. Until I saw the data visualized, however, I wasn’t aware that there were problems! I have also been using PIX for Windows, NSight, RenderDoc, and gpuview, but Tracy has really been useful in terms of presenting the information across multiple frames in a way that I can customize to see the relationships that I have wanted to see. I thought that it might be interesting to post about some of the issues with screenshots from the profiler while things are still simple and relatively easy to understand.

Visualizing Multiple Frames

Below is a screenshot of a capture from Tracy:

I have zoomed in at a level where 5 full frames are visible, with a little bit extra at the left and right. You can look for Frame 395, Frame 396, Frame 397, Frame 398, and Frame 399 to see where the frames are divided. These frame boundaries are explicitly marked by me, and I am doing so in a thread dedicated to waiting for IDXGIOutput::WaitForVBlank() and marking the frame; this means that a “frame” in the screenshot above indicates a specific frame of the display’s refresh cycle.

There is a second frame visualization at the top of the screen shot where there are many green and yellow rectangles. Each one of those represents the same kind of frames that were discussed in the previous paragraph, and the purple bar shows where in the timeline I am zoomed into (it’s hard to tell because it’s so small but there are 7 bars within the purple section, corresponding to the 1 + 5 + 1 frames visible at the level of zoom).

In addition to marking frames Tracy allows the user to mark what it calls “zones”. This is a way to subdivide each frame into separate hierarchical sections in order to visualize what is happening at different points in time during a frame. There are currently three threads shown in the capture:

  • The main thread (which is all that my program currently has for doing actual work)
  • An unnamed thread which is the vblank heartbeat thread
  • GPU execution, which is not a CPU thread but instead shows how GPU work lines up with CPU work

In order to try and help me make sure that I was understanding things properly I have color coded some zones specifically according to which swap chain texture is relevant. At the moment my swap chain only has two textures (meaning that there is only a single back buffer at any one time and the two textures just get toggled between being the front buffer or back buffer any time a swap happens) and they are shown with DarkKhaki and SteelBlue. In the heartbeat thread the DisplayFrontBuffer zone is colored according to which texture is actually being displayed during that frame (actually this is not true because of the Desktop Windows Manager compositor, but for the purposes of this post we will pretend that it is true conceptually).

I have used the same colors in the main CPU thread to show which swap chain texture GPU commands are being recorded and submitted for. In other words, the DarkKhaki and SteelBlue colors identify a specific swap chain texture, the heartbeat thread shows when that texture is the front buffer, and the main thread shows when that texture is the back buffer. At the current level of zoom it is hard to read anything in the relevant zones but the colors at least give an idea of when the CPU is doing work for a given swap chain texture before it is displayed.

Unfortunately for this post I don’t think that there is a way to dynamically modify the colors of zones in the GPU timeline (instead it seems to be a requirement that are known at compile time) and so I can’t make the same visual correspondence. From a visualization standpoint I think it would be nice to show some kind of zone for the present queue (using Windows terminology), but even without that it can be understood implicitly. I will discuss the GPU timeline more later in the post when things are zoomed in further.

With all of that explanation let me show the initial screenshot again:

Hopefully it makes some kind of sense now what you are looking at!

Visualizing a Single Frame

Let us now zoom in further to just look at a single frame:

During this Frame 326 we can see that the DarkKhaki texture is being displayed as the front buffer. That means that the SteelBlue texture is the back buffer, which is to say that it is the texture that must be modified so that it can then be shown during Frame 327.

Look at the GPU timeline. There is a very small OliveDrab zone that shows work being done on the GPU. That is where the GPU actually modifies the SteelBlue back buffer texture.

Now look at the CPU timeline. There is a zone called RenderGraphicsFrameOnCpu which is where the CPU is recording the commands for the GPU to execute and then submitting those commands (zones are hierarchical, and so the zones below RenderGraphicsFrameOnCpu are showing it subdivided even further). The color is SteelBlue, and so these GPU commands will modify the texture that was being displayed in Frame 395 and that will again be displayed in Frame 397. You may notice that this section starts before the start of Frame 396, while the SteelBlue texture is still the front buffer and thus is still being displayed! In order to better understand what is happening we can zoom in even further:

Compare this with the previous screenshot. This is the CPU work being done at the end of Frame 365 and the beginning of Frame 366, and it is the work that will determine what is displayed during Frame 367.

The work that is done can be thought of as:

  • On the CPU, record some commands for the GPU to execute
  • On the CPU, submit those commands to the GPU so that it can start executing them
  • On the CPU, submit a swap command to change the newly-modified back buffer into the front buffer at the next vblank after all GPU commands are finished executing
  • On the GPU, execute the commands that were submitted

It is important that the GPU doesn’t start executing any commands that would modify the SteelBlue swap chain texture until that texture becomes the back buffer (and is no longer being displayed). The WaitForSwap zone shows where the CPU is waiting for the swap to happen before submitting the commands (which triggers the GPU to start executing the commands). There is no reason, however, that the CPU can’t record commands ahead of time, as long as those commands aren’t submitted to the GPU until the SteelBlue texture is ready to be modified. This is why the RenderGraphicsFrameOnCPU zone can start early: It records commands for the GPU (you can see a small OliveDrab section where this happens) but then must wait before submitting the commands (the next OliveDrab section).

How early can the CPU start recording commands? There are two different answers to this, depending on how the application works. The simple answer (well, “simple” if you understand D3D12 command allocators) is that recording can start as soon as the GPU has finished executing the commands that were previously submitted that were saved in the memory that the new recording is going to reuse. There is a check for this in my code that is so small that it can only be seen if the profiler is zoomed in even further

The reason that this wait is so short is because the GPU work being done is so simple that it reached the submitted swap long before the CPU checked to make sure.

Do you see that long line between executing the GPU commands and then recording new ones on the CPU? With the small amount of GPU work that my program is currently doing (clearing the texture and then drawing two quads) there isn’t anything to wait for by the time I am ready to start recording new commands.

If you’ve been following you might be asking yourself why I don’t start recording GPU commands even sooner. Based on what I’ve explained above the program could be even more efficient and start recording commands as soon as the GPU was finished executing the previous commands, and this would definitely be a valid strategy with the simple program that I have right now:

This is a capture that I made after I modified my program to record new GPU commands as soon as possible. The WaitForPredictedVblank zone is gone, the WaitForGpuToReachSwap zone is now visible at this level of zoom, and the WaitForSwap zone is now bigger. The overlapping of DarkKhaki and SteelBlue is much more pronounced because the CPU is starting to work on rendering a new version of the swap chain texture as soon as that swap chain texture is displayed to the user as a front buffer (although notice that the commands still aren’t submitted to the GPU until after the swap happens and the texture is no longer displayed to the user). Based on my understanding this kind of scheduling probably represents something close to the ideal situation if 1) a program wants to use vsync and 2) knows that it can render everything fast enough within one display refresh and 3) doesn’t have to worry about user input.

The next section explains what the WaitForPredictedVblank is for and why user input makes the idealized screenshot above not as good as it might at first seem.

When to Start Recording GPU Commands

Earlier I said that there were two different answers to the question of how early the CPU can start recording commands for the GPU. In my profile screenshots there is a DarkRed zone called WaitForPredictedVblank that I haven’t explained yet, but we did observe that it could be removed and that doing so allowed even more efficient scheduling of work. This WaitForPredictedVblank zone is related to the second alternate answer of when to start recording commands.

My end goal is to make a game, which means that the application is interactive and can be influenced by the player. If my program weren’t interactive but instead just had to render predetermined frames as efficiently as possible (something like a video player, for example) then it would make sense to start recording commands for the GPU as soon as possible (as shown in the previous section). The requirement to be interactive, however, makes things more complicated.

The results of an interactive program are non-deterministic. In the context of the current discussion this can be thought of as an additional constraint on when commands for the GPU can start being recorded, which is so simple that it is kind of funny to write out: Commands for the GPU to execute can’t start being recorded until it is known what the commands for the GPU to execute should be. The amount of time between recording GPU commands and the results of executing those commands being displayed has a direct relationship to the latency between a user providing input and the user seeing the result of that input on a display. The later that the contents of a rendered frame are determined the less latency the user will experience.

All of that is a long way of explaining what the WaitForPredictedVblank zone is: It is a placeholder in my engine for dealing with game logic and simulation updates. I can predict when the next vblank is (see the Syncing without VSync post for more details), and I am using that as a target for when to start recording the next frame. Since I don’t actually have any work to do yet I am doing a Sleep() in Windows, and since the results of sleeping have limited precision I only sleep until relatively close to the predicted vblank and then wait on the more reliable swap chain waitable object (this is the WaitForSwap zone):

(Side note: Being able to visualize this in the instrumented profile gives more evidence that my method of predicting when the vblank will happen is pretty reliable, which is gratifying.)

The next step will be to implement simulation updates using fixed timesteps and then record GPU commands at the appropriate time, interpolating between the two appropriate simulation updates. That will remove the big WaitForPredictedVblank, and instead there will be some form of individual simulation updates which should be visible.

Conclusion

If you’ve made it this far congratulations! I will show the initial screenshot one more time, showing the current state of my engine’s rendering and how work for the GPU is scheduled, recorded, and submitted:

There is a follow-up post about adding simulation frames here.

Syncing without VSync

I don’t remember how many years ago it was when I first read about doing this, but it was probably the following page which introduced me to the idea: https://blurbusters.com/blur-busters-lagless-raster-follower-algorithm-for-emulator-developers/. I am pleased to be able to say that I have finally spent the time to try and implement it myself:

How it Works

An application can synchronize with a display’s refresh cycle (instead of using vsync to do it) if two pieces of data are known:

  • The length of a single refresh (i.e. the refresh rate)
    • This is a duration
  • The time when a refresh ends and the next one starts (i.e. the vblank)
    • This is a repeating timestamp (a moment in time)

If both of these are known then the application can predict when the next refresh will start and update the texture that the graphics card is sending to the display at the appropriate time. If the graphics card changes the texture at the wrong time then “tearing” is visible, which is a horizontal line that separates the old previous texture above it and the new texture below it. (This Wikipedia article as a simulated example image of tearing.)

The texture that is active that the graphics card is sending to the display is called the “front buffer”. The texture that isn’t visible but that the application can generate before it is activated and being sent to the display is called the “back buffer”. There is different terminology for the act of making a back buffer into a front buffer but this post will call it “swapping”, and conceptually it can be thought of as treating the former back buffer as the new front buffer while what was the front buffer becomes a back buffer.

(What vsync does is take care of swapping front and back buffers at the appropriate time. If the application finishes generating a new texture in the back buffer and submits it in time then the operating system and graphics card will work together to make sure that it becomes the new front buffer for the display during the vblank. If the application doesn’t submit it in time and misses the vblank then nothing changes visually: The display keeps showing the old front buffer (meaning that a frame is repeated) and there is no tearing).

Why does swapping the front and back buffers at the wrong time cause tearing? Because even though swapping the front and back buffers happens instantaneously in computer memory the change from one frame to another on a display doesn’t. Instead the display updates gradually, starting at the top and ending at the bottom. Although this isn’t visible to a human eye the effect can be observed using a slow motion camera:

With that in mind it is possible to understand what I am doing in my video: I wrote a program that is manually swapping front and back buffers four times every refresh period in order to intentionally cause tearing. I am not actually rendering anything but instead just doing a very simple clearing of the entire back buffer to some single color, but by swapping a single color back buffer at the correct time in the middle of a display’s refresh I can change the color somewhere in the middle of the display.

Doing this isn’t particularly new and my results aren’t particularly impressive, but I was happy to finally find the time to try it.

Here is a bonus image where I wasn’t trying to time anything and instead just swapped between alternating red and green as quickly as I was able to:

Implementation Details

The rest of this post contains some additional technical information that I encountered while implementing this. Unlike the preceding section which was aimed at a more general audience the remainder will be for programmers who are interested in specific implementation details.

How to Calculate Time

Calculating the duration of a refresh period isn’t particularly difficult but it’s not sufficient to simply use the reported refresh rate. Although the nominal refresh rate would be close it wouldn’t exactly match the time reported by your CPU clock, and that’s what matters because that’s what you’ll be using to know when to swap buffers. In order to know what the refresh rate is in terms of the CPU clock an average of observed durations must be made. In order to calculate a duration it is necessary to keep track of how much time has elapsed between consistently repeating samples, but it doesn’t actually matter where in the refresh cycle these samples come from as long as they are consistently taken from the same (arbitrary) point in the refresh cycle. So, for example, timing after IDXGISwapChain::Present() returns (with a sync interval of 1 and a full present queue) would work, and timing after IDXGIOutput::WaitForVBlank() returns would also work.

It is more difficult, however, to calculate when the vblank actually happens.

DXGI_FRAME_STATISTICS

I finally settled on using IDXGISwapChain::GetFrameStatistics(). Using this meant that I was relying on DXGI instead of taking my own time measurements, but the big attraction of doing that is that the timestamps were then tied directly to discrete counters. Additionally, as a side benefit, after a bit of empirical testing it seemed like the sampled time in the DXGI frame statistics was always earlier than the time that I could sample myself, and so it seems like it is probably closer to the actual vblank than anything that I knew how to measure.

(The somewhat similar DwmGetCompositionTimingInfo() did not end up being as useful for me as I had initially thought. Alternatively, D3DKMTGetScanLine() seems like it could, in theory, be used for even more accurate results, but it wasn’t tied to discrete frame counters which made it more daunting. If my end goal had been just this particular demo I might have tried using that, but for my actual game engine renderer it seemed like IDXGISwapChain::GetFrameStatistics() would be easier, simpler and more robust.)

The problem that I ran into, however, is that I couldn’t find satisfactory explanations of what the fields of DXGI_FRAME_STATISTICS actually mean. I had to spend a lot of time doing empirical tests to figure it out myself, and I am going to document my findings here. If you found this post using an internet search for any of these DXGI_FRAME_STATISTICS-related terms then I hope this explanation saves you some time. (Alternatively, if you are reading this and find any mistakes in my understanding then please comment with corrections both for me and other readers.)

My Findings

The results of IDXGISwapChain::GetFrameStatistics() are a snapshot in time.

If you call IDXGISwapChain::GetLastPresentCount() immediately after IDXGISwapChain::Present() you will get the correct identifier for the present call that you just barely made, and this is very important to do in order to be able to correctly associate an individual present function call that you made with the information in the DXGI frame statistics (or, at least, it is conceptually important to do; you can also just keep track yourself of how many successful requests to present have been made).

On the other hand, if you call IDXGISwapChain::GetFrameStatistics() immediately after IDXGISwapChain::Present() there is no guarantee that you will get updated statistics (and, in fact, you most likely won’t). Instead, there is some non-deterministic (for you) moment in time after calling IDXGISwapChain::Present() where you would eventually get statistics for that specific request to present in the results of a call to IDXGISwapChain::GetFrameStatistics().

How do you know if the statistics you get are the ones that you want? You know that they are the ones that you want if the PresentCount field matches the value you got from IDXGISwapChain::GetLastPresentCount() after IDXGISwapChain::Present(). (UPDATE! The previous sentence, unfortunately, is not always true. See the end of this section for more information.) Once you call IDXGISwapChain::GetFrameStatistics() and get a PresentCount that matches the one that you’re looking for then you know two things:

  • The statistics that you now have refer to the known state of things when your submitted request to present (made by your call to IDXGISwapChain::Present() ) was actually presented
  • The statistics that you now have will not be updated again for your specific present request. What you now have is the snapshot that was made for your PresentCount , and no more snapshots will be made until another call to IDXGISwapChain::Present() is made (which means that the next time the statistics get updated they will be referring to a different PresentCount from the one that you are currently interested in).

Once you have a DXGI_FRAME_STATISTICS that is a snapshot for your specific PresentCount the important corresponding number is PresentRefreshCount. This tells which refresh of the display your request to present was actually presented during. If vsync is enabled PresentRefreshCount is the refresh of the display when your request to present was actually presented.

Once you have that information you can, incidentally, detect whether your request to present actually happened when you wanted and expected it to. This is described at https://learn.microsoft.com/en-us/windows/win32/direct3ddxgi/dxgi-flip-model#avoiding-detecting-and-recovering-from-glitches, in the “to detect a glitch” section. Although the description of what PresentCount and PresentRefreshCount is confusing to me in that document (and in other official documentation) the description of how to detect a glitch is consistent in my mind with how I have described these fields above, which helps to give me confidence that my understanding is probably correct.

Once you know the information above you can now potentially get timing information. The SyncRefreshCount refers to the same thing as PresentRefreshCount (it is a counter of display refresh cycles), and so it may be confusing why two different fields exist and what the distinction is between the two. PresentRefreshCount is, as described above, a mapping between PresentCount and a display refresh. SyncRefreshCount, on the other hand, is a mapping between the value in SyncQPCTime and a display refresh. The value in SyncQPCTime is a timestamp corresponding to the refresh in SyncRefreshCount. If SyncRefreshCount is the same as PresentRefreshCount then you know (approximately) the time of the vblank when your PresentCount request was actually displayed. It is possible, however, for SyncRefreshCount to be different from PresentRefreshCount, and that is why both fields are in the statistics struct.

To repeat: Information #1 is which display refresh your request was actually displayed in (comparing PresentCount to PresentRefreshCount) and information #2 is what the (approximate) time of a vblank for a specific refresh was (comparing SyncQPCTime to SyncRefreshCount). Derived information #3 is what the (approximate) time of a vblank was for the refresh that your request was actually displayed in.

(Side note: The official documentation here and here is very intentionally vague about when SyncQPCTime is actually measured. The driver documentation here, however, says “CPU time that the vertical retrace started”. I’m not sure if the more accessible user-facing documentation is intentionally vague to not be held accountable for how accurate the timing information is, or if the driver documentation is out-of-date. This post chooses to believe that the time is supposed to refer to the beginning of a refresh, with the caveat that I may be wrong and that even if I’m not wrong the sampled time is clearly not guaranteed to be highly accurate.)

One final thing to mention: A call to IDXGISwapChain::GetFrameStatistics() may return DXGI_ERROR_FRAME_STATISTICS_DISJOINT. One thing to note is that the values in PresentRefreshCount and SyncRefreshCount are monotonically-increasing and, specifically, they don’t reset even when the refresh rate changes. The consequence of this is that the DXGI_ERROR_FRAME_STATISTICS_DISJOINT result is very important for determining timing (like this post is concerned about). If you record the first PresentRefreshCount reported in the first successful call after DXGI_ERROR_FRAME_STATISTICS_DISJOINT was returned then you have a reference point for any future SyncRefreshCounts reported (until the next DXGI_ERROR_FRAME_STATISTICS_DISJOINT). Specifically, you know how many refresh cycles have happened with the current refresh rate.

Update: The documentation for the DXGI frame statistics has the following message which had confused me: “Note The number of times that an image was presented to the monitor is not necessarily the same as the number of times that you called IDXGISwapChain::Present or IDXGISwapChain1::Present1.” I didn’t understand how this could be the case, but thought that maybe it was referring to a situation where you called Present() and it failed. Unfortunately, I have since experienced the issue myself with a legitimate use case and no errors: If Present() is called with DXGI_PRESENT_RESTART such that some successfully-queued present is discarded then it can introduce a discrepancy into the counts! A call to IDXGISwapChain::GetLastPresentCount() will still return the number of successful calls to Present() (i.e. it is always monotonically-increasing), but the PresentCount reported in the frame statistics only refers to how many of the requested presents were actually presented (i.e. it doesn’t skip any counts when things in the present queue get discarded). This is frustrating because I don’t know of a way to know whether a present count is actually canceled or not when using DXGI_PRESENT_RESTART (it probably depends on when the call to Present() with that flag happens and how long it takes). If anyone reading this knows how to figure this out for sure please comment!

How to Calculate Refresh Period

Calculating the refresh period using SyncRefreshCount and SyncQPCTime is not difficult: Average the elapsed time between the sampled timestamps of refreshes. I am using the incremental method described here: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford’s_online_algorithm. This is easy to calculate and doesn’t require any storage beyond the current mean and the sample count. It can have problems with outliers or if the duration changes, though, and although I don’t anticipate either of those being issues it remains to be seen.

How to Predict VBlanks

I did some thinking about how to do this, and after some internet searching (I am not a numerical methods expert and so it took me a while to even figure out the correct search terms for what I was thinking) I found a really nice post about how to do exactly what I wanted, an incrementally-updating method for calculating the line that is the least-squares best fit for a bunch of sample points: https://blog.demofox.org/2016/12/22/incremental-least-squares-curve-fitting/. I liked this because it was a match for the incremental average function I was using, and since refresh cycles should happen regularly I figured that I could use SyncRefreshCount as the independent variable and SyncQPCTime as the dependent variable and then have a really computationally cheap way of predicting the time for any given refresh (in the past or in the future).

The good news is that this worked really well after some tweaking of my initial naive approach! The bad news is that the changes that I had to make in order to make it work well made me nervous about whether it would continue to perform well over time.

The big problem was losing precision. The SyncRefreshCount are inherently big numbers, but I already had to do some normalizing anyway so that they started at zero (see discussion above about DXGI_ERROR_FRAME_STATISTICS_DISJOINT) and so that didn’t seem so bad. SyncQPCTime, however, are also big numbers. The same trick of starting at zero can be used, and I also represented them as seconds (instead of the Windows high performance counts) and this helped me to get good results. I was worried about the long-term viability of this, however: Unlike the incremental method for the average this method required squaring numbers and multiplying big numbers together, and these numbers would constantly increase over time.

Even though I was quite happy with finding an algorithm that did what I had thought of, once I had implemented it there was still something that bothered me: I was trying to come up with a line equation, where the coefficients are the slope and the y-intercept. I already knew the slope, though, because I had a very good estimate of the duration of a refresh. In other words, I was solving for two unknowns using a bunch of sample points, but I already knew one of those unknowns! What I really wanted was to start with the slope and then come up with an estimate of the y-intercept only, and so it felt like the method I was using should be constrainable even more.

With that in mind I eventually came up with what I think, in hindsight, is a better solution even aside from precision issues. I know the “exact” duration between every vblank (we will conceptually consider that to be known, even though it’s an estimate), and for each reported sample I know the exact number of refreshes since the initial starting point (which is a nice discrete integer value), and then I know the approximate sampled time, which is the noisy repeated sample data I am getting that I want to improve in order to make predictions. What I can do, then, is to calculate what the initial starting time (for refresh count 0) would be, and incrementally calculate an average of that. This gives me the same cheap way of calculating the prediction (just a slope (refresh period) and a y-intercept (this initial timestamp)), but also a cheap way of updating this estimate (using the same incrementally-updating average method that I discussed above). And, even better, I can update it every frame without worrying about numerical problems. (Eventually with enough sample counts there will be issues with the updated value being too small, but that won’t impact the accuracy of the current average if we assume that it is very accurate by then.)

This means that I don’t have to spend time initially waiting for the averages to converge; instead I can just start with single sample that is already a reasonably good estimate and then proceed with normal rendering, knowing that my two running averages will keep getting more accurate over time with every new DXGI frame statistic that I get with new SyncRefresh information.

Type Trait Coding Style

One of my goals while creating a custom game engine is to avoid using the C++ standard library. This post won’t discuss why, but one consequence of that goal that I have run into during this early phase of the project is that I have had to recreate some fundamental machinery that the standard library provides. By “recreate” I don’t mean that I have figured things out myself starting from nothing but first principles, but instead that I have had to look at existing documentation and implementations to understand how something works and then reimplement it in my own style. It has been fun to learn a bit more about how this corner of C++ metaprogramming works; although I have used these features (some of them frequently) I have only vaguely understood how they might have actually been implemented.

Doing this has also had the interesting consequence of presenting me with challenges to my existing coding style and I have had to expand and adapt. This post gives some examples of changes to my coding style that I have been experimenting with in order to accommodate type traits.

Existing Style

My current style uses a lowercase single letter prefix to indicate type names, meaning that if an identifier is spelled e.g. jSomething or pSomething the reader can immediately know that those names identify types, even without knowing what the j or p might mean (and, to be clear, those are just examples, and neither the j nor the p prefix exists (yet)).

Class names start with a c prefix:

class cMyClass;
class cSomeOtherClass;

Struct names start with an s prefix, base class names start with an i prefix (for “interface”, where my goal is that leaf classes are the only ones that are ever instantiated), and enumeration names start with an e prefix:

// When I use a struct instead of a class it means:
//	* The member variables are public instead of private
//	* The member variables use a different naming scheme
struct sMyStruct;
// The reader can tell that the following is a base class
// (and conceptually an abstract base class)
// just from the type name alone:
class iMyBaseClass;
// Using this enum convention makes it
// less annoying to use scope enumerations
// because the prefix makes it instantly identifiable
// and so the identifiers can be chosen accordingly:
enum class eMyState
{
	On,
	Off,
};

Type names start with a t prefix, e.g.:

// Creating a type alias:
using tMySize = int;
// In templates:
template<typename tKey, typename tValue>
class cMyMap;

(I don’t like the common convention of using single letters (e.g. T and U) in templates, and find that it makes code much harder to read for me personally, similarly to how I feel about single letter variable names. This strongly-held conviction has been challenged in some cases by working with type traits, which I discuss below.)

Type Names

Some of the type trait information that I have needed are expressions that are types. The standard uses a _t suffix for this, which is a helper type of a struct, but I have never loved this convention. In my code so far I have used a t prefix for cases like, and hidden the associated struct in a details namespace:

// tNoConst -> Type without a const qualifier
namespace Types::detail
	{
		template<typename T> struct sNoConst { using type = T; };
		template<typename T> struct sNoConst<const T> { using type = T; };
	}
template<typename T>
using tNoConst = typename Types::detail::sNoConst<T>::type;

// An example of this being used:
const int someConstVariable = 0;
tNoConst<decltype(someConstVariable)> someMutableVariable;
someMutableVariable = 0;

This use of the t prefix is not really any different from how I had already named types, and I find that it fits in naturally when used in code. Eagle-eyed readers may notice, however, that I am using a single T as a type, which I had mentioned is something that I strongly dislike and claimed that I don’t do!

In all of my previous template programming I have always been able to come up with some name that made sense in context, even if it was sometimes generic like tValue. While working on these type traits, however, I realized that there are cases (like shown above) where the type really could be any type. I considered tType, but that seemed silly. I considered tAny, and I might still end up changing my mind and refactoring the code to that (or something similar). For now, though, I have capitulated and am using just the T for fully generic fundamental type trait building blocks like the ones discussed in this post (in other code, though, I still intend to strictly adhere to my give-the-type-a-name rule).

Value Names

Some of the type trait information that I have needed are expressions that are values. The standard uses a _v suffix for this, but I have never loved this convention. In fact, I’m not sure that I really understand this convention; unlike with _t where there needs to be some underlying struct for the metaprogramming implementation to work it doesn’t seem like values need this (at least, the ones that I have recreated so far haven’t needed an underlying struct).

I did struggle a bit with how to name these, however. My existing coding convention would prefix global variables names with g_ (that will have to be discussed in a separate post), but these type trait variables feel different from traditional global variables to me. In my mind they conceptually feel more like functions than variables, but functions that I call with <> instead of (). I wanted some alternate convention to make them visually distinct from standard variables.

After some experimentation I eventually settled on keeping the v from the standard but making it a prefix instead of a suffix, and I have been pretty happy so far with the result:

// vIsConst -> Whether a type is const-qualified
template<typename T>
constexpr bool vIsConst = false;
template<typename T>
constexpr bool vIsConst<const T> = true;

// An example of this being used:
if constexpr (vIsConst<decltype(someVariable)>)
{
	// Stuff
}

This convention has added a new member to my pantheon of prefixes, but it has felt natural and like a worthy addition so far. As an additional unexpected bonus it has also given me a new convention for naming template non-type parameters:

// My new convention:
template<typename tSomeType, bool vSomeCondition>
class myClass1;

// My previous convention, which I never loved:
template<typename tSomeType, bool tSomeCondition>
class myClass2;

Having a new way of unambiguously specifying compile-time values has improved the readability of my code for me.

Concept Names

I have encountered one case where I wanted to make a named constraint and I had to think about what to name it. I don’t have enough experience yet to know whether my initial attempt is something that I will end up liking, but this is what I have come up with:

// rIsBaseOf -> Enforces vIsBaseOf
template<typename tDerived, typename tBase>
concept rIsBaseOf = vIsBaseOf<tBase, tDerived>;

// An example of this being used:
template<rIsBaseOf<iBaseClass> tSomeClass>
class cMyConstrainedClass;

I couldn’t use c for “constraint” or “concept” because that was already taken for classes. I finally settled on r for “restraint” (kind of like a mix of “constraint” and “restrict”, with a suggestion of requires) and I don’t hate it so far but I also don’t love it. It feels like it is good enough to do the job for me in my own code, but it also feels like maybe there’s a better convention that I haven’t thought of yet.

Application with an Empty Window

Behold, the main window of an application using my nascent engine:

It does as little as one might expect from the screenshot, but some of the architecture might be of interest.

Build System

This section shows parts of the jpmake project file.

Platforms and Configurations

The different possible platforms and configurations are defined as follows:

-- Platforms
------------

-- Define the platforms by name and type

platform_windows = DefinePlatform("windows", "win64")
local platform_windows = platform_windows

-- Get the specified platform to execute tasks for
platform_target = GetTargetPlatform()
local platform_target = platform_target

-- Configurations
-----------------

-- Define the configurations by name

configuration_unoptimized = DefineConfiguration("unoptimized")
local configuration_unoptimized = configuration_unoptimized

configuration_optimized = DefineConfiguration("optimized")
local configuration_optimized = configuration_optimized

configuration_profile = DefineConfiguration("profile")
local configuration_profile = configuration_profile

configuration_release = DefineConfiguration("release")
local configuration_release = configuration_release

-- Get the specified configuration to execute tasks for
local configuration_current = GetCurrentConfiguration()

I am creating global variables for each platform and configuration so that they are accessible by other Lua files and then immediately assigning them to local variables so that they are cheaper to use in this file. (I currently only have the single Lua project file, but soon I will want to change this and have separate files that can focus on different parts of the project.)

At the moment I am specifying any platform-specific information using a strategy like if platform_target == platform_windows and that works fine (there are several examples later in this post), but I am considering defining something like isWindows = platform_target == platform_windows instead. There won’t be many platforms (only one for the foreseeable future!) and it seems like it would be easier to read and write many platform-specific things with a single boolean rather than with a long comparison. I am doing something similar with the configurations where I define booleans that serve as classification descriptions, and so far it feels nice to me (again, there are examples later in this post).

Directory Structure

The directory structure from the perspective of jpmake is currently defined as follows:

-- Source Files
do
	SetEnvironmentVariable("engineDir", "Engine/")
	SetEnvironmentVariable("appsDir", "Apps/")
end
-- Generated Files
do
	-- Anything in the temp directory should be generated by jpmake executing tasks
	-- and the entire folder should be safely deletable.
	-- Additionally, any files that are not part of the Git repository
	-- should be restricted to this folder.
	SetEnvironmentVariable("tempDir", ConcatenateTable{"temp/", platform_target:GetName(), "/", configuration_current, "/"})
	-- The intermediate directory is for files that must be generated while executing tasks
	-- but which aren't required to run the final applications
	SetEnvironmentVariable("intermediateDir", "$(tempDir)intermediate/")
	SetEnvironmentVariable("intermediateDir_engine", "$(intermediateDir)engine/")
	-- The artifact directory is where jpmake saves files
	-- that it uses to execute tasks
	SetArtifactDirectory("$(intermediateDir)jpmake/")
	-- The staging directory contains the final applications that can be run
	-- independently of the source project and intermediate files
	SetEnvironmentVariable("stagingDir", "$(tempDir)staging/")
end

As a general rule I don’t like abbreviations in variable or function names but I decided to keep the “dir” convention from Visual Studio since these environment variable names will be used so frequently in paths that it seems like a reasonable exception to keep things shorter and more readable. (I did, however, decide to change the first letter to lowercase which fits with my variable naming convention better.)

An issue that I have run into in the past is having trouble deciding how to name directory environment variables to distinguish between source and generated files, and with games where there can be code and assets both for the engine and the application the possible choices are even more complex (and, to make matters worse, with this project I am intending to support multiple applications using the engine and so there is yet a further distinction that must be made). What I have will likely change as time goes on and I write more code, but it feels like a good start. The root repository folder looks like this:

Any files that are generated by the build process are kept quarantined in a single folder (temp/) so that the distinction between source and deletable files is very clear. This is very important to me (as anyone who has worked with me can attest). The temp directory looks like the following, expanded for one platform and configuration:

With such a simple application the only thing in the staging directory is the executable file, but when I develop more complicated applications there will be other files in staging directories (e.g. the assets that the game loads).

One further consideration that is currently missing is what to do with “tools”, those programs that are used during development (either for authoring content or as part of the build process) but that don’t get released to end users. I can imagine that I might want to update some of this directory structure when I start developing tools.

C++ Configuration

The next section in the jpmake project file configures the default settings for how C++ is built for the current target platform and build configuration:

-- C++
------

-- Initialize C++ for the current platform and configuration
cppInfo_common = CreateCppInfo()
local cppInfo_common = cppInfo_common
do
	-- #define VLSH_PLATFORM_SOMENAME for conditional compilation
	do
		local platform_define_suffix
		if (platform_target == platform_windows) then
			platform_define_suffix = "WINDOWS"
		else
			platform_define_suffix = "NONE"
		end
		cppInfo_common:AddPreprocessorDefine(("VLSH_PLATFORM_" .. platform_define_suffix),
			-- There isn't any anticipated reason to check anything other than whether the platform is #defined,
			-- but the name is used as a value because why not?
			platform_target:GetName())
	end
	-- The project directory is used as an $include directory
	-- so that directives like the following can be done to show scope:
	--	#include <Engine/SomeFeature/SomeHeader.hpp>	
	cppInfo_common:AddIncludeDirectory(".")
	local isOptimized = configuration_current ~= configuration_unoptimized
	cppInfo_common:AddPreprocessorDefine("VLSH_CONFIGURATION_ISOPTIMIZED", isOptimized)
	local isForProfiling = configuration_current == configuration_profile
	cppInfo_common:AddPreprocessorDefine("VLSH_CONFIGURATION_ISFORPROFILING", isForProfiling)
	local isForRelease = configuration_current == configuration_release
	cppInfo_common:AddPreprocessorDefine("VLSH_CONFIGURATION_ISFORRELEASE", isForRelease)
	do
		local areAssertsEnabled = not isForRelease and not isForProfiling
		cppInfo_common:AddPreprocessorDefine("VLSH_ASSERT_ISENABLED", areAssertsEnabled)
	end
	cppInfo_common.shouldStandardLibrariesBeAvailable = false
	cppInfo_common.shouldPlatformLibrariesBeAvailable = false
	cppInfo_common.shouldExceptionsBeEnabled = false
	cppInfo_common.shouldDebugSymbolsBeAvailable =
		-- Debug symbols would also have to be available for release in order to debug crashes
		not isForRelease
	if platform_target == platform_windows then
		cppInfo_common.VisualStudio.shouldCRunTimeBeDebug = not isOptimized
		cppInfo_common.VisualStudio.shouldIncrementalLinkingBeEnabled =
			-- Incremental linking speeds up incremental builds at the expense of bigger executable size
			not isForRelease
		-- Warnings
		do
			cppInfo_common.VisualStudio.shouldAllCompilerWarningsBeErrors = true
			cppInfo_common.VisualStudio.shouldAllLibrarianWarningsBeErrors = true
			cppInfo_common.VisualStudio.compilerWarningLevel = 4
		end
	end
end

This shows the general approach I am taking towards configuring things (both from the perspective of the game engine and also from the perspective of jpmake and my personal ideal way of configuring software builds). The named configurations (e.g. unoptimized, optimized, profile, release) that I defined earlier are just arbitrary names from the perspective of jpmake and don’t have any semantics associated with them. Instead it is up to the user to specify how each configuration behaves. I can imagine that this would be seen as a negative for most people, but I have a personal issue where I generally prefer to have full control over things.

This section should not be understood as being complete (most notably there actually aren’t any optimization-related settings except for which C run-time to use!) but that is because I haven’t implemented all of the Visual Studio options in jpmake yet.

Engine Static Library

Below is one example of a static library that I have made, which provides base classes for applications (meaning that an actual application can inherit from the provided framework):

do
	local task_application = CreateNamedTask("Application")
	local cppInfo_application = cppInfo_common:CreateCopy()
	do
		if (platform_target == platform_windows) then
			cppInfo_application.shouldPlatformLibrariesBeAvailable = true
		end
	end
	engineLibrary_application = task_application:BuildCpp{
			target = "$(intermediateDir_engine)Application.lib", targetType = "staticLibrary",
			compile = {
				"$(engineDir)Application/iApplication.cpp",
				"$(engineDir)Application/iApplication_windowed.cpp",
				platform_target == platform_windows and "$(engineDir)Application/iApplication_windowed.win64.cpp" or nil,
			},
			link = {
				engineLibrary_assert,
				platform_target == platform_windows and CalculateAbsolutePathOfPlatformCppLibrary("User32.lib", cppInfo_application) or nil,
			},
			info = cppInfo_application,
		}
end
local engineLibrary_application = engineLibrary_application

My current plan is to have the “engine” consist of a collection of static libraries that all get linked into the single application executable.

This named task shows a file specific to Windows that is only compiled for that platform (iApplication_windowed.win64.cpp, where my convention is to try to put as much platform-specific code in separate platform-specific CPP files as possible and then those files have the platform name as a sub-extension), as well as a Windows library that is only needed for linking on that platform (User32.lib) and another static library (engineLibrary_assert, which was defined earlier but that I don’t show in this blog post) that this static library depends on.

As more files get created that are specific to one platform or another I think my style will have to change to make it less annoying to conditionally specify each one.

Applications

Finally, the two proof-of-concept applications that I have created are defined as follows:

-- Hello World
--============

do
	do
		SetEnvironmentVariable("appDir", "$(appsDir)HelloWorld/")
		SetEnvironmentVariable("stagingDir_app", "$(stagingDir)HelloWorld/")
	end
	local cppInfo_helloWorld = cppInfo_common:CreateCopy()
	do
		-- For std::cout
		cppInfo_helloWorld.shouldStandardLibrariesBeAvailable = true
	end
	do
		local helloWorld_task = CreateNamedTask("HelloWorld")
		local application_subTask = helloWorld_task:BuildCpp{
				target = "$(stagingDir_app)HelloWorld.exe", targetType = "consoleApplication",
				compile = {
					"$(appDir)EntryPoint.cpp",
				},
				info = cppInfo_helloWorld,
			}
		helloWorld_task:SetTargetForIde(application_subTask)
	end
end

-- Empty Window
--=============

do
	do
		SetEnvironmentVariable("appDir", "$(appsDir)EmptyWindow/")
		SetEnvironmentVariable("stagingDir_app", "$(stagingDir)EmptyWindow/")
	end
	local cppInfo_emptyWindow = cppInfo_common:CreateCopy()
	do
		cppInfo_emptyWindow:AddIncludeDirectory("$(appDir)")
	end
	do
		local emptyWindow_task = CreateNamedTask("EmptyWindow")
		local application_subTask = emptyWindow_task:BuildCpp{
				target = "$(stagingDir_app)EmptyWindow.exe", targetType = "windowedApplication",
				compile = {
					"$(appDir)EntryPoint.cpp",
				},
				link = {
					engineLibrary_application,
				},
				info = cppInfo_emptyWindow,
			}
		emptyWindow_task:SetTargetForIde(application_subTask)
	end
end

These show the general approach towards making executable applications that I am envisioning, although these both are as simple as possible.

One idiom that I discovered is reusing the same environment variable names but setting them to different values for different applications. This allowed the names to be shorter and thus more readable (before this I had different versions with _helloWorld and _emptyWindow), but I don’t have enough experience to decide if this will work well long term.

The examples also show calls to SetTargetForIde(), which has no effect when executing tasks but is instead used when generating the solution files so that Visual Studio will correctly have its $(TargetPath) set, which makes setting up debugging easier.

Visual Studio Solution

It is now possible for jpmake to generate Visual Studio solution and project files. I did this work to make it easier to write code and debug in Visual Studio. The Solution Explorer currently looks like the following for the jpmake project that I have been showing in this post:

And the properties of the EmptyWindow project have some things filled in:

I had to spend more time on generating these files and additional jpmake features than I had initially anticipated before working on the engine code because I wasn’t able to debug, which felt like a requirement. With the way it works now, however, I was able to write the empty window application and things worked reasonably well.

I did have one discouraging realization, however, which is that Intellisense doesn’t work yet. I was able to complete the empty window application without it but it was more annoying than I would have anticipated. I think I need to take some more time to improve jpmake so that Intellisense will work at least somewhat because not having it has proven to be an annoying impediment.

Creating a Video Game and Engine from Scratch

I am attempting to make a video game without using an available engine. I am documenting the process with two different kinds of posts:

Why?

I am currently unemployed by choice. I have been very fortunate professionally that I have been able to work on interesting projects, but there are also some things that I have never had a chance to work on that I am personally interested in and would like to explore. Primary among these are:

  • Graphics
    • Hardware ray tracing
      • I fell in love with all things ray tracing while a student at the University of Utah but I have not had an opportunity to develop anything since hardware support was added to consumer GPUs
    • HDR
      • Although I understand the concepts of tonemapping I have never had the opportunity to implement it, and after all of the work I have done with color and light it is an obvious next step and gap in my experience that I would like to fill
  • Memory Management
    • I have a fascination with manual memory management, and I am interested in attempting to write software that always uses explicit allocators rather than the global new/delete so that all memory is budgeted and uses appropriate allocation strategies
    • I am also interested in trying to go all-in on “data-oriented design” to gain more experience about how to design large scale software with cache-friendliness and good access patterns as a priority
  • Engine
    • I would like a better custom basis that I can use to create applications (not just necessarily games)
    • I have almost entirely worked with graphics in my professional career and I would like to gain experience and learn the technical aspects of other areas
  • Game Design
    • I have ideas of things that I like and things that I don’t like from playing games, but my actual experience of user-facing design is limited to APIs and GUI properties

It can be hard to find the energy or motivation to do personal projects like this with a full-time job, and especially with jobs in the industries that I work in which tend to be high-demand (which, to be fair, is also what makes them fun). With this in mind I have quit my job and am taking a temporary break from employment so that I can focus full time on learning and implementing some of the things listed above.

Goals and Scope

The time frame for working on this project is quite limited before I will have to find another paying job. If the end goal were to actually create a finished game that could be released and sold then the only reasonable strategy would be to use an established game engine (like Unreal or Unity), and even then the only kind of game that a solo developer could create in a short amount of time would have to be extremely limited in scope. The majority of my goals for this project, however, involve creating the technology from scratch. This means that the actual goal is primarily to create a game engine, and any “games” that I make will be more like tech demos.

The meaning of what a game engine is in the popular imagination seems to have changed over the years and my impression is that today many people think of a “game engine” as something like Unreal or Unity where there is an editor and it is meant to be used by external users to make any kind of game imaginable. That definition is not what I mean when I talk about what I want to work on, however: I am interested in making something just for me that serves as a framework to create interactive applications. Although in real game development tools are incredibly important I don’t anticipate (sadly) creating any editors for my current project, and any tools will be programs that build game-ready assets from authored assets.

My ideal outcome would be to actually create some finished experience that I could release in some way, but due to the uncertainty around the time available I don’t believe that that should be my criterion for evaluating whether the project succeeds or fails. Instead, my intent is to try and create several small applications along the way that are unfinished and unpolished (i.e. not releasable to a general audience) but that allow me to have some small scale goal to work towards when implementing features. I will also try to document milestones along the way with posts here. The true measure of success for me personally will be if I have been able to implement some of the features listed in the bullet points above.

With that being said, as a way of setting realistically low expectations my goals for an eventual releasable program would be:

  • Graphics rendered using hardware ray tracing
  • Audio sound effects that play dynamically in response to something
  • A player avatar in some kind of third-person view that can be controlled using an Xbox controller
  • Some kind of action/interaction that the player character can do with the environment

My graduate students in a semester-long class used to accomplish something similar (without the ray tracing) using a starting engine that I provided and so the list above feels achievable. I will have to decide as time passes whether to try and focus on implementing the above points early and then improving things or whether to be content focusing on the individual features that are interesting in the moment even if it means not finishing a final project.