Frame Pacing in a Very Simple Scene

I have recently integrated the Tracy profiler into my engine, and it has been a great help to be able to visualize how the CPU and GPU are interacting. Even though what is being rendered is as embarrassingly simple as possible, there were some things I had to fix that weren’t behaving as I had intended. Until I saw the data visualized, however, I wasn’t aware that there were problems! I have also been using PIX for Windows, Nsight, RenderDoc, and GPUView, but Tracy has been especially useful for presenting information across multiple frames in a way that I can customize to see the relationships that I want to see. I thought that it might be interesting to post about some of the issues, with screenshots from the profiler, while things are still simple and relatively easy to understand.

Visualizing Multiple Frames

Below is a screenshot of a capture from Tracy:

I have zoomed in at a level where 5 full frames are visible, with a little bit extra at the left and right. You can look for Frame 395, Frame 396, Frame 397, Frame 398, and Frame 399 to see where the frames are divided. I mark these frame boundaries explicitly, in a thread dedicated to waiting on IDXGIOutput::WaitForVBlank() and then marking the frame; this means that a “frame” in the screenshot above corresponds to a specific frame of the display’s refresh cycle.
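
To make that concrete, here is a minimal sketch of what such a heartbeat thread can look like, assuming Tracy’s C++ macros and a DXGI output; the names are illustrative rather than my actual code:

    #include <atomic>
    #include <dxgi.h>
    #include <tracy/Tracy.hpp>  // include path depends on how Tracy is integrated

    // Hypothetical heartbeat thread: block until each vertical blank on the
    // output and mark a Tracy frame there, so that one "frame" in the capture
    // corresponds to one refresh interval of the display.
    void VblankHeartbeatThread(IDXGIOutput* output, std::atomic<bool>& shouldQuit)
    {
        while (!shouldQuit.load(std::memory_order_relaxed))
        {
            output->WaitForVBlank();    // returns after the next vertical blank
            FrameMark;                  // tells Tracy that a new frame has started
        }
    }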

There is a second frame visualization at the top of the screenshot where there are many green and yellow rectangles. Each one of those represents the same kind of frame that was discussed in the previous paragraph, and the purple bar shows where in the timeline I am zoomed in (it’s hard to tell because it’s so small, but there are 7 bars within the purple section, corresponding to the 1 + 5 + 1 frames visible at this level of zoom).

In addition to marking frames, Tracy allows the user to mark what it calls “zones”. This is a way to subdivide each frame into separate hierarchical sections in order to visualize what is happening at different points in time during a frame. There are currently three threads shown in the capture:

  • The main thread (which is all that my program currently has for doing actual work)
  • An unnamed thread which is the vblank heartbeat thread
  • GPU execution, which is not a CPU thread but instead shows how GPU work lines up with CPU work
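
As a rough illustration of how the hierarchy comes about: Tracy’s scoped zone macros nest with the call stack, so nested function calls produce nested zones. The names below mirror zones visible in the capture, but the functions themselves are only a sketch:

    #include <tracy/Tracy.hpp>

    void RecordGraphicsCommands()
    {
        ZoneScopedN("RecordGraphicsCommands");      // shows up as a child zone
        // ... record commands into the command list ...
    }

    void RenderGraphicsFrameOnCpu()
    {
        ZoneScopedN("RenderGraphicsFrameOnCpu");    // parent zone for this frame's CPU work
        RecordGraphicsCommands();                   // nested call -> nested zone
        // ... wait for the swap and submit the commands ...
    }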

To help make sure that I was understanding things properly, I have color-coded some zones according to which swap chain texture is relevant. At the moment my swap chain only has two textures (meaning that there is only a single back buffer at any one time, and the two textures just toggle between being the front buffer and the back buffer any time a swap happens), and they are shown with DarkKhaki and SteelBlue. In the heartbeat thread the DisplayFrontBuffer zone is colored according to which texture is actually being displayed during that frame (this is not strictly true because of the Desktop Window Manager compositor, but for the purposes of this post we will pretend that it is true conceptually).

I have used the same colors in the main CPU thread to show which swap chain texture GPU commands are being recorded and submitted for. In other words, the DarkKhaki and SteelBlue colors identify a specific swap chain texture, the heartbeat thread shows when that texture is the front buffer, and the main thread shows when that texture is the back buffer. At the current level of zoom it is hard to read anything in the relevant zones but the colors at least give an idea of when the CPU is doing work for a given swap chain texture before it is displayed.
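
That CPU-side color coding works because Tracy allows a zone’s color to be set at runtime. A minimal sketch, assuming Tracy’s ZoneColor macro and IDXGISwapChain3::GetCurrentBackBufferIndex(); the color table and function signature are illustrative:

    #include <cstdint>
    #include <dxgi1_4.h>
    #include <tracy/Tracy.hpp>

    // X11 values for the two colors used in the capture
    constexpr uint32_t kSwapChainTextureColors[] = { 0xBDB76B /* DarkKhaki */, 0x4682B4 /* SteelBlue */ };

    void RenderGraphicsFrameOnCpu(IDXGISwapChain3* swapChain)
    {
        ZoneScopedN("RenderGraphicsFrameOnCpu");
        // Color the zone according to which swap chain texture the recorded
        // commands will eventually modify
        const UINT backBufferIndex = swapChain->GetCurrentBackBufferIndex();
        ZoneColor(kSwapChainTextureColors[backBufferIndex]);
        // ... record, wait for the swap, submit ...
    }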

Unfortunately for this post I don’t think that there is a way to dynamically modify the colors of zones in the GPU timeline (instead it seems to be a requirement that the colors are known at compile time), and so I can’t make the same visual correspondence. From a visualization standpoint I think it would be nice to show some kind of zone for the present queue (using Windows terminology), but even without that it can be understood implicitly. I will discuss the GPU timeline more later in the post when things are zoomed in further.
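
For reference, the GPU zones come from Tracy’s D3D12 integration, where the zone macro records timestamp queries into the command list; this sketch assumes TracyD3D12.hpp with a context created elsewhere via TracyD3D12Context(), and the zone name is made up:

    #include <d3d12.h>
    #include <tracy/TracyD3D12.hpp>

    // ctx would be created once per queue, e.g. ctx = TracyD3D12Context(device, directQueue)
    void RecordGpuWork(TracyD3D12Ctx ctx, ID3D12GraphicsCommandList* commandList)
    {
        // The zone brackets the GPU work recorded below with timestamp queries;
        // as discussed above, its color appears to need to be known at compile time
        TracyD3D12Zone(ctx, commandList, "RenderGraphicsFrameOnGpu");
        // ... ClearRenderTargetView, DrawInstanced, etc. ...
    }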

With all of that explanation let me show the initial screenshot again:

Hopefully what you are looking at now makes some kind of sense!

Visualizing a Single Frame

Let us now zoom in further to just look at a single frame:

During this Frame 396 we can see that the DarkKhaki texture is being displayed as the front buffer. That means that the SteelBlue texture is the back buffer, which is to say that it is the texture that must be modified so that it can then be shown during Frame 397.

Look at the GPU timeline. There is a very small OliveDrab zone that shows work being done on the GPU. That is where the GPU actually modifies the SteelBlue back buffer texture.

Now look at the CPU timeline. There is a zone called RenderGraphicsFrameOnCpu which is where the CPU is recording the commands for the GPU to execute and then submitting those commands (zones are hierarchical, and so the zones below RenderGraphicsFrameOnCpu are showing it subdivided even further). The color is SteelBlue, and so these GPU commands will modify the texture that was being displayed in Frame 395 and that will again be displayed in Frame 397. You may notice that this section starts before the start of Frame 396, while the SteelBlue texture is still the front buffer and thus is still being displayed! In order to better understand what is happening we can zoom in even further:

Compare this with the previous screenshot. This is the CPU work being done at the end of Frame 395 and the beginning of Frame 396, and it is the work that will determine what is displayed during Frame 397.

The work that is done can be thought of as:

  • On the CPU, record some commands for the GPU to execute
  • On the CPU, submit those commands to the GPU so that it can start executing them
  • On the CPU, submit a swap command to change the newly-modified back buffer into the front buffer at the next vblank after all GPU commands are finished executing
  • On the GPU, execute the commands that were submitted
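
In D3D12/DXGI terms those four steps map roughly onto the calls below; this is only a sketch with illustrative names and no error handling, not my engine’s actual code:

    #include <d3d12.h>
    #include <dxgi1_4.h>

    void SubmitFrame(ID3D12GraphicsCommandList* commandList,
                     ID3D12CommandQueue* queue,
                     IDXGISwapChain3* swapChain)
    {
        // 1) and 2): commands were recorded into commandList earlier; close it
        //    and submit it so the GPU can start executing
        commandList->Close();
        ID3D12CommandList* lists[] = { commandList };
        queue->ExecuteCommandLists(1, lists);

        // 3): queue the swap; with a sync interval of 1 the newly-modified back
        //     buffer becomes the front buffer at a vblank, after the submitted
        //     GPU work has finished
        swapChain->Present(1, 0);

        // 4): the GPU executes the submitted commands asynchronously while the
        //     CPU moves on (subject to the waits discussed below)
    }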

It is important that the GPU doesn’t start executing any commands that would modify the SteelBlue swap chain texture until that texture becomes the back buffer (and is no longer being displayed). The WaitForSwap zone shows where the CPU is waiting for the swap to happen before submitting the commands (which triggers the GPU to start executing them). There is no reason, however, that the CPU can’t record commands ahead of time, as long as those commands aren’t submitted to the GPU until the SteelBlue texture is ready to be modified. This is why the RenderGraphicsFrameOnCpu zone can start early: it records commands for the GPU (you can see a small OliveDrab section where this happens) but then must wait before submitting the commands (the next OliveDrab section).
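
The WaitForSwap wait is built on the swap chain waitable object (mentioned again later in the post); here is a minimal sketch of that idea, assuming the swap chain was created with the DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT flag, with illustrative names:

    #include <windows.h>
    #include <tracy/Tracy.hpp>

    // waitableObject comes from IDXGISwapChain2::GetFrameLatencyWaitableObject(),
    // typically retrieved once right after the swap chain is created.
    void WaitForSwap(HANDLE waitableObject)
    {
        ZoneScopedN("WaitForSwap");
        // Blocks until DXGI is ready for another frame to be queued, i.e. the
        // texture we want to modify is no longer being displayed
        WaitForSingleObjectEx(waitableObject, INFINITE, FALSE);
    }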

How early can the CPU start recording commands? There are two different answers to this, depending on how the application works. The simple answer (well, “simple” if you understand D3D12 command allocators) is that recording can start as soon as the GPU has finished executing the previously-submitted commands that were saved in the memory that the new recording is going to reuse. There is a check for this in my code that is so small that it can only be seen if the profiler is zoomed in even further:

The reason that this wait is so short is that the GPU work being done is so simple that the GPU reached the submitted swap long before the CPU checked to make sure.

Do you see that long line between the GPU executing the commands and the CPU then recording new ones? With the small amount of GPU work that my program is currently doing (clearing the texture and then drawing two quads), there isn’t anything to wait for by the time I am ready to start recording new commands.
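
The check mentioned above is essentially a fence comparison before the command allocator’s memory gets reused; here is a minimal sketch of that idea (illustrative names, not my actual code):

    #include <d3d12.h>
    #include <windows.h>

    // fenceValueWhenAllocatorWasLastUsed is the value signaled on the queue right
    // after the previous submission that recorded into this allocator.
    void WaitForGpuBeforeReusingAllocator(ID3D12Fence* fence,
                                          UINT64 fenceValueWhenAllocatorWasLastUsed,
                                          HANDLE fenceEvent)
    {
        // Usually the GPU finished long ago, so this branch is rarely taken and
        // the corresponding zone is tiny in the capture
        if (fence->GetCompletedValue() < fenceValueWhenAllocatorWasLastUsed)
        {
            fence->SetEventOnCompletion(fenceValueWhenAllocatorWasLastUsed, fenceEvent);
            WaitForSingleObject(fenceEvent, INFINITE);
        }
    }

    void BeginRecording(ID3D12CommandAllocator* allocator, ID3D12GraphicsCommandList* commandList)
    {
        // Only safe after the check above: Reset() reclaims the memory that the
        // previously-recorded commands were stored in
        allocator->Reset();
        commandList->Reset(allocator, nullptr);
    }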

If you’ve been following, you might be asking yourself why I don’t start recording GPU commands even sooner. Based on what I’ve explained above, the program could be even more efficient and start recording commands as soon as the GPU has finished executing the previous commands, and this would definitely be a valid strategy with the simple program that I have right now:

This is a capture that I made after I modified my program to record new GPU commands as soon as possible. The WaitForPredictedVblank zone is gone, the WaitForGpuToReachSwap zone is now visible at this level of zoom, and the WaitForSwap zone is now bigger. The overlapping of DarkKhaki and SteelBlue is much more pronounced because the CPU starts working on rendering a new version of a swap chain texture as soon as that texture is displayed to the user as the front buffer (although notice that the commands still aren’t submitted to the GPU until after the swap happens and the texture is no longer displayed to the user). Based on my understanding, this kind of scheduling probably represents something close to the ideal situation if 1) a program wants to use vsync, 2) knows that it can render everything fast enough within one display refresh, and 3) doesn’t have to worry about user input.

The next section explains what the WaitForPredictedVblank zone is for and why user input makes the idealized screenshot above not as good as it might at first seem.

When to Start Recording GPU Commands

Earlier I said that there were two different answers to the question of how early the CPU can start recording commands for the GPU. In my profile screenshots there is a DarkRed zone called WaitForPredictedVblank that I haven’t explained yet, but we did observe that it could be removed and that doing so allowed even more efficient scheduling of work. This WaitForPredictedVblank zone is related to the second answer to the question of when to start recording commands.

My end goal is to make a game, which means that the application is interactive and can be influenced by the player. If my program weren’t interactive but instead just had to render predetermined frames as efficiently as possible (something like a video player, for example) then it would make sense to start recording commands for the GPU as soon as possible (as shown in the previous section). The requirement to be interactive, however, makes things more complicated.

The results of an interactive program are non-deterministic. In the context of the current discussion this can be thought of as an additional constraint on when commands for the GPU can start being recorded, which is so simple that it is kind of funny to write out: commands for the GPU to execute can’t start being recorded until it is known what those commands should be. The amount of time between recording GPU commands and the results of executing those commands being displayed has a direct relationship to the latency between a user providing input and the user seeing the result of that input on the display. The later the contents of a rendered frame are determined, the less latency the user will experience.

All of that is a long way of explaining what the WaitForPredictedVblank zone is: It is a placeholder in my engine for dealing with game logic and simulation updates. I can predict when the next vblank is (see the Syncing without VSync post for more details), and I am using that as a target for when to start recording the next frame. Since I don’t actually have any work to do yet I am doing a Sleep() in Windows, and since the results of sleeping have limited precision I only sleep until relatively close to the predicted vblank and then wait on the more reliable swap chain waitable object (this is the WaitForSwap zone):

(Side note: Being able to visualize this in the instrumented profile gives more evidence that my method of predicting when the vblank will happen is pretty reliable, which is gratifying.)
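
Conceptually that zone is just a coarse sleep followed by the more precise wait on the swap chain waitable object; a hypothetical sketch (the safety margin and the names are made up):

    #include <windows.h>
    #include <tracy/Tracy.hpp>

    void WaitForPredictedVblank(double millisecondsUntilPredictedVblank)
    {
        ZoneScopedN("WaitForPredictedVblank");
        // Sleep() has limited precision, so stop short of the predicted vblank...
        constexpr double safetyMarginMilliseconds = 2.0;    // illustrative margin
        const double sleepMilliseconds = millisecondsUntilPredictedVblank - safetyMarginMilliseconds;
        if (sleepMilliseconds > 0.0)
        {
            Sleep(static_cast<DWORD>(sleepMilliseconds));
        }
        // ...and let the more reliable wait on the swap chain waitable object
        // (the WaitForSwap zone) cover the remaining time before submitting
    }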

The next step will be to implement simulation updates using fixed timesteps and then record GPU commands at the appropriate time, interpolating between the two appropriate simulation updates. That will remove the big WaitForPredictedVblank zone, and instead some form of individual simulation updates should be visible.
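
For what it’s worth, the usual shape of that fixed-timestep-plus-interpolation pattern looks something like the sketch below; none of this exists in my engine yet, and all of the names are placeholders:

    struct SimulationState { float position = 0.0f; };  // stand-in for real game state

    SimulationState SimulateFixedStep(SimulationState state, double timestepSeconds)
    {
        state.position += static_cast<float>(timestepSeconds);  // placeholder "simulation"
        return state;
    }

    SimulationState Interpolate(const SimulationState& a, const SimulationState& b, double alpha)
    {
        return { a.position + static_cast<float>(alpha) * (b.position - a.position) };
    }

    void UpdateAndRender(double elapsedSeconds, double& accumulatorSeconds,
                         SimulationState& previous, SimulationState& current)
    {
        constexpr double fixedTimestepSeconds = 1.0 / 60.0;  // independent of the display refresh rate
        accumulatorSeconds += elapsedSeconds;
        while (accumulatorSeconds >= fixedTimestepSeconds)   // run however many fixed updates have elapsed
        {
            previous = current;
            current = SimulateFixedStep(current, fixedTimestepSeconds);
            accumulatorSeconds -= fixedTimestepSeconds;
        }
        // How far between the two most recent updates this rendered frame falls
        const double alpha = accumulatorSeconds / fixedTimestepSeconds;
        const SimulationState stateToRender = Interpolate(previous, current, alpha);
        // ... record GPU commands for stateToRender at the appropriate time ...
        (void)stateToRender;
    }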

Conclusion

If you’ve made it this far, congratulations! I will show the initial screenshot one more time, showing the current state of my engine’s rendering and how work for the GPU is scheduled, recorded, and submitted:

Devlog 2024-12-16

This is a regularly-occurring status update. More generally-relevant posts can be found under Features (see Creating a Game and Engine from Scratch for context).

This is the beginning of week 9.

What I have done

  • The texture is now uploaded using the copy command queue instead of the graphics command queue
  • Made a platform-independent interface for vertex buffers
  • Made a platform-independent interface for index buffers and am using it now to draw the simple square
  • Made a platform-independent interface for constant buffers and am using it now to draw two squares (where it is being treated as a per-draw constant buffer)
  • Added the Tracy profiler
    • Implemented an interface for CPU profiling that doesn’t require adding anything from Tracy in header files
    • Implemented an interface for GPU profiling that doesn’t require adding anything from Tracy in header files
    • Modified the PIX for Windows integration to work more like Tracy
  • Build system
    • Added a Lua function to resolve paths
      • This makes it easier and nicer to generate header #include files, since resolving environment variables can be done automatically
    • Fixed a tricky bug when adding multiple levels of filters in Visual Studio
    • Fixed a bug when several different C++ source files specified different ways of building that were independent of the general way specified for the target
    • Found a workaround to help Intellisense give proper x64 suggestions (instead of x86)
      • This didn’t affect the actual compiled or linked files, but it was a relief to finally have pointers and struct sizes reported correctly

Next Steps

I am getting stuck (classic “analysis paralysis”) on trying to figure out how I want frame updates to work. This is kind of where I got stuck previously when I was doing the “beam racing” stuff, and even though I finally gave up and moved on I am running into it again in a different context, now that I am trying to figure out how to have the application tell the graphics system what to render and when. It would be easy to just do something simple and move on, and maybe I should, but one of my personal goals for this project is to have fixed simulation timesteps that are independent of the fixed display refresh rate, and what I want to accomplish is pretty straightforward even though I’m getting hung up on the details.

  • Integrating the Tracy profiler has been a big help, and I am now able to better visualize what my code is actually doing. I need to somehow get something in place, even if it’s just rudimentary, so that I can have simulation updates (that can run both slower and faster than the display refresh rate) and then choose the appropriate time to render an interpolation between two of them.
  • I still need to implement a way in the build system to compile shaders offline and then a way to load them at runtime, even if very simple
    • (Recall that I am avoiding the standard library and general-purpose memory management, which is why some of those seemingly-simple tasks involve more work than might be expected)
  • I want to add a camera and 3D transforms

Devlog 2024-12-09

This is a regularly-occurring status update. More generally-relevant posts can be found under Features (see Creating a Game and Engine from Scratch for context).

This is the beginning of week 8.

What I have done

I got very sick this week and so a lot of time was lost.

  • Fix precompiled header source linking
    • It turns out that on MSVC it’s not enough to use a precompiled header, but the source (i.e. CPP) file that created it must also be linked. This kind of makes sense in retrospect, but things were somehow working without doing that for a while.
  • Created a texture resource in an “upload” heap, mapped and copied texture data to it there, and then copied the resource to the shader-visible default heap
    • This all just followed the Microsoft “hello” sample
  • Create constant buffers in different ways and different places to move the triangle mesh around
    • I think I am finally understanding how resource management works in D3D12, after an embarrassingly long time and an embarrassing amount of re-reading different documentation and forum posts
    • It’s still unclear to me what the best strategy to manage it all is, but at least that isn’t embarrassing: the API seems intentionally flexible, and so everyone is a little unclear because there are many different ways to approach it
    • Additionally, since D3D12 was originally released there are new techniques for “bindless” rendering, and it seems like I will probably want to use those for ray tracing, so there is still more to learn. One step at a time, though…

Next Steps

The sickness has thrown me off of my schedule, and I also find that I am suffering from the mental symptom of wanting-everything-perfect-and-not-knowing-where-to-start with some of the problems that I have to tackle next. I am at the point where I just need to force myself to do something to make progress, acknowledging that it might not be ideal and will require refactoring later. (I actually have pretty clear ideas of how I would naturally do some of this stuff, but I am also trying to approach things in a purely “data-oriented” way for this project, and that is contributing to some of my hesitancy because I wonder if my natural tendency is the result of old habits.)

I think there are three main things that I want to do next:

  1. Refactor some native D3D12 objects so that I can work with them in a more ergonomic (and potentially platform-independent) way, but more specifically so that I can specify things to render from the application rather than hardcode things in the graphics system
  2. Expand the constant buffer scheme so that I can have a camera and multiple objects with transforms and materials (even if these are very simple initially)
  3. Add the capability to the build system to run an arbitrary command line so that shaders can be compiled offline, both because I want shader errors reported at build time and so that I can start using shader model 6.6 for the new bindless resources in a more sane way

I have also thought of a variation on breakout that seems like a simple way to incorporate some ray tracing, and so, at least for now, I think that might be my initial small proof-of-concept application to try and make, just so that I have some kind of example application rather than the current empty program that does nothing.

Devlog 2024-12-02

This is a regularly-occurring status update. More generally-relevant posts can be found under Features (see Creating a Game and Engine from Scratch for context).

This is the beginning of week 7.

What I have done

  • Finished the intentional tearing demo, which I wrote a post about
    • This was both satisfying and frustrating. I was happy to get something working that I had wanted to do for a long time but also disappointed that I didn’t get better results. I am also frustrated that I still don’t feel like I have as good of an understanding of swap chains and related flipping and timing as I would like.
    • I finally gave up and moved on and will have to revisit some of this when I have more complicated scenes being rendered, but it felt a bit like accepting defeat
  • Improved the waiting code for the case where the wait happens on the Windows message queue thread
    • Right now the entire application is single-threaded, but a rudimentary mechanism is in place to detect the thread (what I am currently calling the “display thread”) and do a different kind of wait so that new messages can be processed if appropriate
  • Render a triangle
    • This was mainly just following the Microsoft Hello-Triangle example, and so not particularly impressive
  • Improved the array view class
    • This is my implementation of std::span, and the initial implementation revealed some problems when dealing with COM ISomething* pointers, both because of pointer const complexity and because of inheritance
    • I wrote a bunch of tests of situations that I could think of and the class now works better. This presented a fun (though frustrating) challenge to solve with C++ templates.
  • Add WinPixEventRuntime
    • This is the first external code/library that I have included in the project, which is a little unusual for me. Usually I start with {fmt} and Lua in order to have logging and configuration available, but in the current case I am delaying dealing with any strings and so haven’t followed my usual pattern.
    • This integration allows me to annotate captures in PIX for Windows
  • Add precompiled header support in the build system
    • Adding the ability to (force) include files wasn’t particularly difficult, and that was the primary motivation for doing this, since it lets me start writing and structuring code the way that I want to. I thought that going one more step with precompiled headers wouldn’t be too difficult, but it ended up being more challenging than I had expected.
    • MSVC has some interesting constraints that I hadn’t been aware of, where any files using a precompiled header have to use the same PDB file. Most notably, this means that a single PDB file is created and modified in many different steps by many different files getting compiled, which didn’t fit well into my build system model where every task graph node (i.e. file) was the result of a single sub task. I think that I figured out a satisfactory way around this, though.
  • Add ability in the build system to query whether a sub task has a specific input
    • This allows me to decide whether a specific application needs the WinPixEventRuntime DLL
  • Add ability in the build system to query for the primary/obvious output target of a sub task
    • This allows me to decide where to stage the WinPixEventRuntime so that it is next to an application executable
  • Add ability in the build system to create files
    • This allows me to programmatically create a header file that #includes the WinPixEventRuntime headers using the current version number, sharing code with the LIB and DLL files

Next Steps

  • An obvious need is to add something to the build system to deal with shaders
    • Right now I just have a manually-copied shader source file in the application root directory that is compiled at run time, which is obviously not ideal
    • I am a little hesitant to start going down the path of figuring out asset building and loading, however, because I don’t have any string solution (recall that a goal of this project is to not use the standard library and to not use the general purpose new allocator, and so dealing with strings and paths is a big task to tackle)
    • With that being said, I could at least create a new type of build system sub task that allows a command line to be run and inputs and outputs to be specified. This would make it possible to work with the shader close to what I’m doing now but not quite so manually, and seems like an ok first step.
  • The other obvious task is to keep adding graphics features beyond the simple triangle and color-changing clears
    • I might at least continue to implement some of the D3D12 “Hello” examples, to increase my familiarity with the changes in D3D12 compared to D3D11 and to give me more time and experience to start thinking about abstractions and how to structure an actual renderer the way I would want to