Frame Pacing in a Very Simple Scene

I have recently integrated the Tracy profiler into my engine, and it has been a great help to be able to visualize how the CPU and GPU are interacting. Even though what is being rendered is as embarrassingly simple as possible, there were some things that weren’t behaving as I had intended and had to be fixed; until I saw the data visualized, however, I wasn’t aware that there were problems! I have also been using PIX for Windows, Nsight, RenderDoc, and GPUView, but Tracy has been especially useful in presenting information across multiple frames in a way that I can customize to see the relationships that I want to see. I thought that it might be interesting to post about some of these issues, with screenshots from the profiler, while things are still simple and relatively easy to understand.

Visualizing Multiple Frames

Below is a screenshot of a capture from Tracy:

I have zoomed in at a level where 5 full frames are visible, with a little bit extra at the left and right. You can look for Frame 395, Frame 396, Frame 397, Frame 398, and Frame 399 to see where the frames are divided. These frame boundaries are explicitly marked by me, and I am doing so in a thread dedicated to waiting for IDXGIOutput::WaitForVBlank() and marking the frame; this means that a “frame” in the screenshot above indicates a specific frame of the display’s refresh cycle.
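
To make the heartbeat concrete, here is a minimal sketch of what such a thread could look like, using Tracy’s named frame marks (the structure and names are illustrative rather than my engine’s actual code):

    // A sketch of a vblank heartbeat thread, assuming an existing IDXGIOutput*
    // for the display being presented to; "Display" is an illustrative name.
    #include <tracy/Tracy.hpp>
    #include <dxgi.h>
    #include <atomic>

    void VblankHeartbeatThread(IDXGIOutput* output, std::atomic<bool>& shouldQuit)
    {
        while (!shouldQuit.load())
        {
            // Blocks until the display enters its vertical blanking period
            output->WaitForVBlank();
            // Mark the boundary between display refresh "frames" in Tracy
            FrameMarkNamed("Display");
        }
    }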

There is a second frame visualization at the top of the screenshot where there are many green and yellow rectangles. Each one of those represents the same kind of frame that was discussed in the previous paragraph, and the purple bar shows where in the timeline I am zoomed in (it’s hard to tell because it’s so small, but there are 7 bars within the purple section, corresponding to the 1 + 5 + 1 frames visible at this level of zoom).

In addition to marking frames Tracy allows the user to mark what it calls “zones”. This is a way to subdivide each frame into separate hierarchical sections in order to visualize what is happening at different points in time during a frame. There are currently three threads shown in the capture:

  • The main thread (which is all that my program currently has for doing actual work)
  • An unnamed thread which is the vblank heartbeat thread
  • GPU execution, which is not a CPU thread but instead shows how GPU work lines up with CPU work

In order to try and help me make sure that I was understanding things properly I have color-coded some zones according to which swap chain texture is relevant. At the moment my swap chain only has two textures (meaning that there is only a single back buffer at any one time, and the two textures just toggle between being the front buffer or the back buffer any time a swap happens), and they are shown with DarkKhaki and SteelBlue. In the heartbeat thread the DisplayFrontBuffer zone is colored according to which texture is actually being displayed during that frame (actually this is not true because of the Desktop Window Manager compositor, but for the purposes of this post we will pretend that it is true conceptually).

I have used the same colors in the main CPU thread to show which swap chain texture GPU commands are being recorded and submitted for. In other words, the DarkKhaki and SteelBlue colors identify a specific swap chain texture, the heartbeat thread shows when that texture is the front buffer, and the main thread shows when that texture is the back buffer. At the current level of zoom it is hard to read anything in the relevant zones but the colors at least give an idea of when the CPU is doing work for a given swap chain texture before it is displayed.
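
As a rough sketch, the runtime coloring in the main thread looks something like this (assuming Tracy’s ZoneColor macro for setting a zone’s color dynamically; the buffer-index logic is illustrative):

    // A sketch of coloring a zone at runtime by swap chain texture, assuming a
    // two-texture swap chain; backBufferIndex would come from something like
    // IDXGISwapChain3::GetCurrentBackBufferIndex() or my own bookkeeping.
    #include <tracy/Tracy.hpp>

    void RenderGraphicsFrameOnCpu(unsigned int backBufferIndex)
    {
        ZoneScopedN("RenderGraphicsFrameOnCpu");
        // Identify which swap chain texture these GPU commands will target
        ZoneColor((backBufferIndex == 0) ? tracy::Color::DarkKhaki
                                         : tracy::Color::SteelBlue);
        // ... record and submit GPU commands ...
    }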

Unfortunately for this post I don’t think that there is a way to dynamically modify the colors of zones in the GPU timeline (instead it seems that they are required to be known at compile time), and so I can’t make the same visual correspondence there. From a visualization standpoint I think it would be nice to show some kind of zone for the present queue (using Windows terminology), but even without that it can be understood implicitly. I will discuss the GPU timeline more later in the post when things are zoomed in further.

With all of that explanation let me show the initial screenshot again:

Hopefully it makes some kind of sense now what you are looking at!

Visualizing a Single Frame

Let us now zoom in further to just look at a single frame:

During this Frame 396 we can see that the DarkKhaki texture is being displayed as the front buffer. That means that the SteelBlue texture is the back buffer, which is to say that it is the texture that must be modified so that it can then be shown during Frame 397.

Look at the GPU timeline. There is a very small OliveDrab zone that shows work being done on the GPU. That is where the GPU actually modifies the SteelBlue back buffer texture.

Now look at the CPU timeline. There is a zone called RenderGraphicsFrameOnCpu which is where the CPU is recording the commands for the GPU to execute and then submitting those commands (zones are hierarchical, and so the zones below RenderGraphicsFrameOnCpu are showing it subdivided even further). The color is SteelBlue, and so these GPU commands will modify the texture that was being displayed in Frame 395 and that will again be displayed in Frame 397. You may notice that this section starts before the start of Frame 396, while the SteelBlue texture is still the front buffer and thus is still being displayed! In order to better understand what is happening we can zoom in even further:

Compare this with the previous screenshot. This is the CPU work being done at the end of Frame 395 and the beginning of Frame 396, and it is the work that will determine what is displayed during Frame 397.

The work that is done can be thought of as:

  • On the CPU, record some commands for the GPU to execute
  • On the CPU, submit those commands to the GPU so that it can start executing them
  • On the CPU, submit a swap command to change the newly-modified back buffer into the front buffer at the next vblank after all GPU commands are finished executing
  • On the GPU, execute the commands that were submitted

It is important that the GPU doesn’t start executing any commands that would modify the SteelBlue swap chain texture until that texture becomes the back buffer (and is no longer being displayed). The WaitForSwap zone shows where the CPU is waiting for the swap to happen before submitting the commands (which triggers the GPU to start executing them). There is no reason, however, that the CPU can’t record commands ahead of time, as long as those commands aren’t submitted to the GPU until the SteelBlue texture is ready to be modified. This is why the RenderGraphicsFrameOnCpu zone can start early: it records commands for the GPU (you can see a small OliveDrab section where this happens) but then must wait before submitting those commands (the next OliveDrab section).
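
Expressed as a sketch in D3D12 terms (the helper and variable names are illustrative, not my engine’s actual code; the waitable object would come from IDXGISwapChain2::GetFrameLatencyWaitableObject()):

    #include <d3d12.h>
    #include <dxgi1_4.h>

    // A sketch of recording early but submitting late.
    void RecordThenWaitThenSubmit(ID3D12GraphicsCommandList* commandList,
                                  ID3D12CommandAllocator* commandAllocator,
                                  ID3D12CommandQueue* commandQueue,
                                  IDXGISwapChain3* swapChain,
                                  HANDLE swapChainWaitableObject)
    {
        // 1) Record commands (safe even while the target texture is still the
        //    front buffer, because nothing executes yet)
        commandList->Reset(commandAllocator, nullptr);
        RecordClearAndDrawCommands(commandList); // hypothetical helper
        commandList->Close();

        // 2) Wait until the swap has happened and the target texture is no
        //    longer being displayed (the WaitForSwap zone)
        WaitForSingleObjectEx(swapChainWaitableObject, INFINITE, FALSE);

        // 3) Submit; the GPU can now safely modify the back buffer
        ID3D12CommandList* lists[] = { commandList };
        commandQueue->ExecuteCommandLists(1, lists);

        // 4) Queue a swap for the next vblank after the GPU work finishes
        swapChain->Present(1, 0);
    }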

How early can the CPU start recording commands? There are two different answers to this, depending on how the application works. The simple answer (well, “simple” if you understand D3D12 command allocators) is that recording can start as soon as the GPU has finished executing the previously-submitted commands that were saved in the memory that the new recording is going to reuse. There is a check for this in my code that is so small that it can only be seen if the profiler is zoomed in even further:

The reason that this wait is so short is that the GPU work being done is so simple that it had reached the submitted swap long before the CPU checked to make sure.

Do you see that long line between executing the GPU commands and then recording new ones on the CPU? With the small amount of GPU work that my program is currently doing (clearing the texture and then drawing two quads) there isn’t anything to wait for by the time I am ready to start recording new commands.
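
The check itself is just a fence comparison; here is a minimal sketch (assuming a fence value that was signaled after the previous submission; the names are illustrative):

    // A sketch of the command allocator reuse check; fenceValueForThisAllocator
    // is whatever value was signaled after the previous commands were submitted.
    void WaitForGpuBeforeReusingAllocator(ID3D12Fence* fence,
                                          UINT64 fenceValueForThisAllocator,
                                          HANDLE fenceEvent,
                                          ID3D12CommandAllocator* commandAllocator)
    {
        if (fence->GetCompletedValue() < fenceValueForThisAllocator)
        {
            // Only reached if the GPU is still executing the old commands; with
            // a workload this small the wait is almost always already satisfied
            fence->SetEventOnCompletion(fenceValueForThisAllocator, fenceEvent);
            WaitForSingleObject(fenceEvent, INFINITE);
        }
        // The memory backing the old commands can now safely be reused
        commandAllocator->Reset();
    }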

If you’ve been following, you might be asking yourself why I don’t start recording GPU commands even sooner. Based on what I’ve explained above, the program could be even more efficient and start recording commands as soon as the GPU finished executing the previous commands, and this would definitely be a valid strategy with the simple program that I have right now:

This is a capture that I made after I modified my program to record new GPU commands as soon as possible. The WaitForPredictedVblank zone is gone, the WaitForGpuToReachSwap zone is now visible at this level of zoom, and the WaitForSwap zone is now bigger. The overlapping of DarkKhaki and SteelBlue is much more pronounced because the CPU starts to work on rendering a new version of a swap chain texture as soon as that texture is displayed to the user as the front buffer (although notice that the commands still aren’t submitted to the GPU until after the swap happens and the texture is no longer displayed to the user). Based on my understanding, this kind of scheduling probably represents something close to the ideal situation if 1) a program wants to use vsync, 2) knows that it can render everything fast enough within one display refresh, and 3) doesn’t have to worry about user input.

The next section explains what the WaitForPredictedVblank zone is for and why user input makes the idealized screenshot above not as good as it might at first seem.

When to Start Recording GPU Commands

Earlier I said that there were two different answers to the question of how early the CPU can start recording commands for the GPU. In my profile screenshots there is a DarkRed zone called WaitForPredictedVblank that I haven’t explained yet, but we did observe that it could be removed and that doing so allowed even more efficient scheduling of work. This WaitForPredictedVblank zone is related to the second, alternative answer to the question of when to start recording commands.

My end goal is to make a game, which means that the application is interactive and can be influenced by the player. If my program weren’t interactive but instead just had to render predetermined frames as efficiently as possible (something like a video player, for example) then it would make sense to start recording commands for the GPU as soon as possible (as shown in the previous section). The requirement to be interactive, however, makes things more complicated.

The results of an interactive program are non-deterministic. In the context of the current discussion this can be thought of as an additional constraint on when commands for the GPU can start being recorded, one which is so simple that it is kind of funny to write out: commands for the GPU to execute can’t start being recorded until it is known what the commands for the GPU to execute should be. The amount of time between recording GPU commands and the results of executing those commands being displayed has a direct relationship to the latency between a user providing input and the user seeing the result of that input on a display. The later the contents of a rendered frame are determined, the less latency the user will experience.

All of that is a long way of explaining what the WaitForPredictedVblank zone is: It is a placeholder in my engine for dealing with game logic and simulation updates. I can predict when the next vblank is (see the Syncing without VSync post for more details), and I am using that as a target for when to start recording the next frame. Since I don’t actually have any work to do yet I just call Sleep() in Windows, and since sleeping has limited precision I only sleep until relatively close to the predicted vblank and then wait on the more reliable swap chain waitable object (this is the WaitForSwap zone):

(Side note: Being able to visualize this in the instrumented profile gives more evidence that my method of predicting when the vblank will happen is pretty reliable, which is gratifying.)
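
In sketch form the two waits look something like this (the 2 ms margin is an illustrative guess at Sleep()’s precision, and GetCurrentTimeSeconds is a hypothetical QPC-based helper):

    // A sketch of WaitForPredictedVblank followed by WaitForSwap.
    void WaitUntilTimeToRender(double predictedNextVblankSeconds,
                               HANDLE swapChainWaitableObject)
    {
        const double marginSeconds = 0.002; // guess at Sleep()'s precision
        const double remaining = predictedNextVblankSeconds - GetCurrentTimeSeconds();
        if (remaining > marginSeconds)
        {
            // Coarse, imprecise wait (the WaitForPredictedVblank zone)
            Sleep(static_cast<DWORD>((remaining - marginSeconds) * 1000.0));
        }
        // (In the full frame loop, command recording can happen here)
        // Precise wait on the swap chain waitable object (the WaitForSwap zone)
        WaitForSingleObjectEx(swapChainWaitableObject, INFINITE, FALSE);
    }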

The next step will be to implement simulation updates using fixed timesteps and then record GPU commands at the appropriate time, interpolating between the two appropriate simulation updates. That will remove the big WaitForPredictedVblank zone; instead, some form of individual simulation updates should be visible.

Conclusion

If you’ve made it this far congratulations! I will show the initial screenshot one more time, showing the current state of my engine’s rendering and how work for the GPU is scheduled, recorded, and submitted:

Syncing without VSync

I don’t remember how many years ago it was when I first read about doing this, but it was probably the following page which introduced me to the idea: https://blurbusters.com/blur-busters-lagless-raster-follower-algorithm-for-emulator-developers/. I am pleased to be able to say that I have finally spent the time to try and implement it myself:

How it Works

An application can synchronize with a display’s refresh cycle (instead of using vsync to do it) if two pieces of data are known:

  • The length of a single refresh (i.e. the refresh rate)
    • This is a duration
  • The time when a refresh ends and the next one starts (i.e. the vblank)
    • This is a repeating timestamp (a moment in time)

If both of these are known then the application can predict when the next refresh will start and update the texture that the graphics card is sending to the display at the appropriate time. If the graphics card changes the texture at the wrong time then “tearing” is visible: a horizontal line that separates the previous texture above it from the new texture below it. (This Wikipedia article has a simulated example image of tearing.)

The texture that the graphics card is actively sending to the display is called the “front buffer”. The texture that isn’t visible but that the application can generate before it is activated and sent to the display is called the “back buffer”. There is different terminology for the act of making a back buffer into a front buffer, but this post will call it “swapping”; conceptually it can be thought of as treating the former back buffer as the new front buffer while what was the front buffer becomes a back buffer.

(What vsync does is take care of swapping front and back buffers at the appropriate time. If the application finishes generating a new texture in the back buffer and submits it in time then the operating system and graphics card will work together to make sure that it becomes the new front buffer for the display during the vblank. If the application doesn’t submit it in time and misses the vblank then nothing changes visually: The display keeps showing the old front buffer (meaning that a frame is repeated) and there is no tearing).

Why does swapping the front and back buffers at the wrong time cause tearing? Because even though the swap happens instantaneously in computer memory, the change from one frame to another on a display doesn’t. Instead the display updates gradually, starting at the top and ending at the bottom. Although this isn’t visible to the human eye, the effect can be observed using a slow-motion camera:

With that in mind it is possible to understand what I am doing in my video: I wrote a program that manually swaps front and back buffers four times every refresh period in order to intentionally cause tearing. I am not actually rendering anything, but instead just clearing the entire back buffer to a single color; by swapping a single-color back buffer at the correct time in the middle of a display’s refresh I can change the color somewhere in the middle of the display.
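
At a high level the program is doing something like the following (a sketch assuming a flip-model swap chain created with DXGI_SWAP_CHAIN_FLAG_ALLOW_TEARING; the clear and wait helpers are hypothetical):

    // A sketch of intentionally tearing by presenting four times per refresh.
    void TearingDemo(IDXGISwapChain3* swapChain,
                     double refreshPeriodSeconds, double nextVblankSeconds)
    {
        double nextSwapTimeSeconds = nextVblankSeconds;
        for (unsigned int i = 0; ; i = (i + 1) % 4)
        {
            ClearBackBufferToColor(i);                 // hypothetical helper
            BusyWaitUntilSeconds(nextSwapTimeSeconds); // hypothetical helper
            // Sync interval 0 + ALLOW_TEARING: swap immediately, mid-refresh
            swapChain->Present(0, DXGI_PRESENT_ALLOW_TEARING);
            nextSwapTimeSeconds += refreshPeriodSeconds / 4.0;
        }
    }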

Doing this isn’t particularly new and my results aren’t particularly impressive, but I was happy to finally find the time to try it.

Here is a bonus image where I wasn’t trying to time anything and instead just swapped between alternating red and green as quickly as I was able to:

Implementation Details

The rest of this post contains some additional technical information that I encountered while implementing this. Unlike the preceding section which was aimed at a more general audience the remainder will be for programmers who are interested in specific implementation details.

How to Calculate Time

Calculating the duration of a refresh period isn’t particularly difficult, but it’s not sufficient to simply use the reported refresh rate. Although the nominal refresh rate would be close, it wouldn’t exactly match the time reported by your CPU clock, and that’s what matters because that’s what you’ll be using to know when to swap buffers. In order to know what the refresh rate is in terms of the CPU clock, an average of observed durations must be calculated. To calculate a duration it is necessary to keep track of how much time has elapsed between consistently repeating samples, but it doesn’t actually matter where in the refresh cycle these samples come from as long as they are consistently taken from the same (arbitrary) point in the refresh cycle. So, for example, timing after IDXGISwapChain::Present() returns (with a sync interval of 1 and a full present queue) would work, and timing after IDXGIOutput::WaitForVBlank() returns would also work.
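
For example, here is a sketch of measuring the period against the CPU clock using the WaitForVBlank() approach and an incrementally-updated average:

    // A sketch of measuring the refresh period in terms of the CPU clock by
    // sampling QueryPerformanceCounter() each time WaitForVBlank() returns.
    double MeasureRefreshPeriodSeconds(IDXGIOutput* output, unsigned int sampleCount)
    {
        LARGE_INTEGER frequency, previous, current;
        QueryPerformanceFrequency(&frequency);
        output->WaitForVBlank();
        QueryPerformanceCounter(&previous);
        double meanSeconds = 0.0;
        for (unsigned int n = 1; n <= sampleCount; ++n)
        {
            output->WaitForVBlank();
            QueryPerformanceCounter(&current);
            const double seconds =
                double(current.QuadPart - previous.QuadPart) / double(frequency.QuadPart);
            previous = current;
            meanSeconds += (seconds - meanSeconds) / double(n); // incremental mean
        }
        return meanSeconds;
    }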

It is more difficult, however, to calculate when the vblank actually happens.

DXGI_FRAME_STATISTICS

I finally settled on using IDXGISwapChain::GetFrameStatistics(). Using this meant that I was relying on DXGI instead of taking my own time measurements, but the big attraction of doing that is that the timestamps were then tied directly to discrete counters. Additionally, as a side benefit, after a bit of empirical testing it seemed like the sampled time in the DXGI frame statistics was always earlier than the time that I could sample myself, and so it seems like it is probably closer to the actual vblank than anything that I knew how to measure.

(The somewhat similar DwmGetCompositionTimingInfo() did not end up being as useful for me as I had initially thought. Alternatively, D3DKMTGetScanLine() seems like it could, in theory, be used for even more accurate results, but it isn’t tied to discrete frame counters, which made it more daunting. If my end goal had been just this particular demo I might have tried using that, but for my actual game engine renderer it seemed like IDXGISwapChain::GetFrameStatistics() would be easier, simpler, and more robust.)

The problem that I ran into, however, is that I couldn’t find satisfactory explanations of what the fields of DXGI_FRAME_STATISTICS actually mean. I had to spend a lot of time doing empirical tests to figure it out myself, and I am going to document my findings here. If you found this post using an internet search for any of these DXGI_FRAME_STATISTICS-related terms then I hope this explanation saves you some time. (Alternatively, if you are reading this and find any mistakes in my understanding then please comment with corrections both for me and other readers.)

My Findings

The results of IDXGISwapChain::GetFrameStatistics() are a snapshot in time.

If you call IDXGISwapChain::GetLastPresentCount() immediately after IDXGISwapChain::Present() you will get the correct identifier for the present call that you just made, and this is very important to do in order to be able to correctly associate an individual present function call with the information in the DXGI frame statistics (or, at least, it is conceptually important to do; you can also just keep track yourself of how many successful requests to present have been made).

On the other hand, if you call IDXGISwapChain::GetFrameStatistics() immediately after IDXGISwapChain::Present() there is no guarantee that you will get updated statistics (and, in fact, you most likely won’t). Instead, there is some non-deterministic (for you) moment in time after calling IDXGISwapChain::Present() where you would eventually get statistics for that specific request to present in the results of a call to IDXGISwapChain::GetFrameStatistics().

How do you know if the statistics you get are the ones that you want? You know that they are the ones that you want if the PresentCount field matches the value you got from IDXGISwapChain::GetLastPresentCount() after IDXGISwapChain::Present(). Once you call IDXGISwapChain::GetFrameStatistics() and get a PresentCount that matches the one that you’re looking for then you know two things:

  • The statistics that you now have refer to the known state of things when your submitted request to present (made by your call to IDXGISwapChain::Present()) was actually presented
  • The statistics that you now have will not be updated again for your specific present request. What you now have is the snapshot that was made for your PresentCount, and no more snapshots will be made until another call to IDXGISwapChain::Present() is made (which means that the next time the statistics get updated they will be referring to a different PresentCount from the one that you are currently interested in).
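
Put together, the bookkeeping looks roughly like this (a sketch with error handling omitted; OnStatisticsForPresent is a hypothetical handler):

    // A sketch of associating one specific Present() with its statistics.
    UINT presentCountForThisFrame = 0;
    swapChain->Present(1, 0);
    swapChain->GetLastPresentCount(&presentCountForThisFrame);

    // ... later, possibly several frames later ...
    DXGI_FRAME_STATISTICS stats = {};
    if (SUCCEEDED(swapChain->GetFrameStatistics(&stats))
        && stats.PresentCount == presentCountForThisFrame)
    {
        // This snapshot describes when that specific present was displayed
        OnStatisticsForPresent(stats); // hypothetical handler
    }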

Once you have a DXGI_FRAME_STATISTICS snapshot for your specific PresentCount, the important corresponding field is PresentRefreshCount: if vsync is enabled, it tells you which refresh of the display your request to present was actually presented during.

Once you have that information you can, incidentally, detect whether your request to present actually happened when you wanted and expected it to. This is described at https://learn.microsoft.com/en-us/windows/win32/direct3ddxgi/dxgi-flip-model#avoiding-detecting-and-recovering-from-glitches, in the “to detect a glitch” section. Although the descriptions of PresentCount and PresentRefreshCount in that document (and in other official documentation) are confusing to me, the description of how to detect a glitch is consistent with how I have described these fields above, which helps give me confidence that my understanding is probably correct.

Once you know the information above you can now potentially get timing information. The SyncRefreshCount refers to the same thing as PresentRefreshCount (it is a counter of display refresh cycles), and so it may be confusing why two different fields exist and what the distinction is between the two. PresentRefreshCount is, as described above, a mapping between PresentCount and a display refresh. SyncRefreshCount, on the other hand, is a mapping between the value in SyncQPCTime and a display refresh. The value in SyncQPCTime is a timestamp corresponding to the refresh in SyncRefreshCount. If SyncRefreshCount is the same as PresentRefreshCount then you know (approximately) the time of the vblank when your PresentCount request was actually displayed. It is possible, however, for SyncRefreshCount to be different from PresentRefreshCount, and that is why both fields are in the statistics struct.

To repeat: Information #1 is which display refresh your request was actually displayed in (comparing PresentCount to PresentRefreshCount) and information #2 is what the (approximate) time of a vblank for a specific refresh was (comparing SyncQPCTime to SyncRefreshCount). Derived information #3 is what the (approximate) time of a vblank was for the refresh that your request was actually displayed in.
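
In code, deriving #3 from a matching snapshot could look like this (a sketch; qpcFrequency would come from QueryPerformanceFrequency(), and meanRefreshSeconds is the averaged refresh period described in the next section):

    // A sketch of estimating the vblank time of the refresh our present was
    // displayed in. SyncQPCTime is the sampled time of refresh SyncRefreshCount;
    // if that differs from PresentRefreshCount, extrapolate by whole refreshes.
    const double syncSeconds =
        double(stats.SyncQPCTime.QuadPart) / double(qpcFrequency.QuadPart);
    const int refreshDelta = int(stats.PresentRefreshCount - stats.SyncRefreshCount);
    const double displayedVblankSeconds =
        syncSeconds + double(refreshDelta) * meanRefreshSeconds;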

(Side note: The official documentation here and here is very intentionally vague about when SyncQPCTime is actually measured. The driver documentation here, however, says “CPU time that the vertical retrace started”. I’m not sure if the more accessible user-facing documentation is intentionally vague to not be held accountable for how accurate the timing information is, or if the driver documentation is out-of-date. This post chooses to believe that the time is supposed to refer to the beginning of a refresh, with the caveat that I may be wrong and that even if I’m not wrong the sampled time is clearly not guaranteed to be highly accurate.)

One final thing to mention: A call to IDXGISwapChain::GetFrameStatistics() may return DXGI_ERROR_FRAME_STATISTICS_DISJOINT. One thing to note is that the values in PresentRefreshCount and SyncRefreshCount are monotonically-increasing and, specifically, they don’t reset even when the refresh rate changes. The consequence of this is that the DXGI_ERROR_FRAME_STATISTICS_DISJOINT result is very important for determining timing (like this post is concerned about). If you record the first PresentRefreshCount reported in the first successful call after DXGI_ERROR_FRAME_STATISTICS_DISJOINT was returned then you have a reference point for any future SyncRefreshCounts reported (until the next DXGI_ERROR_FRAME_STATISTICS_DISJOINT). Specifically, you know how many refresh cycles have happened with the current refresh rate.
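
A sketch of that baseline handling (hasBaseline and baseRefreshCount being persistent state with illustrative names):

    // A sketch of establishing a reference point after a disjoint result.
    DXGI_FRAME_STATISTICS stats = {};
    const HRESULT result = swapChain->GetFrameStatistics(&stats);
    if (result == DXGI_ERROR_FRAME_STATISTICS_DISJOINT)
    {
        // The refresh rate changed (or statistics otherwise reset): discard
        // the baseline and any accumulated averages
        hasBaseline = false;
    }
    else if (SUCCEEDED(result) && !hasBaseline)
    {
        baseRefreshCount = stats.PresentRefreshCount; // new reference point
        hasBaseline = true;
    }
    // (stats.SyncRefreshCount - baseRefreshCount) is then the number of
    // refresh cycles that have happened at the current refresh rate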

How to Calculate Refresh Period

Calculating the refresh period using SyncRefreshCount and SyncQPCTime is not difficult: Average the elapsed time between the sampled timestamps of refreshes. I am using the incremental method described here: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm. This is easy to calculate and doesn’t require any storage beyond the current mean and the sample count. It can have problems with outliers or if the duration changes, though, and although I don’t anticipate either of those being issues it remains to be seen.
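
A sketch of the update step (consecutive statistics samples may be more than one refresh apart, so the elapsed time is divided by the refresh delta; the variables holding the previous sample and the running mean are persistent state with illustrative names):

    // A sketch of incrementally averaging the refresh period from
    // SyncRefreshCount/SyncQPCTime pairs.
    const double elapsedSeconds =
        double(stats.SyncQPCTime.QuadPart - prevQpc) / double(qpcFrequency.QuadPart);
    const UINT refreshes = stats.SyncRefreshCount - prevSyncRefreshCount;
    if (refreshes > 0)
    {
        const double observedPeriod = elapsedSeconds / double(refreshes);
        ++sampleCount;
        // Incremental mean: mean += (x - mean) / n
        meanRefreshSeconds += (observedPeriod - meanRefreshSeconds) / double(sampleCount);
    }
    prevQpc = stats.SyncQPCTime.QuadPart;
    prevSyncRefreshCount = stats.SyncRefreshCount;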

How to Predict VBlanks

I did some thinking about how to do this, and after some internet searching (I am not a numerical methods expert and so it took me a while to even figure out the correct search terms for what I was thinking) I found a really nice post about how to do exactly what I wanted, an incrementally-updating method for calculating the line that is the least-squares best fit for a bunch of sample points: https://blog.demofox.org/2016/12/22/incremental-least-squares-curve-fitting/. I liked this because it was a match for the incremental average function I was using, and since refresh cycles should happen regularly I figured that I could use SyncRefreshCount as the independent variable and SyncQPCTime as the dependent variable and then have a really computationally cheap way of predicting the time for any given refresh (in the past or in the future).

The good news is that this worked really well after some tweaking of my initial naive approach! The bad news is that the changes that I had to make in order to make it work well made me nervous about whether it would continue to perform well over time.

The big problem was losing precision. The SyncRefreshCount values are inherently big numbers, but I already had to do some normalizing anyway so that they started at zero (see the discussion above about DXGI_ERROR_FRAME_STATISTICS_DISJOINT), and so that didn’t seem so bad. The SyncQPCTime values, however, are also big numbers. The same trick of starting at zero can be used, and I also represented them as seconds (instead of Windows high-performance counter ticks), and this helped me to get good results. I was worried about the long-term viability of this, however: unlike the incremental method for the average, this method required squaring numbers and multiplying big numbers together, and these numbers would constantly increase over time.

Even though I was quite happy with finding an algorithm that did what I had thought of, once I had implemented it there was still something that bothered me: I was trying to come up with a line equation, where the coefficients are the slope and the y-intercept. I already knew the slope, though, because I had a very good estimate of the duration of a refresh. In other words, I was solving for two unknowns using a bunch of sample points, but I already knew one of those unknowns! What I really wanted was to start with the slope and then come up with an estimate of the y-intercept only, and so it felt like the method I was using should be constrainable even more.

With that in mind I eventually came up with what I think, in hindsight, is a better solution even aside from precision issues. I know the “exact” duration between every vblank (we will conceptually consider that to be known, even though it’s an estimate), and for each reported sample I know the exact number of refreshes since the initial starting point (which is a nice discrete integer value), and then I know the approximate sampled time, which is the noisy repeated sample data I am getting that I want to improve in order to make predictions. What I can do, then, is to calculate what the initial starting time (for refresh count 0) would be, and incrementally calculate an average of that. This gives me the same cheap way of calculating the prediction (just a slope (refresh period) and a y-intercept (this initial timestamp)), but also a cheap way of updating this estimate (using the same incrementally-updating average method that I discussed above). And, even better, I can update it every frame without worrying about numerical problems. (Eventually with enough sample counts there will be issues with the updated value being too small, but that won’t impact the accuracy of the current average if we assume that it is very accurate by then.)
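
A sketch of the update and prediction steps (k is the refresh index since the disjoint baseline, t is SyncQPCTime converted to seconds, and period is the separately-averaged refresh duration; the running-mean variables are illustrative names):

    // A sketch of averaging the implied time of refresh 0 instead of fitting
    // both line coefficients.
    const double impliedStartTime = t - double(k) * period; // time of refresh 0
    ++startTimeSampleCount;
    startTimeMean += (impliedStartTime - startTimeMean) / double(startTimeSampleCount);

    // Predicting the vblank time of any refresh index is then just the line:
    //     predictedSeconds(k) = startTimeMean + k * period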

This means that I don’t have to spend time initially waiting for the averages to converge; instead I can just start with a single sample that is already a reasonably good estimate and then proceed with normal rendering, knowing that my two running averages will keep getting more accurate over time with every new DXGI frame statistic that I get with new SyncRefresh information.