Frame Pacing in a Very Simple Scene

I have recently integrated the Tracy profiler into my engine and it has been a great help to be able to visualize how the CPU and GPU are interacting. Even though what is being rendered is as embarrassingly simple as possible there were some things I had to fix that weren’t behaving as I had intended. Until I saw the data visualized, however, I wasn’t aware that there were problems! I have also been using PIX for Windows, NSight, RenderDoc, and gpuview, but Tracy has really been useful in terms of presenting the information across multiple frames in a way that I can customize to see the relationships that I have wanted to see. I thought that it might be interesting to post about some of the issues with screenshots from the profiler while things are still simple and relatively easy to understand.

Visualizing Multiple Frames

Below is a screenshot of a capture from Tracy:

I have zoomed in at a level where 5 full frames are visible, with a little bit extra at the left and right. You can look for Frame 395, Frame 396, Frame 397, Frame 398, and Frame 399 to see where the frames are divided. These frame boundaries are explicitly marked by me, and I am doing so in a thread dedicated to waiting for IDXGIOutput::WaitForVBlank() and marking the frame; this means that a “frame” in the screenshot above indicates a specific frame of the display’s refresh cycle.
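For reference, here is a minimal sketch of what such a heartbeat thread could look like (this is an illustration rather than my exact code; FrameMarkNamed is Tracy's macro for marking a named frame set, and the output and shouldQuit variables are placeholders):

#include <atomic>
#include <dxgi.h>
#include <tracy/Tracy.hpp>

// Hypothetical vblank heartbeat thread:
// wait for each vertical blank of the display and mark a frame boundary in Tracy
void VBlankHeartbeat(IDXGIOutput* output, const std::atomic<bool>& shouldQuit)
{
	while (!shouldQuit.load(std::memory_order_relaxed))
	{
		// Blocks until the display's next vertical blank
		output->WaitForVBlank();
		// Tell Tracy that a new "display refresh" frame has started
		FrameMarkNamed("Display Refresh");
	}
}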

There is a second frame visualization at the top of the screenshot where there are many green and yellow rectangles. Each one of those represents the same kind of frame that was discussed in the previous paragraph, and the purple bar shows where in the timeline I am zoomed in (it's hard to tell because it's so small, but there are 7 bars within the purple section, corresponding to the 1 + 5 + 1 frames visible at this level of zoom).

In addition to marking frames Tracy allows the user to mark what it calls “zones”. This is a way to subdivide each frame into separate hierarchical sections in order to visualize what is happening at different points in time during a frame. There are currently three threads shown in the capture:

  • The main thread (which is all that my program currently has for doing actual work)
  • An unnamed thread which is the vblank heartbeat thread
  • GPU execution, which is not a CPU thread but instead shows how GPU work lines up with CPU work

To help make sure that I was understanding things properly I have color coded some zones according to which swap chain texture is relevant. At the moment my swap chain only has two textures (meaning that there is only a single back buffer at any one time, and the two textures just get toggled between being the front buffer or the back buffer any time a swap happens), and they are shown with DarkKhaki and SteelBlue. In the heartbeat thread the DisplayFrontBuffer zone is colored according to which texture is actually being displayed during that frame (this is not strictly true because of the Desktop Window Manager compositor, but for the purposes of this post we will pretend that it is true conceptually).

I have used the same colors in the main CPU thread to show which swap chain texture GPU commands are being recorded and submitted for. In other words, the DarkKhaki and SteelBlue colors identify a specific swap chain texture, the heartbeat thread shows when that texture is the front buffer, and the main thread shows when that texture is the back buffer. At the current level of zoom it is hard to read anything in the relevant zones but the colors at least give an idea of when the CPU is doing work for a given swap chain texture before it is displayed.
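As an illustration of how a zone like this can be marked and colored at runtime, here is a hedged sketch using Tracy's ZoneScopedN and ZoneColor macros (the function name, the back buffer index parameter, and the hard-coded colors are all just for this example, not my exact code):

#include <tracy/Tracy.hpp>

void RenderGraphicsFrameOnCpu(const unsigned int backBufferIndex)
{
	// Create a named zone that covers this function
	ZoneScopedN("RenderGraphicsFrameOnCpu");
	// Tracy allows a zone's color to be chosen at runtime;
	// 0xbdb76b is DarkKhaki and 0x4682b4 is SteelBlue
	ZoneColor((backBufferIndex == 0) ? 0xbdb76b : 0x4682b4);
	// ... record and submit GPU commands that target this swap chain texture ...
}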

Unfortunately for this post I don't think that there is a way to dynamically modify the colors of zones in the GPU timeline (instead it seems that they must be known at compile time), and so I can't make the same visual correspondence. From a visualization standpoint I think it would be nice to show some kind of zone for the present queue (using Windows terminology), but even without that it can be understood implicitly. I will discuss the GPU timeline more later in the post when things are zoomed in further.

With all of that explanation let me show the initial screenshot again:

Hopefully it makes some kind of sense now what you are looking at!

Visualizing a Single Frame

Let us now zoom in further to just look at a single frame:

During this Frame 396 we can see that the DarkKhaki texture is being displayed as the front buffer. That means that the SteelBlue texture is the back buffer, which is to say that it is the texture that must be modified so that it can then be shown during Frame 397.

Look at the GPU timeline. There is a very small OliveDrab zone that shows work being done on the GPU. That is where the GPU actually modifies the SteelBlue back buffer texture.

Now look at the CPU timeline. There is a zone called RenderGraphicsFrameOnCpu which is where the CPU is recording the commands for the GPU to execute and then submitting those commands (zones are hierarchical, and so the zones below RenderGraphicsFrameOnCpu are showing it subdivided even further). The color is SteelBlue, and so these GPU commands will modify the texture that was being displayed in Frame 395 and that will again be displayed in Frame 397. You may notice that this section starts before the start of Frame 396, while the SteelBlue texture is still the front buffer and thus is still being displayed! In order to better understand what is happening we can zoom in even further:

Compare this with the previous screenshot. This is the CPU work being done at the end of Frame 395 and the beginning of Frame 396, and it is the work that will determine what is displayed during Frame 397.

The work that is done can be thought of as the following steps (see the sketch after this list):

  • On the CPU, record some commands for the GPU to execute
  • On the CPU, submit those commands to the GPU so that it can start executing them
  • On the CPU, submit a swap command to change the newly-modified back buffer into the front buffer at the next vblank after all GPU commands are finished executing
  • On the GPU, execute the commands that were submitted
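A rough D3D12-flavored sketch of those four steps is below. The variable names are placeholders, and my actual code records the commands earlier and waits for the swap before submitting, as described next:

#include <d3d12.h>
#include <dxgi1_4.h>

void RecordAndSubmitFrame(ID3D12CommandAllocator* commandAllocator,
	ID3D12GraphicsCommandList* commandList,
	ID3D12CommandQueue* commandQueue,
	IDXGISwapChain3* swapChain)
{
	// 1) On the CPU, record commands for the GPU to execute
	commandAllocator->Reset();
	commandList->Reset(commandAllocator, nullptr);
	// ... clear the back buffer, draw the quads, transition it to the present state ...
	commandList->Close();

	// 2) On the CPU, submit those commands so that the GPU can start executing them
	ID3D12CommandList* const listsToExecute[] = { commandList };
	commandQueue->ExecuteCommandLists(1, listsToExecute);

	// 3) On the CPU, submit a swap command; with a sync interval of 1 the newly-modified
	//    back buffer becomes the front buffer at the next vblank after the GPU commands finish
	swapChain->Present(1, 0);

	// 4) The GPU executes the submitted commands asynchronously from the CPU
}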

It is important that the GPU doesn’t start executing any commands that would modify the SteelBlue swap chain texture until that texture becomes the back buffer (and is no longer being displayed). The WaitForSwap zone shows where the CPU is waiting for the swap to happen before submitting the commands (which triggers the GPU to start executing the commands). There is no reason, however, that the CPU can’t record commands ahead of time, as long as those commands aren’t submitted to the GPU until the SteelBlue texture is ready to be modified. This is why the RenderGraphicsFrameOnCpu zone can start early: It records commands for the GPU (you can see a small OliveDrab section where this happens) but then must wait before submitting the commands (the next OliveDrab section).

How early can the CPU start recording commands? There are two different answers to this, depending on how the application works. The simple answer (well, “simple” if you understand D3D12 command allocators) is that recording can start as soon as the GPU has finished executing the previously-submitted commands that were saved in the memory that the new recording is going to reuse. There is a check for this in my code that is so small that it can only be seen if the profiler is zoomed in even further:

The reason this wait is so short is that the GPU work being done is so simple that the GPU had already reached the submitted swap long before the CPU checked.

Do you see that long line between executing the GPU commands and then recording new ones on the CPU? With the small amount of GPU work that my program is currently doing (clearing the texture and then drawing two quads) there isn’t anything to wait for by the time I am ready to start recording new commands.
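The check itself is essentially just a fence comparison. Here is a minimal sketch of the idea, assuming a fence value was signaled on the command queue when the allocator's commands were last submitted (all of the names are placeholders):

#include <d3d12.h>
#include <windows.h>

void WaitUntilCommandAllocatorIsReusable(ID3D12Fence* fence, const UINT64 fenceValueForAllocator, const HANDLE fenceEvent)
{
	// Only wait if the GPU hasn't already passed the fence value that was
	// signaled after this allocator's commands were submitted
	if (fence->GetCompletedValue() < fenceValueForAllocator)
	{
		fence->SetEventOnCompletion(fenceValueForAllocator, fenceEvent);
		WaitForSingleObject(fenceEvent, INFINITE);
	}
	// It is now safe to Reset() the command allocator and record new commands into its memory
}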

If you’ve been following you might be asking yourself why I don’t start recording GPU commands even sooner. Based on what I’ve explained above the program could be even more efficient and start recording commands as soon as the GPU was finished executing the previous commands, and this would definitely be a valid strategy with the simple program that I have right now:

This is a capture that I made after I modified my program to record new GPU commands as soon as possible. The WaitForPredictedVblank zone is gone, the WaitForGpuToReachSwap zone is now visible at this level of zoom, and the WaitForSwap zone is now bigger. The overlapping of DarkKhaki and SteelBlue is much more pronounced because the CPU is starting to work on rendering a new version of the swap chain texture as soon as that swap chain texture is displayed to the user as a front buffer (although notice that the commands still aren’t submitted to the GPU until after the swap happens and the texture is no longer displayed to the user). Based on my understanding this kind of scheduling probably represents something close to the ideal situation if 1) a program wants to use vsync and 2) knows that it can render everything fast enough within one display refresh and 3) doesn’t have to worry about user input.

The next section explains what the WaitForPredictedVblank zone is for and why user input makes the idealized screenshot above not as good as it might at first seem.

When to Start Recording GPU Commands

Earlier I said that there were two different answers to the question of how early the CPU can start recording commands for the GPU. In my profile screenshots there is a DarkRed zone called WaitForPredictedVblank that I haven’t explained yet, but we did observe that it could be removed and that doing so allowed even more efficient scheduling of work. This WaitForPredictedVblank zone is related to the second alternate answer of when to start recording commands.

My end goal is to make a game, which means that the application is interactive and can be influenced by the player. If my program weren’t interactive but instead just had to render predetermined frames as efficiently as possible (something like a video player, for example) then it would make sense to start recording commands for the GPU as soon as possible (as shown in the previous section). The requirement to be interactive, however, makes things more complicated.

The results of an interactive program are non-deterministic. In the context of the current discussion this can be thought of as an additional constraint on when commands for the GPU can start being recorded, which is so simple that it is kind of funny to write out: Commands for the GPU to execute can’t start being recorded until it is known what the commands for the GPU to execute should be. The amount of time between recording GPU commands and the results of executing those commands being displayed has a direct relationship to the latency between a user providing input and the user seeing the result of that input on a display. The later that the contents of a rendered frame are determined the less latency the user will experience.

All of that is a long way of explaining what the WaitForPredictedVblank zone is: It is a placeholder in my engine for dealing with game logic and simulation updates. I can predict when the next vblank is (see the Syncing without VSync post for more details), and I am using that as a target for when to start recording the next frame. Since I don’t actually have any work to do yet I am doing a Sleep() in Windows, and since the results of sleeping have limited precision I only sleep until relatively close to the predicted vblank and then wait on the more reliable swap chain waitable object (this is the WaitForSwap zone):

(Side note: Being able to visualize this in the instrumented profile gives more evidence that my method of predicting when the vblank will happen is pretty reliable, which is gratifying.)
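A rough sketch of that two-stage wait is below. The prediction itself is discussed in the Syncing without VSync post; the sleep margin and variable names here are placeholders, and the waitable object is assumed to come from IDXGISwapChain2::GetFrameLatencyWaitableObject():

#include <windows.h>

void WaitForPredictedVblankThenSwap(const HANDLE swapChainWaitableObject,
	const double predictedVblankSeconds, const double currentSeconds)
{
	// WaitForPredictedVblank: Sleep() only has coarse precision,
	// so sleep until relatively close to the predicted vblank and no further
	constexpr double sleepMarginSeconds = 0.002;
	const double secondsUntilVblank = predictedVblankSeconds - currentSeconds;
	if (secondsUntilVblank > sleepMarginSeconds)
	{
		Sleep(static_cast<DWORD>((secondsUntilVblank - sleepMarginSeconds) * 1000.0));
	}
	// WaitForSwap: the swap chain waitable object is a more reliable signal
	// that it is time to proceed with the frame
	WaitForSingleObject(swapChainWaitableObject, INFINITE);
}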

The next step will be to implement simulation updates using fixed timesteps and then record GPU commands at the appropriate time, interpolating between the two appropriate simulation updates. That will remove the big WaitForPredictedVblank, and instead there will be some form of individual simulation updates which should be visible.
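To give an idea of the direction this is heading, here is a generic fixed-timestep sketch in the style of the classic "Fix Your Timestep!" pattern; none of these types or functions exist in my engine yet:

// Hypothetical types and functions, purely to illustrate the idea
struct sSimulationState
{
	// ... positions, orientations, etc. ...
};
struct sSimulation
{
	void Update(double fixedTimestepSeconds);
	sSimulationState InterpolateMostRecentStates(double alpha) const;
};
void RecordGraphicsCommandsForState(const sSimulationState& state);

void UpdateSimulationAndRender(sSimulation& simulation, const double elapsedFrameSeconds)
{
	constexpr double fixedTimestepSeconds = 1.0 / 60.0;
	static double accumulatedSeconds = 0.0;
	accumulatedSeconds += elapsedFrameSeconds;
	// Run zero or more fixed updates depending on how much real time has passed
	while (accumulatedSeconds >= fixedTimestepSeconds)
	{
		simulation.Update(fixedTimestepSeconds);
		accumulatedSeconds -= fixedTimestepSeconds;
	}
	// Record GPU commands for a state interpolated between the two most recent updates
	const double alpha = accumulatedSeconds / fixedTimestepSeconds;
	RecordGraphicsCommandsForState(simulation.InterpolateMostRecentStates(alpha));
}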

Conclusion

If you’ve made it this far congratulations! I will show the initial screenshot one more time, showing the current state of my engine’s rendering and how work for the GPU is scheduled, recorded, and submitted:

Devlog 2024-12-16

This is a regularly-occurring status update. More generally-relevant posts can be found under Features (see Creating a Game and Engine from Scratch for context).

This is the beginning of week 9.

What I have done

  • The texture is now uploaded using the copy command queue instead of the graphics command queue
  • Made a platform-independent interface for vertex buffers
  • Made a platform-independent interface for index buffers and am using it now to draw the simple square
  • Made a platform-independent interface for constant buffers and am using it now to draw two squares (where it is being treated as a per-draw constant buffer)
  • Added the Tracy profiler
    • Implemented an interface for CPU profiling that doesn’t require adding anything from Tracy in header files
    • Implemented an interface for GPU profiling that doesn’t require adding anything from Tracy in header files
    • Modified the PIX for Windows integration to work more like Tracy
  • Build system
    • Added a Lua function to resolve paths
      • This makes it easier and nicer to generate header #include files so that resolving environment variables can be done automatically
    • Fixed a tricky bug when adding multiple levels of filters in Visual Studio
    • Fixed a bug when several different C++ source files specified different ways of building that were independent of the general way specified for the target
    • Found a workaround to help Intellisense give proper x64 suggestions (instead of x86)
      • This didn’t affect the actual compiled or linked files, but it was a relief to finally have pointers and struct sizes reported correctly

Next Steps

I am getting stuck (classic “analysis paralysis”) on trying to figure out how I want frame updates to work. This is kind of where I got stuck previously when I was doing the “beam racing” stuff, and even though I finally gave up and moved on I am running into it again with a different context now while I am trying to figure out how to have the application tell the graphics system what to render and when. It would be easy to just do something simple and move on and maybe I should, but one of the personal goals that I’ve wanted to do with this project is to have fixed simulation timesteps that are independent from the fixed display refresh rate, and what I want to accomplish is pretty straightforward even though I’m getting hung up on details.

  • Implementing the Tracy profiler has been a big help, and I am now able to better visualize what my code is actually doing. I need to somehow get something in place, even if it’s just rudimentary, so that I can have simulation updates (that can run both slower and faster than the display refresh rate) and then choose the appropriate time to render an interpolation between two of them.
  • I still need to implement a way in the build system to compile shaders offline and then a way to load them at runtime, even if very simple
    • (Recall that I am avoiding the standard library and general-purpose memory management, which is why some of those seemingly-simple tasks involve more work than might be expected)
  • I want to add a camera and 3D transforms

Devlog 2024-12-09

This is a regularly-occurring status update. More generally-relevant posts can be found under Features (see Creating a Game and Engine from Scratch for context).

This is the beginning of week 8.

What I have done

I got very sick this week and so a lot of time was lost.

  • Fix precompiled header source linking
    • It turns out that on MSVC it’s not enough to use a precompiled header: the object file compiled from the source (i.e. CPP) file that created it must also be linked. This kind of makes sense in retrospect, but things were somehow working without doing that for a while.
  • Created a texture resource in an “upload” heap, mapped and copied texture data to it there, and then copied the resource to the shader-visible default heap
    • This all just followed the Microsoft “hello” sample
  • Create constant buffer in different ways and different places to move triangle mesh around
    • I think I am finally understanding how resource management works in D3D12, after an embarrassingly long time and an embarrassing amount of re-reading different documentation and forum posts
    • It’s still unclear to me what the best strategy to manage it all is, but at least that isn’t embarrassing because it seems intentionally flexible and so everyone is a little unclear because there are many different ways to approach it
    • Additionally, since it was originally released there are new techniques for “bindless” rendering and it seems like I will probably want to use those for ray tracing, and so there is still more to learn. One step at a time, though…

Next Steps

The sickness has thrown me off of my schedule and I also find that I am suffering from the mental symptom of wanting-everything-perfect-and-not-knowing-where-to-start with some of the problems that I have to tackle next. I am at the point where I just need to force myself to do something to make progress, acknowledging that it might not be ideal and will require refactoring later. (I actually have pretty clear ideas of how I would naturally do some of this stuff, but I am also trying to approach things in a purely “data-oriented” way for this project, and that is contributing to some of my hesitancy because I wonder if my natural tendency is the result of old habits.)

I think there are three main things that I want to do next:

  1. Refactor some native D3D12 objects so that I can work with them in a more ergonomic (and potentially platform-independent) way, but more specifically so that I can specify things to render from the application rather than hardcode things in the graphics system
  2. Expand the constant buffer scheme so that I can have a camera and multiple objects with transforms and materials (even if these are very simple initially)
  3. Add the capability to the build system to run an arbitrary command line so that shaders can be compiled offline, both because I want to have shader errors reported at build time but also so that I can start using shader model 6.6 for the new bindless resources in a more sane way

I have also thought of a variation on breakout that seems like a simple way to incorporate some ray tracing, and so, at least for now, I think that might be my initial small proof-of-concept application to try and make, just so that I have some kind of example application rather than the current empty program that does nothing.

Devlog 2024-12-02

This is a regularly-occurring status update. More generally-relevant posts can be found under Features (see Creating a Game and Engine from Scratch for context).

This is the beginning of week 7.

What I have done

  • Finished the intentional tearing demo, which I wrote a post about
    • This was both satisfying and frustrating. I was happy to get something working that I had wanted to do for a long time but also disappointed that I didn’t get better results. I am also frustrated that I still don’t feel like I have as good of an understanding of swap chains and related flipping and timing as I would like.
    • I finally gave up and moved on and will have to revisit some of this when I have more complicated scenes being rendered, but it felt a bit like accepting defeat
  • Improved the waiting code for the case where the wait happens on the Windows message queue thread
    • Right now the entire application is single threaded, but a rudimentary mechanism is in place to detect the thread (what I am currently calling the “display thread”) and do a different kind of wait so that new messages can be processed if appropriate
  • Render a triangle
    • This was mainly just following the Microsoft Hello-Triangle example, and so not particularly impressive
  • Improved the array view class
    • This is my implementation of std::span, and the initial implementation revealed some problems when dealing with COM ISomething* pointers, both because of pointer const complexity and because of inheritance
    • I wrote a bunch of tests of situations that I could think of and the class now works better. This presented a fun (though frustrating) challenge to solve with C++ templates.
  • Add WinPixEventRuntime
    • This is the first external code/library that I have included in the project, which is a little unusual for me. Usually I start with {fmt} and Lua in order to have logging and configuration available, but in the current case I am delaying dealing with any strings and so haven’t followed my usual pattern.
    • This integration allows me to annotate captures in PIX for Windows
  • Add precompiled header support in the build system
    • Adding the ability to (force) include files wasn’t particularly difficult, and that was the primary motivator for doing this so that I could start writing and structuring code the way that I wanted to. I thought that going one more step with precompiled headers wouldn’t be too difficult but it ended up being more challenging than I had expected.
    • MSVC has some interesting constraints that I hadn’t been aware of where any files using a precompiled header have to use the same PDB file. Most notably, this means that a single PDB file is created and modified in many different steps by many different files getting compiled, which didn’t fit in well to my build system model where every task graph node (i.e. file) was the result of a single sub task. I think that I figured out a satisfactory way around this, though.
  • Add ability in the build system to query whether a sub task has a specific input
    • This allows me to decide whether a specific application needs the WinPixEventRuntime DLL
  • Add ability in the build system to query for the primary/obvious output target of a sub task
    • This allows me to decide where to stage the WinPixEventRuntime so that it is next to an application executable
  • Add ability in the build system to create files
    • This allows me to programmatically create a header file that #includes the WinPixEventRuntime headers using the current version number, sharing code with the LIB and DLL files

Next Steps

  • An obvious need is to add something to the build system to deal with shaders
    • Right now I just have a manually-copied shader source file in the application root directory that is compiled at run time, which is obviously not ideal
    • I am a little hesitant to start going down the path of figuring out asset building and loading, however, because I don’t have any string solution (recall that a goal of this project is to not use the standard library and to not use the general purpose new allocator, and so dealing with strings and paths is a big task to tackle)
    • With that being said, I could at least create a new type of build system sub task that allows a command line to be run and inputs and outputs to be specified. This would make it possible to work with the shader close to what I’m doing now but not quite so manually, and seems like an ok first step.
  • The other obvious task is to keep adding graphics features beyond the simple triangle and color-changing clears
    • I might at least continue to implement some of the D3D12 “Hello” examples, to increase my familiarity with the changes in D3D12 compared to D3D11 and to give me more time and experience to start thinking about abstractions and how to structure an actual renderer the way that I would want to

Syncing without VSync

I don’t remember how many years ago it was when I first read about doing this, but it was probably the following page which introduced me to the idea: https://blurbusters.com/blur-busters-lagless-raster-follower-algorithm-for-emulator-developers/. I am pleased to be able to say that I have finally spent the time to try and implement it myself:

How it Works

An application can synchronize with a display’s refresh cycle (instead of using vsync to do it) if two pieces of data are known:

  • The length of a single refresh (i.e. the refresh rate)
    • This is a duration
  • The time when a refresh ends and the next one starts (i.e. the vblank)
    • This is a repeating timestamp (a moment in time)

If both of these are known then the application can predict when the next refresh will start and update the texture that the graphics card is sending to the display at the appropriate time. If the graphics card changes the texture at the wrong time then “tearing” is visible, which is a horizontal line that separates the previous texture above it from the new texture below it. (This Wikipedia article has a simulated example image of tearing.)
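The prediction itself is just arithmetic. Here is a minimal sketch, assuming both values have already been measured against the same CPU clock and converted to seconds:

#include <cmath>

// Predict the time of the next vblank after "now" given the timestamp of any
// previously-observed vblank and the duration of a single refresh
double PredictNextVblankSeconds(const double knownVblankSeconds,
	const double refreshPeriodSeconds, const double nowSeconds)
{
	const double refreshesSinceKnownVblank = std::ceil((nowSeconds - knownVblankSeconds) / refreshPeriodSeconds);
	return knownVblankSeconds + (refreshesSinceKnownVblank * refreshPeriodSeconds);
}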

The texture that the graphics card is actively sending to the display is called the “front buffer”. The texture that isn’t visible, but that the application can generate into before it is activated and sent to the display, is called the “back buffer”. There is different terminology for the act of making a back buffer into a front buffer, but this post will call it “swapping”, and conceptually it can be thought of as treating the former back buffer as the new front buffer while what was the front buffer becomes a back buffer.

(What vsync does is take care of swapping front and back buffers at the appropriate time. If the application finishes generating a new texture in the back buffer and submits it in time then the operating system and graphics card will work together to make sure that it becomes the new front buffer for the display during the vblank. If the application doesn’t submit it in time and misses the vblank then nothing changes visually: The display keeps showing the old front buffer (meaning that a frame is repeated) and there is no tearing).

Why does swapping the front and back buffers at the wrong time cause tearing? Because even though swapping the front and back buffers happens instantaneously in computer memory the change from one frame to another on a display doesn’t. Instead the display updates gradually, starting at the top and ending at the bottom. Although this isn’t visible to a human eye the effect can be observed using a slow motion camera:

With that in mind it is possible to understand what I am doing in my video: I wrote a program that is manually swapping front and back buffers four times every refresh period in order to intentionally cause tearing. I am not actually rendering anything but instead just doing a very simple clearing of the entire back buffer to some single color, but by swapping a single color back buffer at the correct time in the middle of a display’s refresh I can change the color somewhere in the middle of the display.

Doing this isn’t particularly new and my results aren’t particularly impressive, but I was happy to finally find the time to try it.

Here is a bonus image where I wasn’t trying to time anything and instead just swapped between alternating red and green as quickly as I was able to:

Implementation Details

The rest of this post contains some additional technical information that I encountered while implementing this. Unlike the preceding section which was aimed at a more general audience the remainder will be for programmers who are interested in specific implementation details.

How to Calculate Time

Calculating the duration of a refresh period isn’t particularly difficult but it’s not sufficient to simply use the reported refresh rate. Although the nominal refresh rate would be close it wouldn’t exactly match the time reported by your CPU clock, and that’s what matters because that’s what you’ll be using to know when to swap buffers. In order to know what the refresh rate is in terms of the CPU clock an average of observed durations must be made. In order to calculate a duration it is necessary to keep track of how much time has elapsed between consistently repeating samples, but it doesn’t actually matter where in the refresh cycle these samples come from as long as they are consistently taken from the same (arbitrary) point in the refresh cycle. So, for example, timing after IDXGISwapChain::Present() returns (with a sync interval of 1 and a full present queue) would work, and timing after IDXGIOutput::WaitForVBlank() returns would also work.

It is more difficult, however, to calculate when the vblank actually happens.

DXGI_FRAME_STATISTICS

I finally settled on using IDXGISwapChain::GetFrameStatistics(). Using this meant that I was relying on DXGI instead of taking my own time measurements, but the big attraction of doing that is that the timestamps were then tied directly to discrete counters. Additionally, as a side benefit, after a bit of empirical testing it seemed like the sampled time in the DXGI frame statistics was always earlier than the time that I could sample myself, and so it seems like it is probably closer to the actual vblank than anything that I knew how to measure.

(The somewhat similar DwmGetCompositionTimingInfo() did not end up being as useful for me as I had initially thought. Alternatively, D3DKMTGetScanLine() seems like it could, in theory, be used for even more accurate results, but it wasn’t tied to discrete frame counters which made it more daunting. If my end goal had been just this particular demo I might have tried using that, but for my actual game engine renderer it seemed like IDXGISwapChain::GetFrameStatistics() would be easier, simpler and more robust.)

The problem that I ran into, however, is that I couldn’t find satisfactory explanations of what the fields of DXGI_FRAME_STATISTICS actually mean. I had to spend a lot of time doing empirical tests to figure it out myself, and I am going to document my findings here. If you found this post using an internet search for any of these DXGI_FRAME_STATISTICS-related terms then I hope this explanation saves you some time. (Alternatively, if you are reading this and find any mistakes in my understanding then please comment with corrections both for me and other readers.)

My Findings

The results of IDXGISwapChain::GetFrameStatistics() are a snapshot in time.

If you call IDXGISwapChain::GetLastPresentCount() immediately after IDXGISwapChain::Present() you will get the correct identifier for the present call that you just barely made, and this is very important to do in order to be able to correctly associate an individual present function call that you made with the information in the DXGI frame statistics (or, at least, it is conceptually important to do; you can also just keep track yourself of how many successful requests to present have been made).

On the other hand, if you call IDXGISwapChain::GetFrameStatistics() immediately after IDXGISwapChain::Present() there is no guarantee that you will get updated statistics (and, in fact, you most likely won’t). Instead, there is some non-deterministic (for you) moment in time after calling IDXGISwapChain::Present() where you would eventually get statistics for that specific request to present in the results of a call to IDXGISwapChain::GetFrameStatistics().

How do you know if the statistics you get are the ones that you want? You know that they are the ones that you want if the PresentCount field matches the value you got from IDXGISwapChain::GetLastPresentCount() after IDXGISwapChain::Present(). Once you call IDXGISwapChain::GetFrameStatistics() and get a PresentCount that matches the one that you’re looking for then you know two things:

  • The statistics that you now have refer to the known state of things when your submitted request to present (made by your call to IDXGISwapChain::Present()) was actually presented
  • The statistics that you now have will not be updated again for your specific present request. What you now have is the snapshot that was made for your PresentCount, and no more snapshots will be made until another call to IDXGISwapChain::Present() is made (which means that the next time the statistics get updated they will be referring to a different PresentCount from the one that you are currently interested in).

Once you have a DXGI_FRAME_STATISTICS that is a snapshot for your specific PresentCount the important corresponding number is PresentRefreshCount: with vsync enabled, it tells you which refresh of the display your request to present was actually presented during.

Once you have that information you can, incidentally, detect whether your request to present actually happened when you wanted and expected it to. This is described at https://learn.microsoft.com/en-us/windows/win32/direct3ddxgi/dxgi-flip-model#avoiding-detecting-and-recovering-from-glitches, in the “to detect a glitch” section. Although the description of what PresentCount and PresentRefreshCount mean is confusing to me in that document (and in other official documentation), the description of how to detect a glitch is consistent with how I have described these fields above, which helps to give me confidence that my understanding is probably correct.

Once you know the information above you can now potentially get timing information. The SyncRefreshCount refers to the same thing as PresentRefreshCount (it is a counter of display refresh cycles), and so it may be confusing why two different fields exist and what the distinction is between the two. PresentRefreshCount is, as described above, a mapping between PresentCount and a display refresh. SyncRefreshCount, on the other hand, is a mapping between the value in SyncQPCTime and a display refresh. The value in SyncQPCTime is a timestamp corresponding to the refresh in SyncRefreshCount. If SyncRefreshCount is the same as PresentRefreshCount then you know (approximately) the time of the vblank when your PresentCount request was actually displayed. It is possible, however, for SyncRefreshCount to be different from PresentRefreshCount, and that is why both fields are in the statistics struct.

To repeat: Information #1 is which display refresh your request was actually displayed in (comparing PresentCount to PresentRefreshCount) and information #2 is what the (approximate) time of a vblank for a specific refresh was (comparing SyncQPCTime to SyncRefreshCount). Derived information #3 is what the (approximate) time of a vblank was for the refresh that your request was actually displayed in.
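Putting that together, a hedged sketch of the flow looks something like the following (error handling is minimal, remembering the present count is shown in the comment rather than in the function, and the function name is a placeholder):

#include <dxgi.h>
#include <windows.h>

// Immediately after a successful Present(), remember which request it was:
//	swapChain->Present(1, 0);
//	UINT lastPresentCount = 0;
//	swapChain->GetLastPresentCount(&lastPresentCount);

// Later, poll until the statistics snapshot refers to that specific present request
bool TryGetVblankTimeForPresent(IDXGISwapChain* swapChain, const UINT lastPresentCount, LARGE_INTEGER& o_vblankTime)
{
	DXGI_FRAME_STATISTICS statistics = {};
	if (FAILED(swapChain->GetFrameStatistics(&statistics)))
	{
		return false;	// No statistics yet, or DXGI_ERROR_FRAME_STATISTICS_DISJOINT
	}
	if (statistics.PresentCount != lastPresentCount)
	{
		return false;	// The snapshot doesn't refer to our present request (yet, or anymore)
	}
	// Information #1: PresentRefreshCount is the display refresh our request was shown in.
	// Information #2: SyncQPCTime is the (approximate) vblank time for the refresh in SyncRefreshCount.
	if (statistics.SyncRefreshCount != statistics.PresentRefreshCount)
	{
		return false;	// The sampled time belongs to a different refresh than ours
	}
	// Derived information #3: the (approximate) vblank time of the refresh our request was shown in
	o_vblankTime = statistics.SyncQPCTime;
	return true;
}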

(Side note: The official documentation here and here is very intentionally vague about when SyncQPCTime is actually measured. The driver documentation here, however, says “CPU time that the vertical retrace started”. I’m not sure if the more accessible user-facing documentation is intentionally vague to not be held accountable for how accurate the timing information is, or if the driver documentation is out-of-date. This post chooses to believe that the time is supposed to refer to the beginning of a refresh, with the caveat that I may be wrong and that even if I’m not wrong the sampled time is clearly not guaranteed to be highly accurate.)

One final thing to mention: A call to IDXGISwapChain::GetFrameStatistics() may return DXGI_ERROR_FRAME_STATISTICS_DISJOINT. One thing to note is that the values in PresentRefreshCount and SyncRefreshCount are monotonically-increasing and, specifically, they don’t reset even when the refresh rate changes. The consequence of this is that the DXGI_ERROR_FRAME_STATISTICS_DISJOINT result is very important for determining timing (like this post is concerned about). If you record the first PresentRefreshCount reported in the first successful call after DXGI_ERROR_FRAME_STATISTICS_DISJOINT was returned then you have a reference point for any future SyncRefreshCounts reported (until the next DXGI_ERROR_FRAME_STATISTICS_DISJOINT). Specifically, you know how many refresh cycles have happened with the current refresh rate.

How to Calculate Refresh Period

Calculating the refresh period using SyncRefreshCount and SyncQPCTime is not difficult: Average the elapsed time between the sampled timestamps of refreshes. I am using the incremental method described here: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford’s_online_algorithm. This is easy to calculate and doesn’t require any storage beyond the current mean and the sample count. It can have problems with outliers or if the duration changes, though, and although I don’t anticipate either of those being issues it remains to be seen.
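A minimal sketch of that running mean is below (the struct and field names are placeholders; each new sample would be the elapsed SyncQPCTime between two statistics snapshots divided by the number of refreshes between their SyncRefreshCounts):

#include <cstdint>

struct sRunningMean
{
	double mean = 0.0;
	uint64_t sampleCount = 0;

	void AddSample(const double sample)
	{
		// Incremental update: no storage is needed beyond the current mean and the count
		++sampleCount;
		mean += (sample - mean) / static_cast<double>(sampleCount);
	}
};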

How to Predict VBlanks

I did some thinking about how to do this, and after some internet searching (I am not a numerical methods expert and so it took me a while to even figure out the correct search terms for what I was thinking) I found a really nice post about how to do exactly what I wanted, an incrementally-updating method for calculating the line that is the least-squares best fit for a bunch of sample points: https://blog.demofox.org/2016/12/22/incremental-least-squares-curve-fitting/. I liked this because it was a match for the incremental average function I was using, and since refresh cycles should happen regularly I figured that I could use SyncRefreshCount as the independent variable and SyncQPCTime as the dependent variable and then have a really computationally cheap way of predicting the time for any given refresh (in the past or in the future).

The good news is that this worked really well after some tweaking of my initial naive approach! The bad news is that the changes that I had to make in order to make it work well made me nervous about whether it would continue to perform well over time.

The big problem was losing precision. The SyncRefreshCount values are inherently big numbers, but I already had to do some normalizing anyway so that they started at zero (see the discussion above about DXGI_ERROR_FRAME_STATISTICS_DISJOINT), and so that didn’t seem so bad. The SyncQPCTime values, however, are also big numbers. The same trick of starting at zero can be used, and I also represented them as seconds (instead of Windows high performance counter ticks), and this helped me to get good results. I was worried about the long-term viability of this, however: Unlike the incremental method for the average, this method required squaring numbers and multiplying big numbers together, and these numbers would constantly increase over time.

Even though I was quite happy with finding an algorithm that did what I had thought of, once I had implemented it there was still something that bothered me: I was trying to come up with a line equation, where the coefficients are the slope and the y-intercept. I already knew the slope, though, because I had a very good estimate of the duration of a refresh. In other words, I was solving for two unknowns using a bunch of sample points, but I already knew one of those unknowns! What I really wanted was to start with the slope and then come up with an estimate of the y-intercept only, and so it felt like the method I was using should be constrainable even more.

With that in mind I eventually came up with what I think, in hindsight, is a better solution even aside from precision issues. I know the “exact” duration between every vblank (we will conceptually consider that to be known, even though it’s an estimate), and for each reported sample I know the exact number of refreshes since the initial starting point (which is a nice discrete integer value), and then I know the approximate sampled time, which is the noisy repeated sample data I am getting that I want to improve in order to make predictions. What I can do, then, is to calculate what the initial starting time (for refresh count 0) would be, and incrementally calculate an average of that. This gives me the same cheap way of calculating the prediction (just a slope (refresh period) and a y-intercept (this initial timestamp)), but also a cheap way of updating this estimate (using the same incrementally-updating average method that I discussed above). And, even better, I can update it every frame without worrying about numerical problems. (Eventually with enough sample counts there will be issues with the updated value being too small, but that won’t impact the accuracy of the current average if we assume that it is very accurate by then.)
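Here is a sketch of this approach, reusing the sRunningMean sketch from earlier (the refresh period is treated as known, and each noisy sample implies a timestamp for refresh 0 that gets averaged; the names are placeholders):

struct sVblankPredictor
{
	double refreshPeriodSeconds = 0.0;	// The slope: estimated separately, as described above
	sRunningMean refreshZeroTimeSeconds;	// The y-intercept: running average of the implied time of refresh 0

	void AddSample(const uint64_t refreshIndex, const double sampledVblankSeconds)
	{
		// Each noisy sample implies what the timestamp of refresh 0 would have been
		const double impliedRefreshZeroTime = sampledVblankSeconds - (static_cast<double>(refreshIndex) * refreshPeriodSeconds);
		refreshZeroTimeSeconds.AddSample(impliedRefreshZeroTime);
	}

	double PredictVblankSeconds(const uint64_t refreshIndex) const
	{
		// The prediction is just a line: intercept + (slope * x)
		return refreshZeroTimeSeconds.mean + (static_cast<double>(refreshIndex) * refreshPeriodSeconds);
	}
};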

This means that I don’t have to spend time initially waiting for the averages to converge; instead I can just start with a single sample that is already a reasonably good estimate and then proceed with normal rendering, knowing that my two running averages will keep getting more accurate over time with every new DXGI frame statistic that I get with new SyncRefresh information.

Devlog 2024-11-25

This is a regularly-occurring status update. More generally-relevant posts can be found under Features (see Creating a Game and Engine from Scratch for context).

This is the beginning of week 6.

What I have done

  • Initialized Direct3D 12 and cleared the window
  • Calculate timing of display refresh so that vblanks can be predicted
    • I am able to display horizontal color bars on the display without vsync enabled and with only clearing the entire window by properly timing and predicting when to swap the back buffer
    • This is something that I’ve wanted to do for years after reading about it and so it was fun to finally make it happen. I will make a dedicated post about it once I have cleaned up some remaining problems.
  • Added the ability to specify C++ compile optimizations and update the project configurations accordingly

Next Steps

  • Improve display timing
    • There are still problems that I don’t understand. Rendering without vsync isn’t important because the real engine won’t work that way and so it’s fine if the fun demo of color bars doesn’t work perfectly. What is important, however, is that the underlying timing information is correct for what I want to do with the real engine and I’m concerned with some of the behavior that I occasionally see that I don’t understand because it could lead to hard-to-diagnose problems later.
    • Even though I did the color bars as a fun project they are actually a really good way to visualize whether the timing is correct and so I think it is good to keep working on it before I undo the fun temporary hacky code and make things work the way that they are supposed to.
  • Create a separate thread for the display loop and the game logic
    • I was going to delay doing this until later, but as I’ve been working on the swap chain code I think it would be good to get this more formalized early
  • Draw a triangle
    • I had thought about adding better input next but for now I can just detect key presses using GetAsyncKeyState() in a hacky way and that means I can quickly experiment without having to go through an application layer. So I think I have changed my mind and that it is a better idea to keep working on the graphics system for the moment.

Devlog 2024-11-18

This is a regularly-occurring status update. More generally-relevant posts can be found under Features (see Creating a Game and Engine from Scratch for context).

This is the beginning of week 5.

What I have done

  • Build system changes:
    • Include directory search paths are used in the dependency hash so that if they change a file is recompiled
    • The use of standard libraries or platform libraries in the compiled files that a target depends on now influences its own compiled files’ include directory search paths
      • This is so that if some downstream target #includes a file that #includes <windows.h>, for example, that it will work as expected (meaning that specifying the static library dependency is enough)
    • Added features to help IntelliSense
      • Include directories and preprocessor defines are now added to VCXPROJ files
      • Compile arguments are now added to VCXPROJ files
        • (It turns out that this is important for IntelliSense for things like detecting what version of C++ is used; without it there are lots of false positive red squiggles when using newer features)
      • Files are now added to VCXPROJ files
        • Currently this is done by specifying a file path, along with an optional type (e.g. for C++ compiling or C++ including) and an optional logical display folder (so that files can be organized in a logical file structure in e.g. Solution Explorer).
        • I mostly like this, but one small annoyance is that for C++ files that get compiled it means that each source CPP file has to be listed twice. For now I have decided that the way it is makes sense (rather than e.g. automatically adding these files to be displayed) but it’s something I will have to keep thinking about as I get more experience.
      • Individual files can now be compiled in Visual Studio
        • This involved both adding a pattern for command lines and adding a new capability to jpmake. Previously I had a command line where a file path could be specified to generate just that file (i.e. it was an output file path), but in order to support the Visual Studio functionality the file path must be an input path.
      • Individual C++ files to compile can now have individual C++ build info, rather than always using the C++ build info of the target
        • This was already supported under the hood in jpmake, but I didn’t have a Lua interface for it
  • A new EmptyWindow application exists
    • All this does is create and display a main window, so it is the most basic application imaginable, but it was still exciting to be working on real code rather than the build system
    • I made a post about it
  • I have added many type traits
    • So far these are just clones of standard library ones that I have needed, but it has been fun to learn more about how some of this template “magic” works
    • I made a post about it
  • I added scope guard classes
  • I added custom implementations of move and forward
    • MakeMovable() and ForwardReference() are macros based on https://www.foonathan.net/2020/09/move-forward/
      • I discovered accidentally (there was some reddit or stack overflow comment about that link I think) that the macro version enforces actual moves with no copying fallback, which is something that I have wanted but wasn’t sure if or how it was implementable. After figuring out why the macro had this difference (related to const vs. non-const rvalue references) I have also added a MakeMovableIfPossible() in case I ever want it, which just does the same thing that std::move() does.
  • I added volatile load and store functions, with relaxed, acquire, or release memory ordering
  • I added the ability to allocate and free memory from the operating system
    • (It is also possible to reserve and commit memory in steps if the operating system supports it. This is something I use in jpmake, but don’t currently anticipate using in the game engine itself.)
  • I made array view and memory view classes
    • These don’t have all of the functionality that I would like, but they are a good start of a fundamental building block
    • The cMemoryView class can have a view of both const and mutable memory, and allows slices to be made for partitioning memory
  • I have added memory arena and allocator classes
    • This is just a starting point, but I do have a monotonic allocator and am able to partition and then logically allocate memory from the operating system
    • I have New()/Delete() functions for working with typed memory and Allocate()/Free() functions for working with untyped memory, all of which require an arena
      • The goal is to allocate a single big block of operating system memory when the application starts and then only work with that for the lifetime of the application by partitioning that into smaller sections and using hierarchical arenas to allocate in a controlled way (“controlled” both in terms of lifetimes and cache friendliness).
  • I have set up the machinery for creating a graphics “engine” and created a D3D12 device
    • The application lets the graphics library create a graphics engine, passing in itself and a block of memory from the operating system that is budgeted for graphics. The graphics engine stores a reference to the application who created it and returns a pointer to itself that the application can refer to. This is my attempt at dependency injection while still allowing easy access to different systems.
    • The graphics engine will have a platform-specific context, and so far I am just creating a D3D12 device to store in it as a proof of concept to show that I am able to access the platform-specific D3D functionality and allocate and store things with the new memory system.

Next Steps

  • Use D3D12 to clear the main window to some arbitrary color
  • Implement the top part of the window as one clear color and the bottom part of the window as a different clear color using ideas discussed at https://blurbusters.com/blur-busters-lagless-raster-follower-algorithm-for-emulator-developers/
    • This isn’t how I intend my engine to run (I will use VSync), but ever since reading about it I have wanted to try and do it to make sure I understood the timing involved, and I think getting this to work would show that I have the fundamentals in place before implementing how I really want my renderer to work.

Type Trait Coding Style

One of my goals while creating a custom game engine is to avoid using the C++ standard library. This post won’t discuss why, but one consequence of that goal that I have run into during this early phase of the project is that I have had to recreate some fundamental machinery that the standard library provides. By “recreate” I don’t mean that I have figured things out myself starting from nothing but first principles, but instead that I have had to look at existing documentation and implementations to understand how something works and then reimplement it in my own style. It has been fun to learn a bit more about how this corner of C++ metaprogramming works; although I have used these features (some of them frequently) I have only vaguely understood how they might have actually been implemented.

Doing this has also had the interesting consequence of presenting me with challenges to my existing coding style and I have had to expand and adapt. This post gives some examples of changes to my coding style that I have been experimenting with in order to accommodate type traits.

Existing Style

My current style uses a lowercase single letter prefix to indicate type names, meaning that if an identifier is spelled e.g. jSomething or pSomething the reader can immediately know that those names identify types, even without knowing what the j or p might mean (and, to be clear, those are just examples, and neither the j nor the p prefix exists (yet)).

Class names start with a c prefix:

class cMyClass;
class cSomeOtherClass;

Struct names start with an s prefix, base class names start with an i prefix (for “interface”, where my goal is that leaf classes are the only ones that are ever instantiated), and enumeration names start with an e prefix:

// When I use a struct instead of a class it means:
//	* The member variables are public instead of private
//	* The member variables use a different naming scheme
struct sMyStruct;
// The reader can tell that the following is a base class
// (and conceptually an abstract base class)
// just from the type name alone:
class iMyBaseClass;
// Using this enum convention makes it
// less annoying to use scope enumerations
// because the prefix makes it instantly identifiable
// and so the identifiers can be chosen accordingly:
enum class eMyState
{
	On,
	Off,
};

Type names start with a t prefix, e.g.:

// Creating a type alias:
using tMySize = int;
// In templates:
template<typename tKey, typename tValue>
class cMyMap;

(I don’t like the common convention of using single letters (e.g. T and U) in templates, and find that it makes code much harder to read for me personally, similarly to how I feel about single letter variable names. This strongly-held conviction has been challenged in some cases by working with type traits, which I discuss below.)

Type Names

Some of the type trait information that I have needed are expressions that are types. The standard uses a _t suffix for this, which is a helper alias for a struct’s nested type, but I have never loved this convention. In my code so far I have used a t prefix for cases like these, and hidden the associated struct in a detail namespace:

// tNoConst -> Type without a const qualifier
namespace Types::detail
{
	template<typename T> struct sNoConst { using type = T; };
	template<typename T> struct sNoConst<const T> { using type = T; };
}
template<typename T>
using tNoConst = typename Types::detail::sNoConst<T>::type;

// An example of this being used:
const int someConstVariable = 0;
tNoConst<decltype(someConstVariable)> someMutableVariable;
someMutableVariable = 0;

This use of the t prefix is not really any different from how I had already named types, and I find that it fits in naturally when used in code. Eagle-eyed readers may notice, however, that I am using a single T as a type, which I had mentioned is something that I strongly dislike and claimed that I don’t do!

In all of my previous template programming I have always been able to come up with some name that made sense in context, even if it was sometimes generic like tValue. While working on these type traits, however, I realized that there are cases (like shown above) where the type really could be any type. I considered tType, but that seemed silly. I considered tAny, and I might still end up changing my mind and refactoring the code to that (or something similar). For now, though, I have capitulated and am using just the T for fully generic fundamental type trait building blocks like the ones discussed in this post (in other code, though, I still intend to strictly adhere to my give-the-type-a-name rule).

Value Names

Some of the type trait information that I have needed are expressions that are values. The standard uses a _v suffix for this, but I have never loved this convention. In fact, I’m not sure that I really understand this convention; unlike with _t where there needs to be some underlying struct for the metaprogramming implementation to work it doesn’t seem like values need this (at least, the ones that I have recreated so far haven’t needed an underlying struct).

I did struggle a bit with how to name these, however. My existing coding convention would prefix global variables names with g_ (that will have to be discussed in a separate post), but these type trait variables feel different from traditional global variables to me. In my mind they conceptually feel more like functions than variables, but functions that I call with <> instead of (). I wanted some alternate convention to make them visually distinct from standard variables.

After some experimentation I eventually settled on keeping the v from the standard but making it a prefix instead of a suffix, and I have been pretty happy so far with the result:

// vIsConst -> Whether a type is const-qualified
template<typename T>
constexpr bool vIsConst = false;
template<typename T>
constexpr bool vIsConst<const T> = true;

// An example of this being used:
if constexpr (vIsConst<decltype(someVariable)>)
{
	// Stuff
}

This convention has added a new member to my pantheon of prefixes, but it has felt natural and like a worthy addition so far. As an additional unexpected bonus it has also given me a new convention for naming template non-type parameters:

// My new convention:
template<typename tSomeType, bool vSomeCondition>
class cMyClass1;

// My previous convention, which I never loved:
template<typename tSomeType, bool tSomeCondition>
class cMyClass2;

Having a new way of unambiguously specifying compile-time values has improved the readability of my code for me.

Concept Names

I have encountered one case where I wanted to make a named constraint and I had to think about what to name it. I don’t have enough experience yet to know whether my initial attempt is something that I will end up liking, but this is what I have come up with:

// rIsBaseOf -> Enforces vIsBaseOf
template<typename tDerived, typename tBase>
concept rIsBaseOf = vIsBaseOf<tBase, tDerived>;

// An example of this being used:
template<rIsBaseOf<iBaseClass> tSomeClass>
class cMyConstrainedClass;

I couldn’t use c for “constraint” or “concept” because that was already taken for classes. I finally settled on r for “restraint” (kind of like a mix of “constraint” and “restrict”, with a suggestion of requires) and I don’t hate it so far but I also don’t love it. It feels like it is good enough to do the job for me in my own code, but it also feels like maybe there’s a better convention that I haven’t thought of yet.
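
For reference, the vIsBaseOf that this concept builds on isn't shown in this post. A plausible sketch (my assumption, not the actual implementation) is to lean on the __is_base_of compiler intrinsic, which MSVC, Clang, and GCC all provide and which avoids needing the standard library:

// A hedged sketch of the underlying value trait (not the post's actual code);
// the argument order matches how rIsBaseOf uses it above
template<typename tBase, typename tDerived>
constexpr bool vIsBaseOf = __is_base_of(tBase, tDerived);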

Application with an Empty Window

Behold, the main window of an application using my nascent engine:

It does as little as one might expect from the screenshot, but some of the architecture might be of interest.

Build System

This section shows parts of the jpmake project file.

Platforms and Configurations

The different possible platforms and configurations are defined as follows:

-- Platforms
------------

-- Define the platforms by name and type

platform_windows = DefinePlatform("windows", "win64")
local platform_windows = platform_windows

-- Get the specified platform to execute tasks for
platform_target = GetTargetPlatform()
local platform_target = platform_target

-- Configurations
-----------------

-- Define the configurations by name

configuration_unoptimized = DefineConfiguration("unoptimized")
local configuration_unoptimized = configuration_unoptimized

configuration_optimized = DefineConfiguration("optimized")
local configuration_optimized = configuration_optimized

configuration_profile = DefineConfiguration("profile")
local configuration_profile = configuration_profile

configuration_release = DefineConfiguration("release")
local configuration_release = configuration_release

-- Get the specified configuration to execute tasks for
local configuration_current = GetCurrentConfiguration()

I am creating global variables for each platform and configuration so that they are accessible by other Lua files and then immediately assigning them to local variables so that they are cheaper to use in this file. (I currently only have the single Lua project file, but soon I will want to change this and have separate files that can focus on different parts of the project.)

At the moment I am specifying any platform-specific information using a strategy like if platform_target == platform_windows and that works fine (there are several examples later in this post), but I am considering defining something like isWindows = platform_target == platform_windows instead. There won’t be many platforms (only one for the foreseeable future!) and it seems like it would be easier to read and write many platform-specific things with a single boolean rather than with a long comparison. I am doing something similar with the configurations where I define booleans that serve as classification descriptions, and so far it feels nice to me (again, there are examples later in this post).

Directory Structure

The directory structure from the perspective of jpmake is currently defined as follows:

-- Source Files
do
	SetEnvironmentVariable("engineDir", "Engine/")
	SetEnvironmentVariable("appsDir", "Apps/")
end
-- Generated Files
do
	-- Anything in the temp directory should be generated by jpmake executing tasks
	-- and the entire folder should be safely deletable.
	-- Additionally, any files that are not part of the Git repository
	-- should be restricted to this folder.
	SetEnvironmentVariable("tempDir", ConcatenateTable{"temp/", platform_target:GetName(), "/", configuration_current, "/"})
	-- The intermediate directory is for files that must be generated while executing tasks
	-- but which aren't required to run the final applications
	SetEnvironmentVariable("intermediateDir", "$(tempDir)intermediate/")
	SetEnvironmentVariable("intermediateDir_engine", "$(intermediateDir)engine/")
	-- The artifact directory is where jpmake saves files
	-- that it uses to execute tasks
	SetArtifactDirectory("$(intermediateDir)jpmake/")
	-- The staging directory contains the final applications that can be run
	-- independently of the source project and intermediate files
	SetEnvironmentVariable("stagingDir", "$(tempDir)staging/")
end

As a general rule I don't like abbreviations in variable or function names, but I decided to keep the "dir" convention from Visual Studio: these environment variable names will be used so frequently in paths that it seems like a reasonable exception to keep things shorter and more readable. (I did, however, decide to change the first letter to lowercase, which fits my variable naming convention better.)

An issue that I have run into in the past is having trouble deciding how to name directory environment variables so that they distinguish between source and generated files. With games, where there can be code and assets both for the engine and the application, the possible choices are even more complex (and, to make matters worse, with this project I intend to support multiple applications using the engine, so there is yet a further distinction that must be made). What I have will likely change as time goes on and I write more code, but it feels like a good start. The root repository folder looks like this:

Any files that are generated by the build process are kept quarantined in a single folder (temp/) so that the distinction between source and deletable files is very clear. This is very important to me (as anyone who has worked with me can attest). The temp directory looks like the following, expanded for one platform and configuration:

With such a simple application the only thing in the staging directory is the executable file, but when I develop more complicated applications there will be other files in staging directories (e.g. the assets that the game loads).

One further consideration that is currently missing is what to do with “tools”, those programs that are used during development (either for authoring content or as part of the build process) but that don’t get released to end users. I can imagine that I might want to update some of this directory structure when I start developing tools.

C++ Configuration

The next section in the jpmake project file configures the default settings for how C++ is built for the current target platform and build configuration:

-- C++
------

-- Initialize C++ for the current platform and configuration
cppInfo_common = CreateCppInfo()
local cppInfo_common = cppInfo_common
do
	-- #define VLSH_PLATFORM_SOMENAME for conditional compilation
	do
		local platform_define_suffix
		if (platform_target == platform_windows) then
			platform_define_suffix = "WINDOWS"
		else
			platform_define_suffix = "NONE"
		end
		cppInfo_common:AddPreprocessorDefine(("VLSH_PLATFORM_" .. platform_define_suffix),
			-- There isn't any anticipated reason to check anything other than whether the platform is #defined,
			-- but the name is used as a value because why not?
			platform_target:GetName())
	end
	-- The project directory is used as an $include directory
	-- so that directives like the following can be done to show scope:
	--	#include <Engine/SomeFeature/SomeHeader.hpp>	
	cppInfo_common:AddIncludeDirectory(".")
	local isOptimized = configuration_current ~= configuration_unoptimized
	cppInfo_common:AddPreprocessorDefine("VLSH_CONFIGURATION_ISOPTIMIZED", isOptimized)
	local isForProfiling = configuration_current == configuration_profile
	cppInfo_common:AddPreprocessorDefine("VLSH_CONFIGURATION_ISFORPROFILING", isForProfiling)
	local isForRelease = configuration_current == configuration_release
	cppInfo_common:AddPreprocessorDefine("VLSH_CONFIGURATION_ISFORRELEASE", isForRelease)
	do
		local areAssertsEnabled = not isForRelease and not isForProfiling
		cppInfo_common:AddPreprocessorDefine("VLSH_ASSERT_ISENABLED", areAssertsEnabled)
	end
	cppInfo_common.shouldStandardLibrariesBeAvailable = false
	cppInfo_common.shouldPlatformLibrariesBeAvailable = false
	cppInfo_common.shouldExceptionsBeEnabled = false
	cppInfo_common.shouldDebugSymbolsBeAvailable =
		-- Debug symbols would also have to be available for release in order to debug crashes
		not isForRelease
	if platform_target == platform_windows then
		cppInfo_common.VisualStudio.shouldCRunTimeBeDebug = not isOptimized
		cppInfo_common.VisualStudio.shouldIncrementalLinkingBeEnabled =
			-- Incremental linking speeds up incremental builds at the expense of bigger executable size
			not isForRelease
		-- Warnings
		do
			cppInfo_common.VisualStudio.shouldAllCompilerWarningsBeErrors = true
			cppInfo_common.VisualStudio.shouldAllLibrarianWarningsBeErrors = true
			cppInfo_common.VisualStudio.compilerWarningLevel = 4
		end
	end
end

This shows the general approach I am taking towards configuring things (both from the perspective of the game engine and also from the perspective of jpmake and my personal ideal way of configuring software builds). The named configurations (e.g. unoptimized, optimized, profile, release) that I defined earlier are just arbitrary names from the perspective of jpmake and don’t have any semantics associated with them. Instead it is up to the user to specify how each configuration behaves. I can imagine that this would be seen as a negative for most people, but I have a personal issue where I generally prefer to have full control over things.

This section should not be understood as being complete (most notably, there aren't actually any optimization-related settings except for which C run-time to use!), but that is because I haven't implemented all of the relevant Visual Studio options in jpmake yet.
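
As a hypothetical illustration of how these defines might be consumed on the C++ side (the usage below is my assumption; only the macro names come from the jpmake file above, and I'm assuming the boolean values expand to something the preprocessor can evaluate):

// Platform checks only test whether the platform is #defined,
// matching the comment in the jpmake file
#if defined(VLSH_PLATFORM_WINDOWS)
	// Windows-specific code path
#endif

// Configuration classifications can gate entire features,
// e.g. only compiling the assert machinery when asserts are enabled
#if VLSH_ASSERT_ISENABLED
	// ...full assert implementation...
#endif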

Engine Static Library

Below is one example of a static library that I have made, which provides base classes for applications (meaning that an actual application can inherit from the provided framework):

do
	local task_application = CreateNamedTask("Application")
	local cppInfo_application = cppInfo_common:CreateCopy()
	do
		if (platform_target == platform_windows) then
			cppInfo_application.shouldPlatformLibrariesBeAvailable = true
		end
	end
	engineLibrary_application = task_application:BuildCpp{
			target = "$(intermediateDir_engine)Application.lib", targetType = "staticLibrary",
			compile = {
				"$(engineDir)Application/iApplication.cpp",
				"$(engineDir)Application/iApplication_windowed.cpp",
				platform_target == platform_windows and "$(engineDir)Application/iApplication_windowed.win64.cpp" or nil,
			},
			link = {
				engineLibrary_assert,
				platform_target == platform_windows and CalculateAbsolutePathOfPlatformCppLibrary("User32.lib", cppInfo_application) or nil,
			},
			info = cppInfo_application,
		}
end
local engineLibrary_application = engineLibrary_application

My current plan is to have the “engine” consist of a collection of static libraries that all get linked into the single application executable.

This named task shows a file that is only compiled for Windows (iApplication_windowed.win64.cpp; my convention is to put as much platform-specific code as possible in separate platform-specific CPP files, with the platform name as a sub-extension). It also shows a Windows library that is only needed for linking on that platform (User32.lib) and another static library that this one depends on (engineLibrary_assert, which was defined earlier but isn't shown in this blog post).

As more files get created that are specific to one platform or another I think my style will have to change to make it less annoying to conditionally specify each one.
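
To give an idea of what this looks like from the application side, here is a purely hypothetical sketch (the header path, class name, and hooks are my own illustration based on the naming conventions in this post, not the engine's actual API):

// Hypothetical application code; only the <Engine/...> include style and the
// i/c class prefixes come from the post, everything else is an assumption
#include <Engine/Application/iApplication_windowed.hpp>

class cEmptyWindowApplication final : public iApplication_windowed
{
	// Override whatever virtual hooks the abstract base class declares
	// (initialization, per-frame update, shutdown, and so on);
	// an empty window needs almost nothing here
};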

Applications

Finally, the two proof-of-concept applications that I have created are defined as follows:

-- Hello World
--============

do
	do
		SetEnvironmentVariable("appDir", "$(appsDir)HelloWorld/")
		SetEnvironmentVariable("stagingDir_app", "$(stagingDir)HelloWorld/")
	end
	local cppInfo_helloWorld = cppInfo_common:CreateCopy()
	do
		-- For std::cout
		cppInfo_helloWorld.shouldStandardLibrariesBeAvailable = true
	end
	do
		local helloWorld_task = CreateNamedTask("HelloWorld")
		local application_subTask = helloWorld_task:BuildCpp{
				target = "$(stagingDir_app)HelloWorld.exe", targetType = "consoleApplication",
				compile = {
					"$(appDir)EntryPoint.cpp",
				},
				info = cppInfo_helloWorld,
			}
		helloWorld_task:SetTargetForIde(application_subTask)
	end
end

-- Empty Window
--=============

do
	do
		SetEnvironmentVariable("appDir", "$(appsDir)EmptyWindow/")
		SetEnvironmentVariable("stagingDir_app", "$(stagingDir)EmptyWindow/")
	end
	local cppInfo_emptyWindow = cppInfo_common:CreateCopy()
	do
		cppInfo_emptyWindow:AddIncludeDirectory("$(appDir)")
	end
	do
		local emptyWindow_task = CreateNamedTask("EmptyWindow")
		local application_subTask = emptyWindow_task:BuildCpp{
				target = "$(stagingDir_app)EmptyWindow.exe", targetType = "windowedApplication",
				compile = {
					"$(appDir)EntryPoint.cpp",
				},
				link = {
					engineLibrary_application,
				},
				info = cppInfo_emptyWindow,
			}
		emptyWindow_task:SetTargetForIde(application_subTask)
	end
end

These show the general approach towards making executable applications that I am envisioning, although both are as simple as possible.

One idiom that I discovered is reusing the same environment variable names but setting them to different values for different applications. This allowed the names to be shorter and thus more readable (before this I had different versions with _helloWorld and _emptyWindow), but I don’t have enough experience to decide if this will work well long term.

The examples also show calls to SetTargetForIde(), which has no effect when executing tasks but is instead used when generating the solution files so that Visual Studio will correctly have its $(TargetPath) set, which makes setting up debugging easier.

Visual Studio Solution

It is now possible for jpmake to generate Visual Studio solution and project files. I did this work to make it easier to write code and debug in Visual Studio. The Solution Explorer currently looks like the following for the jpmake project that I have been showing in this post:

And the properties of the EmptyWindow project have some things filled in:

Before working on the engine code I had to spend more time than I had initially anticipated on generating these files and on other jpmake features, because being able to debug felt like a requirement. With the way it works now, however, I was able to write the empty window application and things worked reasonably well.

I did have one discouraging realization, however, which is that Intellisense doesn't work yet. I was able to complete the empty window application without it, but doing so was more tedious than I would have anticipated. I think I need to take some more time to improve jpmake so that Intellisense works at least somewhat, because not having it has proven to be a real impediment.

Devlog 2024-11-11

This is a regularly-occurring status update. More generally-relevant posts can be found under Features (see Creating a Game and Engine from Scratch for context).

This is the beginning of week 4.

What I have done

  • A jpmake project can now generate Visual Studio IDE files
    • A solution file is created with project files corresponding to every named task
    • The appropriate command line is generated when a project gets built. The project's output target can also be reported to Visual Studio if the Lua jpmake file specifies it, e.g.:
      • cppNamedTask:SetTargetForIde(myApplication)
    • Additionally, there are two special projects that are generated:
      • Do [JPMAKEPROJECTNAME]
        • This does all tasks, and it is the only Visual Studio project that is configured to build when the entire Visual Studio solution is built
      • Regenerate Solution
        • This can be built when there are changes to the jpmake project file
    • With these improvements it is possible to reasonably work with a jpmake project in Visual Studio. There are still several obvious and important missing features (and many more that I could think of that would be nice to have), but I think the status is good enough to start development and to add more features as their absence proves to be a problem.
  • It is now possible to specify include directories
    • This seemed important in order to be able to #include engine header files from application code
  • I have added options for generating debug info/symbols
    • After experimenting with debugging in Visual Studio this clearly was important
  • I have added options for handling warnings
    • Specifically, being able to set warning levels and to treat warnings as errors
  • I have created a new git repository to contain the actual engine and application code
    • I have integrated jpmake into this as a submodule
      • This is not what I had initially intended; I had anticipated just having a jpmake.exe in the repository. It seems, however, that there will probably be frequent jpmake changes in response to me actually working on real projects and realizing that I want or need more features (especially early on), and so a submodule started to make more sense to me than a frequently-changing EXE binary file.
    • I have set up a Windows platform and three different configurations
      • For now I have “unoptimized”, “optimized”, and “release”
    • I have set up some initial directory structure, although I'm sure my mind will change about the best way to organize things and what environment variables to use as I start authoring actual engine and application code
  • I have created a HelloWorld console application
    • This doesn’t do anything except print “Hello, world!” and so it doesn’t represent any interesting progress over the more complicated examples I had last week
    • It does, however, live within an actual project and so represents the first application in this new game engine repository

Next Steps

  • I need to make an application that displays an empty window
    • This should almost entirely use engine code, where the only thing that the application does is derive an application class from an abstract base application class and then pass on command arguments
  • The next step would be to initialize Direct3D 12 and clear the back buffer to some color
    • This might take more time than it otherwise would because I also want to implement some mechanisms for manual memory management