For the past several months I've been refactoring the Unreal Engine 4 automation framework off and on. One of the things we wanted to bring to the framework this time around was image comparison.
However, comparing rendered images can be tricky because of differences caused by the following:
- Hardware Abstraction Layer
- Feature Level
- Hardware Specific Features
- Floating Point Precision
- ...and, well, any other source of non-determinism.
All these things make it difficult to compare rendered output. You could simplify matters by only testing on one kind of machine, but that's a pretty unrealistic testing environment.
I started by porting the comparison method from Resemble.js, a straightforward JavaScript image comparison library, to C++. It supports per-channel and brightness tolerances. It also does a neighbor similarity check to try to account for anti-aliasing. A similar library that looks like it might have a few more features is Blink-Diff, which I found later.
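As a rough sketch of what per-channel and brightness tolerances can look like in C++ (this is not the Resemble.js code or the actual UE4 port; the `Pixel` and `Tolerance` types, the default values, and the use of Rec. 601 luma weights for brightness are all assumptions made for illustration):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdlib>

// Hypothetical 8-bit RGBA pixel; not a type from Resemble.js or UE4.
struct Pixel {
    std::uint8_t r, g, b, a;
};

// Hypothetical tolerance settings; the defaults are placeholders.
struct Tolerance {
    int red = 16;
    int green = 16;
    int blue = 16;
    int alpha = 16;
    double brightness = 16.0;
};

// Perceived brightness using the Rec. 601 luma weights (an assumption).
inline double Brightness(const Pixel& p) {
    return 0.299 * p.r + 0.587 * p.g + 0.114 * p.b;
}

// True when every channel of the two pixels is within its own tolerance.
inline bool ChannelsSimilar(const Pixel& x, const Pixel& y, const Tolerance& t) {
    return std::abs(int(x.r) - int(y.r)) <= t.red
        && std::abs(int(x.g) - int(y.g)) <= t.green
        && std::abs(int(x.b) - int(y.b)) <= t.blue
        && std::abs(int(x.a) - int(y.a)) <= t.alpha;
}

// True when the two pixels differ in brightness by no more than the tolerance.
inline bool BrightnessSimilar(const Pixel& x, const Pixel& y, const Tolerance& t) {
    return std::fabs(Brightness(x) - Brightness(y)) <= t.brightness;
}
```

A pixel that fails the chosen check would be counted as an error pixel, which is the input to everything that follows.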
The first mistake I made was comparing the pixels across the whole image and generating a single percent difference. Looking at the picture below, you can immediately see the problem with that approach.
The black pixels only represent a 1.85% difference in the image. The global error threshold I defaulted to was 2%; anything below that wasn't considered a problem. Lowering the threshold to 1% would have worked here, but I wanted to keep a large enough margin to avoid false positives from the usual non-deterministic differences. 2% might still be too high, and I may lower it anyway, but the point stands: if a material effect breaks, it may only create localized distortions.
To solve this problem I ended up breaking the images up into 100 blocks (a spatial hash). I then accumulated error per block as well as globally, which produces blocks with 30%-40% error in the sample above. That was plenty to exceed my new maximum allowed block error of 10%.
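Here is a minimal sketch of accumulating error per block alongside the global error. It assumes the 100 blocks are a uniform 10x10 grid and that a per-pixel mismatch mask has already been computed; `ErrorSummary`, `SummarizeError` and `ExceedsLimits` are hypothetical names, not the UE4 API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ErrorSummary {
    double globalError = 0.0;    // fraction of mismatched pixels in the whole image
    double maxBlockError = 0.0;  // highest fraction of mismatched pixels in any block
};

// mismatch holds one entry per pixel in row-major order; non-zero means the
// pixel differed. gridSize = 10 gives 10x10 = 100 blocks (an assumed layout).
ErrorSummary SummarizeError(const std::vector<std::uint8_t>& mismatch,
                            int width, int height, int gridSize = 10) {
    ErrorSummary out;
    if (width <= 0 || height <= 0 || gridSize <= 0 ||
        mismatch.size() != static_cast<std::size_t>(width) * height) {
        return out;
    }
    std::vector<int> blockErrors(static_cast<std::size_t>(gridSize) * gridSize, 0);
    std::vector<int> blockPixels(static_cast<std::size_t>(gridSize) * gridSize, 0);
    long long totalErrors = 0;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Map the pixel to its block; both indices land in [0, gridSize).
            const int bx = x * gridSize / width;
            const int by = y * gridSize / height;
            const std::size_t block = static_cast<std::size_t>(by) * gridSize + bx;
            ++blockPixels[block];
            if (mismatch[static_cast<std::size_t>(y) * width + x]) {
                ++blockErrors[block];
                ++totalErrors;
            }
        }
    }
    out.globalError = static_cast<double>(totalErrors) /
                      (static_cast<double>(width) * height);
    for (std::size_t i = 0; i < blockErrors.size(); ++i) {
        if (blockPixels[i] > 0) {
            out.maxBlockError = std::max(
                out.maxBlockError,
                static_cast<double>(blockErrors[i]) / blockPixels[i]);
        }
    }
    return out;
}

// Fails the comparison if either limit is exceeded (2% global, 10% per block).
bool ExceedsLimits(const ErrorSummary& s, double maxGlobal = 0.02,
                   double maxBlock = 0.10) {
    return s.globalError > maxGlobal || s.maxBlockError > maxBlock;
}
```

For example, a 5x5 patch of bad pixels in a 100x100 image is only 0.25% of the image globally, but 25% of the block it falls in, so the block limit catches it even though the global limit does not.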
This still wasn't ideal. Depending on how the error shows up in the image, it can be spread across enough blocks, in just such a way, that it doesn't exceed the maximum error in any one block.
The problem with the block error is that it assumes a particular shape, when error could come in any shape. Imagine a particularly faulty outline shader: it might be very broken, but because of the shape of the error it might not trigger either the local or the global limit.
One idea I've been batting around is some kind of clustering error. Pick a small radius, say 3px, and then every error pixel that can touch another error pixel within that radius merges with it into a cluster. The benefit is that I can make tighter assumptions about error limits with clustering, because it lets me say: if you find an error cluster that is smaller than the global limit but not insignificant (maybe 0.05% of total pixels), treat it as a failure.
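One possible way to sketch this clustering idea is with a union-find structure: every error pixel is merged with any other error pixel within the radius, and the size of the largest resulting cluster is reported. This is only an illustration of the idea, not something that exists in the engine; `DisjointSet` and `LargestErrorCluster` are hypothetical names, and using Euclidean distance for "within the radius" is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Minimal union-find with path halving and union by size.
struct DisjointSet {
    std::vector<int> parent;
    std::vector<int> count;
    explicit DisjointSet(int n) : parent(n), count(n, 1) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    int Find(int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }
    void Union(int a, int b) {
        a = Find(a);
        b = Find(b);
        if (a == b) return;
        if (count[a] < count[b]) std::swap(a, b);
        parent[b] = a;
        count[a] += count[b];
    }
};

// Returns the pixel count of the largest cluster of mismatched pixels, where
// two mismatched pixels join the same cluster if they are within `radius`.
int LargestErrorCluster(const std::vector<std::uint8_t>& mismatch,
                        int width, int height, int radius = 3) {
    if (width <= 0 || height <= 0 || radius < 0 ||
        mismatch.size() != static_cast<std::size_t>(width) * height) {
        return 0;
    }
    DisjointSet sets(width * height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (!mismatch[y * width + x]) continue;
            // Only look forward (right on this row, then rows below); pixels
            // behind this one have already linked to it if they were in range.
            for (int dy = 0; dy <= radius; ++dy) {
                for (int dx = (dy == 0 ? 1 : -radius); dx <= radius; ++dx) {
                    if (dx * dx + dy * dy > radius * radius) continue;
                    const int nx = x + dx;
                    const int ny = y + dy;
                    if (nx < 0 || nx >= width || ny >= height) continue;
                    if (mismatch[ny * width + nx]) {
                        sets.Union(y * width + x, ny * width + nx);
                    }
                }
            }
        }
    }
    int largest = 0;
    for (int i = 0; i < width * height; ++i) {
        if (mismatch[i] && sets.Find(i) == i) {
            largest = std::max(largest, sets.count[i]);
        }
    }
    return largest;
}
```

The result could then be compared against a cluster limit, for example by dividing it by `width * height` and checking it against the 0.05% figure mentioned above.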
Demultiplexing The G-Buffer
One of the things the Unreal Engine automation screenshot comparison system now supports is using any of the G-Buffers as input. The reason is that while the final color matters a lot, testing the individual buffers before they are factored into the final pixel color may detect errors sooner. A difference that is obvious when you look at, say, the Ambient Occlusion buffer in isolation may not show up clearly when comparing the final pixel color.
I'm considering just adding a checkbox that makes the screenshot test take a shot of every G-Buffer and compare them all for a given scene. It would be a real space hog but super handy for testing some advanced rendering features in a lot of dimensions easily.
Metadata & Alternatives
The part I'm hoping makes the approach I'm taking long lasting is the metadata I store for every image and the ability to store alternatives.
So I store the images like this:
Under the test folder, they're put into a folder whose name is made up of PLATFORM_RHI_SHADERMODEL.
This broadly separates the images based on at least the most significant contributors to differences.
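To make the naming scheme concrete, a folder name of that form could be assembled like this. The function name and the example values are purely illustrative and are not taken from the engine source.

```cpp
#include <string>

// Builds a PLATFORM_RHI_SHADERMODEL style folder name from its three parts.
// Hypothetical helper for illustration only.
std::string MakeShotFolderName(const std::string& platform,
                               const std::string& rhi,
                               const std::string& shaderModel) {
    return platform + "_" + rhi + "_" + shaderModel;
}
```

With illustrative inputs such as `"Windows"`, `"D3D11"` and `"SM5"`, this yields `Windows_D3D11_SM5`.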
The files themselves are based on a unique identifier for the hardware, so there is an assumption right now that results are stable for a given piece of hardware. If the need arises for multiple images for the same hardware, I would hash additional things into the unique id for the shot.
Due to the non-deterministic nature of the shots, one feature I ended up adding, which may or may not end up being valuable, is the concept of alternatives. In the event two shots are both right, the system permits additional shots to be added as ground truth, and at comparison time it chooses the shot whose metadata most closely matches to compare against. I'll just need to see how that option evolves. It may end up being a quick way to deal with sudden changes that eventually need additional high-level options baked into the rough separation of shot groups.
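The post doesn't spell out how "closest in terms of metadata" is scored, so the following is just one plausible sketch: treat metadata as key/value pairs, count how many pairs match exactly, and pick the alternative with the highest count. `Metadata`, `MatchScore` and `ClosestAlternative` are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical flat key/value view of a shot's metadata.
using Metadata = std::map<std::string, std::string>;

// Counts the keys that are present in both maps with identical values.
int MatchScore(const Metadata& a, const Metadata& b) {
    int score = 0;
    for (const auto& kv : a) {
        const auto it = b.find(kv.first);
        if (it != b.end() && it->second == kv.second) {
            ++score;
        }
    }
    return score;
}

// Returns the index of the alternative whose metadata best matches `current`,
// or -1 if there are no alternatives. Ties go to the earliest alternative.
int ClosestAlternative(const Metadata& current,
                       const std::vector<Metadata>& alternatives) {
    int best = -1;
    int bestScore = -1;
    for (std::size_t i = 0; i < alternatives.size(); ++i) {
        const int score = MatchScore(current, alternatives[i]);
        if (score > bestScore) {
            bestScore = score;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

A real scoring function would probably weight some fields more heavily than others; equal weighting is used here only to keep the sketch short.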
The thing I'm hoping saves a lot of headaches is having a JSON file per shot containing the shot metadata. In addition to the per-test constraints, it has, and will have more, information about the features and rendering options currently enabled, plus things like driver version. In the diffing tool we can then highlight changes to a machine as a possible cause of differences.
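As a sketch of how a diffing tool might find the fields to highlight, assuming the JSON has already been parsed into flat key/value pairs (the `ShotMetadata` alias, the `ChangedKeys` function and any key names used with it are hypothetical):

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical flat key/value view of a shot's parsed JSON metadata.
using ShotMetadata = std::map<std::string, std::string>;

// Lists keys whose value changed or that were removed, followed by keys that
// only exist in `current`.
std::vector<std::string> ChangedKeys(const ShotMetadata& baseline,
                                     const ShotMetadata& current) {
    std::vector<std::string> changed;
    for (const auto& kv : baseline) {
        const auto it = current.find(kv.first);
        if (it == current.end() || it->second != kv.second) {
            changed.push_back(kv.first);
        }
    }
    for (const auto& kv : current) {
        if (baseline.find(kv.first) == baseline.end()) {
            changed.push_back(kv.first);
        }
    }
    return changed;
}
```

If a shot starts failing and the only changed key is something like a driver version, that is a strong hint about where to look first.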
I looked at some other comparison approaches, starting with perceptual comparison algorithms like Structural Similarity (SSIM) and Perceptual Hashing, and even added a prototype SSIM approach to UE4. The problem with these approaches is that they may hide the existence of real errors just because a human couldn't see them in the examples.