Approval Testing

Abstract

An automated software testing technique is presented which spares us from having to stipulate our expectations in test code, and from having to fix test code each time our expectations change.

(Useful pre-reading: About these papers)

The Problem

The most common scenario in automated software testing is ensuring that given specific input, a component-under-test produces expected output. The conventional way of achieving this is by feeding the component-under-test with a set of predetermined parameters, obtaining the output of the component-under-test, comparing the output against an instance of known-good output which has been hard-coded within the test, and failing the test if the two are not equal.

This approach works, but it is inefficient, because during the development and evolution of a software system we often make changes to the production code fully anticipating that the output of certain components will change. Unfortunately, each time we do this, the tests fail, because they are still expecting the old output. So, each change in the production code must be followed by a round of fixing tests to make them pass.

Note that under Test-Driven Development things are not any better: first we modify the tests to start expecting the new output, then we observe them fail, then we modify the components to produce the new output, then we watch the tests pass. We still have to stipulate our expectations in test code, and we still have to change test code each time our expectations change, which is inefficient.

This imposes a considerable burden on the software development process. As a matter of fact, it often happens that programmers refrain from making needed changes to their software because they dread the prospect of having to fix all the tests that will break as a result of those changes.

Approval Testing is a technique for automated software testing which aims to correct all this.

The Solution

Under Approval Testing, the assertions that verify the correctness of the output of the component-under-test are abolished, and replaced with code that simply saves the output to a text file. This text file is known as the test output file.

The test may still fail if the component-under-test encounters an error while producing output, causing an exception to be thrown, in which case we follow a conventional test-troubleshoot-fix-repeat workflow. However, if no error is encountered, then the test completes successfully without examining the test output file.

The trick is that the test output file is saved right next to the source code file of the test, which means that it is kept under Version Control. In the most common case, each test run produces the exact same output as the previous run, so nothing changes, meaning that all is good. If a test run produces different output from a previous test run, then the Version Control System indicates that the test output file has been modified and is in need of committing. Thus, the developer cannot fail to notice that the test output has changed.
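
In code, the core of the mechanism is tiny. A minimal sketch in Python (the helper name `verify` and the `.approved.txt` file suffix are illustrative, not taken from any particular library):

```python
import pathlib

def verify(test_name: str, output: str, directory: str = ".") -> None:
    """Save the test output next to the test source, so that it is kept
    under version control and any change shows up as a modified file."""
    out_file = pathlib.Path(directory) / f"{test_name}.approved.txt"
    out_file.write_text(output, encoding="utf-8")

def test_render_invoice():
    # No assertion on the content: the test only records the output.
    invoice_text = "Invoice #1\nTotal: 10.00\n"  # stand-in for real output
    verify("test_render_invoice", invoice_text)
```

A test funnels its output through such a helper instead of asserting on it; the version control system does the rest.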

The developer can then utilize the "Compare with unmodified" feature of the Version Control System to see the differences between the test output that was produced by the modified code, and the test output of the last known-good test run. By visually inspecting these differences, the developer can decide whether they are as expected or not, according to the changes they made in the code.

  • If the observed differences are not as expected, then the developer needs to keep working on their code until they are.
  • If the observed differences are as expected, then the developer can simply commit the new code, along with the new test output file, and they are done.
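
What the "Compare with unmodified" step shows amounts to a plain unified diff between the committed output and the freshly produced one. A sketch using Python's standard `difflib`, with hypothetical file names and content:

```python
import difflib

approved = "order: 1\norder: 2\n"  # last known-good output (committed)
received = "order: 1\norder: 3\n"  # output of the current test run

# Produce the same kind of diff a version control system would show.
diff = "".join(difflib.unified_diff(
    approved.splitlines(keepends=True),
    received.splitlines(keepends=True),
    fromfile="test_orders.approved.txt",
    tofile="test_orders.received.txt",
))
print(diff)
```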

This way, we eliminate the following burdens:

  • Having to hard-code into the tests the output expected from the component-under-test.
  • Having to write code, in each test, which asserts that the output of the component-under-test matches the expected output.
  • Having to fix test code each time there is a fully expected change in the output of the component-under-test.

The eliminated burdens are traded for the following much simpler responsibilities:

  • The output of the component-under-test must be converted to text and written to a test output file.
  • When the version control system shows that a test output file changed after a test run, the differences must be reviewed, and a decision must be made as to whether they are as expected or not.
  • Production code and testing code must be written with some noise reduction concerns in mind. (More on that further down.)

This represents a considerable optimization of the software development process.

Note that the arrangement is also convenient for the reviewer, who can see both the changes in the code and the resulting changes in the test output files.

As an added safety measure, the continuous build pipeline may deliberately fail the tests if an unclean working copy is detected after running the tests, because that would mean that the tests produced different results from what was expected, or that someone failed to commit an updated test output file.
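
Such a pipeline check amounts to asking the version control system whether anything changed after the test run. A sketch, assuming Git, using `git status --porcelain`, which prints one line per modified or untracked file and nothing at all when the working copy is clean:

```python
import subprocess

def working_copy_is_clean(repo_dir: str = ".") -> bool:
    # `git status --porcelain` emits machine-readable status lines;
    # empty output means the working copy is clean.
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == ""
```

The continuous build would run this after the test suite and fail the build when it returns false.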

Noise reduction

For Approval Testing to work effectively, all test output must be completely free of noise. By noise we mean:

  • Two test runs of the exact same code producing different output.
  • A single change in the code producing wildly different output.

For example, if a test emits the username of the current user into the output file, then the output file generated by that test will be different for every user that runs it, which in turn means that for most people it will be different from the one that was committed.

Noise is undesirable, because:

  1. Needlessly modified test output files cause false alarms.
  2. Examining changes in test output files only to discover that they are due to noise is a waste of time.
  3. A change that might be important to notice may be lost in the noise.

Noise in test output files is most commonly caused by various sources of non-determinism, such as:

  • Wall-clock time.

    As the saying goes, the arrow of time is always moving forward. This means that the "current" time coordinate is always different from test run to test run, and this in turn means that if any wall-clock timestamps find their way into the test output, the resulting test output file will always be different from the previous run. So, for example, if your software generates a log, and you were thinking of using the log as your test output, then you will have to either remove the timestamps from the log, or fake them. Faking the clock for the purpose of testing is a well-known best practice anyway, regardless of approval testing. To accomplish this, create a "Clock" interface, and propagate it to every place in your software that needs to know the current time. Create two implementations of that interface: one for production, which queries the actual wall-clock time from the operating environment, and one for testing, which starts from some fixed, known origin and increments by a fixed amount each time it is queried.
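
A sketch of such a clock pair in Python (the class names are illustrative; the two classes simply expose the same `now()` method, playing the role of the "Clock" interface):

```python
import datetime

class SystemClock:
    """Production implementation: queries the real wall-clock time."""
    def now(self) -> datetime.datetime:
        return datetime.datetime.now()

class FakeClock:
    """Test implementation: starts at a fixed, known origin and advances
    by a fixed step on every query, so timestamps are deterministic."""
    def __init__(self, origin: datetime.datetime, step: datetime.timedelta):
        self._current = origin
        self._step = step

    def now(self) -> datetime.datetime:
        result = self._current
        self._current += self._step
        return result
```

Production code receives one of these clocks and never calls the system time directly, so tests emit identical timestamps on every run.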

  • Random number generation.

    Random number generators are usually pseudo-random, and we tend to make them practically random by seeding them with the wall-clock time. This can be easily fixed for the purpose of testing by seeding them with a known fixed value instead. Some pseudo-random generators seed themselves with the wall-clock time without allowing us to override this behavior; this is deplorable. Such generators must be faked in their entirety for the purpose of testing. This extends to any other constructs that employ random number generation, such as GUIDs/UUIDs: they must also be faked when testing, using deterministic generators.
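
A sketch of a deterministic GUID/UUID generator in Python, built on a seeded pseudo-random generator (the class name is illustrative):

```python
import random
import uuid

class DeterministicUuidGenerator:
    """Produces valid UUIDs from a seeded PRNG, so that identifiers
    appearing in test output are identical on every test run."""
    def __init__(self, seed: int = 42):
        self._rng = random.Random(seed)

    def next(self) -> uuid.UUID:
        # 128 deterministic bits, stamped as a version-4 UUID.
        return uuid.UUID(int=self._rng.getrandbits(128), version=4)
```

As with the clock, production code would receive a generator object, with the real one used in production and this one used when testing.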

  • Multi-threading.

    Multiple threads running in parallel tend to exhibit unpredictable timing irregularities, and result in a chaotically changing order of events. If these threads affect test output, then the order of the content of the test output file will be changing on every test run. For this reason, multi-threading must either be completely avoided when testing, or additional mechanisms (queuing, sorting, etc.) must be employed to guarantee a consistent ordering of test output.
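
A sketch of the sorting approach in Python: the workers still run in parallel, but their events are sorted before being emitted, so the output order no longer depends on thread scheduling:

```python
import threading

def run_workers(worker_ids):
    """Run workers in parallel, then sort the collected events, so that
    the emitted order does not depend on thread scheduling."""
    events = []
    lock = threading.Lock()

    def worker(worker_id):
        with lock:
            events.append(f"worker {worker_id} finished")

    threads = [threading.Thread(target=worker, args=(i,)) for i in worker_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(events)  # deterministic regardless of completion order
```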

  • Floating-point number imprecision.

    Floating-point calculations can produce slightly different results depending on whether optimizations are enabled or not. To ensure that the test output file is unaffected, any floating-point values emitted to it must be rounded to only as many digits as necessary; at the very least, to one digit fewer than their full precision.
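
A sketch in Python, formatting floats with a reduced, fixed number of significant digits before they reach the output file (the helper name and the choice of six digits are illustrative):

```python
def emit_float(value: float, digits: int = 6) -> str:
    """Format a float with a reduced, fixed number of significant digits,
    so that noise in the last bits never reaches the test output file."""
    return f"{value:.{digits}g}"
```

With full precision, `0.1 + 0.2` and `0.3` would produce different text; after rounding, they produce the same text.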

  • Other external factors.

    User names, computer names, file creation times, IP addresses resolved from DNS, etc. must either be prevented from finding their way into the test output file, or they must be faked when running tests. Fake your file-system; fake The Internet if necessary. For more information about faking stuff, see Testing with Fakes instead of Mocks.

In short, anything that would cause flakiness in software tests will cause noisiness in Approval Testing.

Additionally, the content of test output files can be affected by some constructs that are fully deterministic in nature. These constructs will never change the test output files without corresponding changes in the code, but they may produce drastically different test output files as a result of only minute changes in the code. For example:

  • Hash Table Rehashing.

    A hash table may decide to re-hash itself as a result of a single key addition, if that addition happens to cause some internal load factor threshold to be exceeded. Exactly when and how this happens depends on the implementation of the hash table and we usually have no control over it. After re-hashing, the order in which the hash table enumerates its keys is drastically different, and if the keys are emitted to the test output, then the contents of the test output file will be drastically different. To avoid this, replace plain hash tables with hash tables that retain the order of key insertion.
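
In Java, for example, this means replacing `HashMap` with `LinkedHashMap`. An alternative that yields a stable order regardless of the map implementation is to sort the keys at the point of emission; a sketch in Python:

```python
def emit_mapping(mapping: dict) -> str:
    # Emitting keys in sorted order makes the output independent of both
    # insertion order and any internal rehashing of the mapping.
    return "\n".join(f"{key}={mapping[key]}" for key in sorted(mapping))
```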

  • Insufficient Sorting Keys.

    When sorting data, the order of items with identical keys is undefined. It is still deterministic, but the addition or removal of a single item can cause all items with the same sorting key to be arbitrarily rearranged. To avoid this, always sort with a full set of sorting keys, so that every item receives a unique, well-defined position. Introduce additional sorting keys if necessary, even if you would not normally have a use for them.
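
A sketch of the full-sorting-key approach in Python (the record fields are made up for illustration):

```python
records = [
    {"last": "Smith", "first": "Bob", "id": 3},
    {"last": "Smith", "first": "Ann", "id": 7},
    {"last": "Jones", "first": "Cid", "id": 5},
]

# Sorting by last name alone leaves the two Smiths in an arbitrary relative
# order; adding first name and a unique id gives every record one fixed place.
fully_sorted = sorted(records, key=lambda r: (r["last"], r["first"], r["id"]))
```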

Noise reduction aims to ensure that:

  • there will never be any changes in the test output unless there have been some changes in the code, and
  • for every unique change in the code we will see a specific expected set of changes in the test output, instead of a large number of irrelevant changes.

This ensures that the single change in the test output that matters will not be lost in the noise, and makes it easier to determine that the modifications we made to the code have exactly the intended consequences and not any unintended consequences.

Note that in some cases, noise reduction can be implemented in the tests rather than in the production code. For example, instead of replacing a plain hash table with an ordered hash table in production code, our test can obtain the contents of the plain hash table and sort them before writing them to the test output file. However, this may not be possible in cases where the hash table is several transformations away from the test code. Thus, replacing a plain hash table with an ordered hash table may sometimes be necessary in production code.

Noise reduction in production code can be either always enabled, or only enabled during testing. The most performant choice is to only have it enabled during testing, but the safest choice is to have it always enabled.

Failure Testing

Failure Testing is the practice of deliberately supplying the component-under-test with invalid input and ensuring that the component-under-test detects the error and throws the appropriate exception, filled in with the appropriate details. Such scenarios can leverage Approval Testing by simply catching exceptions and serializing them, as text, into the test output file.
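
A sketch of a failure-testing helper in Python, which catches the exception and renders its type and details as text for the output file (the helper name is illustrative):

```python
def verify_failure(action) -> str:
    """Invoke the action, expect it to throw, and serialize the resulting
    exception as text for inclusion in the test output file."""
    try:
        action()
    except Exception as exception:
        return f"{type(exception).__name__}: {exception}"
    return "ERROR: expected an exception, but none was thrown"
```

Reviewing the serialized exception in the diff then replaces asserting on its type and message in test code.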

Applicability

Approval Testing is most readily useful when the component-under-test produces results as text, or results that are directly translatable to text. With a bit of effort, any kind of output can be converted to text, so Approval Testing is universally applicable.

Must test output files be committed?

It is in theory possible to refrain from keeping test output files in the source code repository, but doing so would have the following disadvantages:

  • It would deprive the code reviewer of the convenience of being able to see not only the changes in the code, but also the differences that these changes have introduced in the output of each test.
  • It would require the developer to remember to run the tests immediately after each pull from the source code repository, so that the unmodified test output files are produced locally before any code modifications further modify them.
  • It would make it more difficult for the developer to take notice when the test output files change.
  • It would make it more difficult for the developer to see diffs between the modified test output files and the unmodified ones.

Of course all of this could be taken care of with some extra tooling. What remains to be seen is whether the effort of developing such tooling can be justified by the mere benefit of not having to store test output files in the source code repository.

Conclusion

Approval Testing is a universally applicable technique for automated software testing which can significantly reduce the effort of writing and maintaining tests by sparing us from having to stipulate our expectations in test code, and from having to fix test code each time our expectations change.

Addendum: History

I independently invented this testing technique back in 2017, when I was writing Java code that was generating bytecode and I desperately needed my tests to stop failing each time I wanted to start producing a slightly different bytecode sequence accomplishing the same thing in a slightly better way.

In the beginning I had not given this technique any name, it was just an awesome trick employed by some of my tests that greatly improved the efficiency of testing.

Then, a few years later, probably around 2022, a colleague at work asked me for advice on how to go about testing some C# code that was producing G-code. Seeing the similarity between what he was doing and what I had done in the past, I described my technique to him. He went on to implement it, and he was quite happy with it, but when I looked at what he had done later, I did not like the terminology he had come up with. So, I decided to come up with a set of terms that I considered appropriate, and to write a blog post describing my idea in writing, for reference by all future generations of software developers in all eternity.

I decided to call the technique "Audit Testing" and the test output files "Audit Files".

Then, after a couple of more years, I discovered that my testing technique had (of course) already been thought of by others, and it already had a name: "Approval Testing". I found out about this while watching videos from the Modern Software Engineering channel on YouTube, which I highly recommend.

As it turns out, this testing technique has been independently invented by many people, who have come up with different names to describe their inventions, such as "Snapshot Testing" or "Golden Master Testing". I think that the term "Approval Testing", although not perfect, is preferable over those, and good enough to choose as the standard.

So, I re-visited this post, and I changed all of my terminology to speak of "Approval Testing" instead of "Audit Testing". I did this with some sadness, because the term "Audit Testing" is more versatile, allowing me to speak of "audit output" and "audit files", whereas the term "Approval Testing" does not seem to allow such versatility: terms like "approval output" and "approval file" do not sound right. So, when fixing the terminology used in this post I had to go with the more generic terms "test output" and "test output file", which are less expressive.

It should be noted that Dave Farley from Modern Software Engineering sees Approval Testing as primarily suitable for testing legacy code, whereas I view it as suitable for any kind of software development where the output of the component-under-test is expected to change, intentionally, as the code evolves. Since this is true of almost all software development, Approval Testing is applicable practically everywhere.

Addendum: Further reading

To see what others say about Approval Testing, see:

YouTube: Modern Software Engineering: Add APPROVAL TESTING To Your Bag Of Tricks (2023-03-03)

YouTube: Modern Software Engineering: Approval Tests vs Acceptance Tests: What's the Difference? (2026-02-20)

YouTube: Modern Software Engineering: Gain CONFIDENCE With Approval Testing (2023-09-10)


Cover image: "Approval Testing" by michael.gr.