2024-04-01

Audit Testing

Abstract:

An automated software testing technique is presented which spares us from having to stipulate our expectations in test code, and from having to go fixing test code each time our expectations change.

(Useful pre-reading: About these papers)

The Problem

The most common scenario in automated software testing is ensuring that given specific input, a component-under-test produces expected output. The conventional way of achieving this is by feeding the component-under-test with a set of predetermined parameters, obtaining the output of the component-under-test, comparing the output against an instance of known-good output which has been hard-coded within the test, and failing the test if the two are not equal.
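For illustration, here is what such a conventional test might look like in Java with JUnit 5; the ReportGenerator component and its hard-coded expected output are made up for the example:

  import static org.junit.jupiter.api.Assertions.assertEquals;

  import org.junit.jupiter.api.Test;
  import java.time.Month;

  public class ReportGeneratorTest
  {
      @Test
      public void producesExpectedReport()
      {
          ReportGenerator generator = new ReportGenerator(); // hypothetical component-under-test
          String actual = generator.generate( 2024, Month.APRIL );
          // The known-good output is hard-coded into the test; any change to the
          // output, intended or not, makes this assertion fail.
          String expected = "Sales report for April 2024\nTotal: 1234.00\n";
          assertEquals( expected, actual );
      }
  }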

This approach works, but it is inefficient, because during the development and evolution of a software system we often make changes to the production code fully anticipating that the output of certain components will change. Unfortunately, each time we do this, the tests fail, because they are still expecting the old output. So, each change in the production code must be followed by a round of fixing tests to make them pass.

Note that under Test-Driven Development things are not any better: first we modify the tests to start expecting the new output, then we observe them fail, then we modify the components to produce the new output, then we watch the tests pass. We still have to stipulate our expectations in test code, and we still have to change test code each time our expectations change, which is inefficient.

This imposes a considerable burden on the software development process. As a matter of fact, it often happens that programmers refrain from making needed changes to their software because they dread the prospect of having to fix all the tests that will break as a result of those changes.

Audit Testing is a technique for automated software testing which aims to correct all this.

The Solution

Under Audit Testing, the assertions that verify the correctness of the output of the component-under-test are abolished, and replaced with code that simply saves the output to a text file. This file is known as the Audit File.

The test may still fail if the component-under-test encounters an error while producing output, in which case we follow a conventional test-fix-repeat workflow; but if the component-under-test manages to produce output, then the output is saved in the Audit File and the test completes successfully without examining the output.

The trick is that the Audit File is saved right next to the source code file of the test, which means that it is kept under Version Control. In the most common case, each test run produces the exact same audit output as the previous run, so nothing changes, meaning that all is good. If a test run produces different audit output from a previous test run, then the tooling alerts the developer to that effect, and the Version Control System additionally indicates that the Audit File has been modified and is in need of committing. Thus, the developer cannot fail to notice that the audit output has changed.
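As a rough sketch of what this might look like in practice, here is the earlier hypothetical ReportGenerator test rewritten in the audit style, along with a minimal Audit helper. The helper's name and signature, and the Maven/Gradle-style source layout it assumes, are illustrative choices, not something prescribed by the technique:

  import org.junit.jupiter.api.Test;

  import java.io.IOException;
  import java.io.UncheckedIOException;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.time.Month;

  // Hypothetical helper: writes the audit output to a file right next to the
  // source code of the test, so that the Version Control System tracks it.
  final class Audit
  {
      static void record( Class<?> testClass, String testName, String output )
      {
          try
          {
              // Assumes the conventional Maven/Gradle source layout.
              Path auditFile = Path.of( "src/test/java" )
                      .resolve( testClass.getPackageName().replace( '.', '/' ) )
                      .resolve( testName + ".audit.txt" );
              Files.writeString( auditFile, output );
          }
          catch( IOException e )
          {
              throw new UncheckedIOException( e );
          }
      }
  }

  public class ReportGeneratorTest
  {
      @Test
      public void reportGeneratorAudit()
      {
          ReportGenerator generator = new ReportGenerator(); // hypothetical component-under-test
          String output = generator.generate( 2024, Month.APRIL );
          // No assertion against hard-coded expected output: if we got this far,
          // save the output and let the Version Control System show whether it changed.
          Audit.record( ReportGeneratorTest.class, "reportGeneratorAudit", output );
      }
  }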

The developer can then utilize the "Compare with unmodified" feature of the Version Control System to see the differences between the audit output that was produced by the modified code, and the audit output of the last known-good test run. By visually inspecting these differences, the developer can decide whether they are as expected or not, according to the changes they made in the code.

  • If the observed differences are not as expected, then the developer needs to keep working on their code until they are.
  • If the observed differences are as expected, then the developer can simply commit the new code, along with the new Audit File, and they are done.

This way, we eliminate the following burdens:

  • Having to hard-code into the tests the output expected from the component-under-test.
  • Having to assert, in each test, that the output of the component-under-test matches the expected output.
  • Having to fix test code each time there is a (fully expected) change in the output of the component-under-test.

The eliminated burdens are traded for the following much simpler responsibilities:

  • The output of the component-under-test must be converted to text and written to an audit file.
  • When the version control system shows that an audit file changed after a test run, the differences must be reviewed, and a decision must be made as to whether they are as expected or not.
  • Tests and production code must be written with some noise reduction concerns in mind. (More on that further down.)

This represents a considerable optimization of the software development process.

Note that the arrangement is also convenient for the reviewer, who can see both the changes in the code and the resulting changes in the Audit Files.

As an added safety measure, the continuous build pipeline can deliberately fail the tests if an unclean working copy is detected after running the tests, because that would mean that the tests produced different results from what was expected, or that someone failed to commit some updated audit file.
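The technique does not prescribe any particular implementation of this check; a one-line shell step running git status --porcelain would do just as well. As a sketch, a small Java build step could look like this:

  import java.io.IOException;
  import java.nio.charset.StandardCharsets;

  // Fails the build if the working copy is not clean after the tests have run,
  // i.e. if some audit file was modified by the tests or was never committed.
  public final class CleanWorkingCopyCheck
  {
      public static void main( String[] args ) throws IOException, InterruptedException
      {
          Process process = new ProcessBuilder( "git", "status", "--porcelain" )
                  .redirectErrorStream( true )
                  .start();
          String status = new String( process.getInputStream().readAllBytes(), StandardCharsets.UTF_8 );
          process.waitFor();
          if( !status.isBlank() )
          {
              System.err.println( "Unclean working copy after running the tests:" );
              System.err.println( status );
              System.exit( 1 );
          }
      }
  }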

Noise reduction

For Audit Testing to work effectively, all audit output must be completely free of noise. By noise we mean:

  • Two test runs of the exact same code producing different audit output.
  • A single modification in the code producing wildly different audit output.

For example, if a test emits the username of the current user into the audit output, then the audit file generated by that test will be different for every user that runs it, even if the user does not modify any code.

Noise is undesirable, because:

  1. Needlessly modified audit files are a cause of false alarms.
  2. Examining changes in audit files only to discover that they are due to noise is a waste of time.
  3. A change that might be important to notice can be lost in the noise.

Noise in audit files is most commonly caused by various sources of non-determinism, such as:

  • Wall-clock time.

    As the saying goes, the arrow of time is always moving forward. This means that the "current" time coordinate is different on every test run, which in turn means that if any wall-clock timestamps find their way into the audit output, the resulting audit file will always differ from that of the previous run. So, for example, if your software generates a log, and you were thinking of using the log as your audit output, then you will have to either remove the timestamps from the log, or fake them. Faking the clock for the purpose of testing is a well-known best practice anyway, regardless of Audit Testing. To accomplish this, create a "Clock" interface, and propagate it to every place in your software that needs to know the current time. Create two implementations of that interface: one for production, which queries the actual wall-clock time from the operating environment, and one for testing, which starts from some fixed, known origin and advances by a fixed amount each time it is queried. (A sketch of such a Clock appears after this list.)

  • Random number generation.

    Random number generators are usually pseudo-random, and we tend to make them practically random by seeding them with the wall-clock time. This can be easily fixed for the purpose of testing by seeding them with a known fixed value instead. Some pseudo-random generators seed themselves with the wall-clock time without allowing us to override this behavior; this is deplorable. Such generators must be faked in their entirety for the purpose of testing. This extends to any other constructs that employ random number generation, such as GUIDs/UUIDs: they must also be faked when testing, using deterministic generators. (See the second sketch after this list.)

  • Multi-threading.

    Multiple threads running in parallel tend to exhibit unpredictable timing irregularities, and result in a chaotically changing order of events. If these threads affect audit output, then the ordering of the content of the audit file will be changing on every test run. For this reason, multi-threading must either be completely avoided when testing, or additional mechanisms (queuing, sorting, etc.) must be employed to guarantee a consistent ordering of audit output.

  • Floating-point number imprecision.

    Floating-point calculations can produce slightly different results depending on whether optimizations are enabled or not. To ensure that the audit file is unaffected, any floating-point values emitted to the audit file must be rounded to no more digits than necessary. At the very least, they must be rounded to one digit fewer than their full precision.

  • Other external factors.

    User names, computer names, file creation times, IP addresses resolved from DNS, etc. must either be prevented from finding their way into the audit output, or they must be faked when running tests. Fake your file-system; fake The Internet if necessary. For more information about faking stuff, see michael.gr - Software Testing with Fakes instead of Mocks.
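Here is a minimal sketch of the Clock arrangement described in the wall-clock item above. The names, the origin, and the one-second increment are arbitrary choices, and this Clock is the article's own interface, not to be confused with java.time.Clock:

  import java.time.Duration;
  import java.time.Instant;

  // The "Clock" interface described above.
  public interface Clock
  {
      Instant now();

      // Production implementation: queries the actual wall-clock time
      // from the operating environment.
      class Wall implements Clock
      {
          @Override public Instant now() { return Instant.now(); }
      }

      // Testing implementation: starts from a fixed, known origin and advances
      // by a fixed amount each time it is queried, so it is fully deterministic.
      class Fake implements Clock
      {
          private Instant current = Instant.parse( "2024-01-01T00:00:00Z" );

          @Override public Instant now()
          {
              Instant result = current;
              current = current.plus( Duration.ofSeconds( 1 ) );
              return result;
          }
      }
  }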
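Similarly, here is a sketch of deterministic stand-ins for random numbers and UUIDs, as mentioned in the random-number item above; the fixed seed and the counter-based UUIDs are illustrative choices:

  import java.util.Random;
  import java.util.UUID;

  // Deterministic stand-ins for testing: a pseudo-random generator with a fixed
  // seed, and a counter-based generator in place of UUID.randomUUID().
  public final class DeterministicGenerators
  {
      // Same seed, same sequence of pseudo-random numbers on every test run.
      public static final Random RANDOM = new Random( 12345 );

      private static long counter = 0;

      // Unique within a test run, identical across test runs.
      public static UUID nextUuid()
      {
          return new UUID( 0L, ++counter );
      }
  }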

In short, anything that would cause flakiness in software tests will cause noisiness in Audit Testing.

Additionally, the content of audit files can be affected by some constructs that are fully deterministic in their nature. These constructs will never result in changed audit files without any changes in the code, but may produce drastically different audit files as a result of only minute changes in the code. For example:

  • Hash Table Rehashing.

    A hash table may decide to re-hash itself as a result of a single key addition, if that addition happens to cause some internal load factor threshold to be exceeded. Exactly when and how this happens depends on the implementation of the hash table, and we usually have no control over it. After re-hashing, the order in which the hash table enumerates its keys is drastically different, and if the keys are emitted to audit output, then the audit file will be drastically different. To avoid this, replace plain hash tables with hash tables that retain the order of key insertion. (See the sketch after this list.)

  • Insufficient Sorting Keys.

    When sorting data, the order of items with identical keys is undefined. It is still deterministic, but the addition or removal of a single item can cause all items with the same sorting key to be arbitrarily rearranged. To avoid this, always use a full set of sorting keys when sorting data, so as to give every item a specific, unique position in the ordering. Introduce additional sorting keys if necessary, even if you would not normally have a use for them. (The sketch after this list also illustrates this.)
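Here is a sketch illustrating both points above in Java: an insertion-ordered map in place of a plain hash table, and a full chain of sorting keys. The Item type is made up for the example:

  import java.util.Comparator;
  import java.util.LinkedHashMap;
  import java.util.List;
  import java.util.Map;

  public final class DeterministicOrdering
  {
      // A plain HashMap may enumerate its keys in a drastically different order
      // after a re-hash; a LinkedHashMap always enumerates them in insertion order.
      public static Map<String, Integer> newInsertionOrderedMap()
      {
          return new LinkedHashMap<>();
      }

      // Hypothetical record with a non-unique name and a unique id.
      public record Item( String name, long id ) { }

      // A full chain of sorting keys: when two items have the same name, the id
      // breaks the tie, so the resulting order is always the same.
      public static void sortDeterministically( List<Item> items )
      {
          items.sort( Comparator.comparing( Item::name ).thenComparingLong( Item::id ) );
      }
  }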

Noise reduction aims to ensure that we will never see changes in the audit files unless there have been changes in the code, and that for every unique change in the code we will see a specific expected set of changes in the audit output, instead of a large number of irrelevant changes. This ensures that the single change that matters will not be lost in the noise, and makes it easier to determine that the modifications we made to the code have exactly the intended consequences and not any unintended consequences.

Note that in some cases, noise reduction can be implemented in the tests rather than in the production code. For example, instead of replacing a plain hash table with an ordered hash table in production code, our test can obtain the contents of the plain hash table and sort them before writing them to the audit file. However, this may not be possible in cases where the hash table is several transformations away from the auditing. Thus, replacing a plain hash table with an ordered hash table may sometimes be necessary in production code.

Noise reduction in production code can be either always enabled, or only enabled during testing. The most performant choice is to only have it enabled during testing, but the safest choice is to have it always enabled.

Failure Testing

Failure Testing is the practice of deliberately supplying the component-under-test with invalid input and ensuring that the component-under-test detects the error and throws an appropriate exception. Such scenarios can leverage Audit Testing by simply catching exceptions and serializing them, as text, into the audit output.
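For example, a failure test might look like this, reusing the hypothetical Audit helper sketched earlier; the Parser component and the invalid input are made up:

  import org.junit.jupiter.api.Test;

  public class ParserTest
  {
      @Test
      public void rejectsMalformedInput()
      {
          String auditText;
          try
          {
              new Parser().parse( "%%% not valid input %%%" ); // hypothetical component-under-test
              auditText = "no exception was thrown!";
          }
          catch( Exception exception )
          {
              // Serialize the exception, as text, into the audit output.
              auditText = exception.getClass().getName() + ": " + exception.getMessage();
          }
          Audit.record( ParserTest.class, "rejectsMalformedInput", auditText );
      }
  }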

Applicability

Audit Testing is most readily useful when the component-under-test produces results as text, or results that are directly translatable to text. With a bit of effort, any kind of output can be converted to text, so Audit Testing is universally applicable.

Must Audit Files be committed?

It is in theory possible to refrain from storing Audit Files in the source code repository, but doing so would have the following disadvantages:

  • It would deprive the code reviewer of the convenience of being able to see not only the changes in the code, but also the differences that these changes have introduced in the audit output of the tests.
  • It would require the developer to remember to run the tests immediately after each pull from the source code repository, so as to produce the unmodified Audit Files locally before making any modifications to the code that would further modify them.
  • It would make it more difficult for the developer to take notice when the Audit Files change.
  • It would make it more difficult for the developer to see diffs between the modified Audit Files and the unmodified ones.

Of course all of this could be taken care of with some extra tooling. What remains to be seen is whether the effort of developing such tooling can be justified by the mere benefit of not having to store Audit Files in the source code repository.

Conclusion

Audit Testing is a universally applicable technique for automated software testing which can significantly reduce the effort of writing and maintaining tests by sparing us from having to stipulate our expectations in test code, and from having to go fixing test code each time our expectations change.




Cover image: "Audit Testing" by michael.gr.
