White-Box vs. Black-Box Testing

I have something blasphemous to tell you.

Unit Testing is wrong.

There, I said it.

I know I just insulted most people's sacred cow.

Sorry, not sorry.

I will explain; bear with me.

(Useful pre-reading: About these papers)

So, what is Unit Testing anyway?

Unit Testing, according to its definition, aims to examine a module in isolation, to make sure that it behaves as expected without uncertainties introduced by the behavior of other modules that it interacts with. These other modules are known as dependencies. To achieve this, the test refrains from connecting the module with its dependencies, and instead emulates the behavior of the dependencies. That is what makes it a Unit Test, as opposed to an Integration Test.

The emulation of the dependencies is meant to be done in a very straightforward and inexpensive way, because if it were complicated, it would introduce uncertainties of its own. So, let us imagine for a moment that the math library is a dependency of the module under test, just for the sake of the example. When the module under test asks for the cosine of an angle, the Unit Test does not want to invoke the actual math library to perform the cosine computation. Instead, the Unit Test makes sure beforehand to supply the module under test with inputs that will cause it to work with a known angle of, say, 60 degrees. The Unit Test can therefore anticipate that the module will ask for the cosine of a 60-degree angle, at which point it supplies the module with a hard-coded value of 0.5, which is known to be the cosine of 60 degrees. The Unit Test then proceeds to make sure that the module does the right thing with that 0.5 and produces the right results.
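To make this concrete, here is a minimal sketch of such a test in Java. All of the names here — Trig, AngleClassifier, leansForward — are hypothetical, invented purely for this example:

```java
// The dependency's contract, standing in for the math library.
interface Trig {
    double cos(double degrees);
}

// The module under test: it asks its dependency for a cosine and acts on the result.
class AngleClassifier {
    private final Trig trig;
    AngleClassifier(Trig trig) { this.trig = trig; }
    boolean leansForward(double degrees) { return trig.cos(degrees) > 0.0; }
}

public class MockSketch {
    public static void main(String[] args) {
        // The emulated dependency: no real math is performed; it just returns
        // the hard-coded answer for the one call the test anticipates.
        Trig fakeTrig = degrees -> 0.5; // the known cosine of 60 degrees
        AngleClassifier classifier = new AngleClassifier(fakeTrig);
        System.out.println(classifier.leansForward(60.0)); // prints "true"
    }
}
```

The fake dependency here is hand-rolled as a lambda; in practice a mocking framework would typically generate it, but the principle is identical.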

In doing so, the Unit Test expects the module under test to interact with each of its dependencies in a strictly predetermined way: a specific set of calls is expected to be made, in a specific order, with specific arguments. Thus, the Unit Test has knowledge of exactly how the module is implemented: not only must the outputs of the module be according to spec, but every single little detail about the inner workings of the module must also go exactly as expected. Therefore, Unit Testing is white-box testing by nature.

What is wrong with White-Box Testing

  • White-box testing is not agnostic enough.
    • Just as users tend to test software in ways that the developer never thought of (the well-known "works for me but always breaks in the hands of the user" paradox), software tests written by developers who maintain an agnostic stance about the inner workings of the production code are likely to test for things that were never considered by those who wrote the production code.
  • White-box testing is a laborious endeavor.
    • The amount of test code that has to be written and maintained often far exceeds the amount of production code that is being tested.
    • Each modification to the inner workings of production code requires corresponding modifications to the testing code, even if the interface and behavior of the production code remains unchanged.
    • With respect to procedural logic within the module under test, the Unit Test has to make sure that every step of each workflow is followed, so the test essentially has to anticipate every single decision that the module will make. This means that the test duplicates all of the knowledge embodied within the module, and essentially constitutes a repetition of all of the procedural logic of the module, expressed in a different way. This problem has also been identified by others, and it is sometimes called the "over-specified tests problem".
  • White-box testing suffers from The Fragile Test Problem.
    • A bug fix in the production code more often than not causes tests to break, which then have to be fixed. Note that this often happens even if we first write a test for the bug, which is expected to initially fail and to start passing once the bug fix is applied: other, previously existing tests will break. Unfortunately, it is often unclear to what extent the tests are wrong, and to what extent the tests are right but the production code suffers from other, dormant bugs that keep causing the tests to fail. When fixing tests as a result of bug fixes in production, the general assumption is that the production code is now correct, therefore the test must be wrong, so the test is often hastily modified to make it pass against the existing production code. This often results in tests that "test around" pre-existing bugs, meaning that the tests only pass if the bugs are there.
  • White-box tests are not re-usable.
    • It should be possible to completely rewrite a piece of production code and then reuse the old tests to make sure that the new code works exactly as the old one did. This is impossible with white-box testing.
    • It should be possible to write a test once and use it to test multiple different implementations of a certain module, created by independently working development teams taking different approaches to solving the same problem. This is also impossible with white-box testing.
  • White-box testing hinders refactoring.
    • Quite often, refactorings which would affect the entire code base are unattainable because they would necessitate rewriting all unit tests, even if the refactorings themselves would have no effect on the behavior of the module and would only require limited and harmless modifications to the production code, as is the case when replacing one third-party library with another.
  • White-box testing is highly presumptuous.
    • White-box testing claims to have knowledge of exactly how the dependencies behave, which may not be accurate. As an extreme example, the cosine of 60 is 0.5 only if that 60 is in degrees; if the cosine function of the actual math library used in production works with radians instead of degrees, then the result will be something completely different, and the Unit Test will be achieving nothing but ensuring that the module will only pass the test if it severely malfunctions. In real-world scenarios the wrongful assumptions are much more subtle than a degrees vs radians discrepancy, making them a lot harder to detect and troubleshoot.
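The degrees-vs-radians discrepancy is easy to demonstrate in Java, where Math.cos() works in radians: handing it 60 outright yields something entirely different from the 0.5 that a degrees-based mock would hard-code.

```java
public class RadiansVsDegrees {
    public static void main(String[] args) {
        // What the mock assumed: the cosine of 60 *degrees* is 0.5.
        double asDegrees = Math.cos(Math.toRadians(60)); // ≈ 0.5
        // What the real library computes if handed 60 directly: 60 *radians*.
        double asRadians = Math.cos(60); // ≈ -0.952
        System.out.printf("cos(60 deg) = %.3f, cos(60 rad) = %.3f%n",
                asDegrees, asRadians);
    }
}
```

A mock that hard-codes 0.5 for the call cos(60) would happily keep passing a module that feeds degrees to a radians-based library — exactly the kind of malfunction described above.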

In the preface of the book The Art of Unit Testing (Manning, 2009) by Roy Osherove, the author admits to having participated in a project which failed in large part due to the tremendous development burden imposed by badly designed unit tests which had to be maintained throughout the duration of the development effort. The author does not go into detail about the design of those unit tests and what was so bad about it, but I would dare to postulate that it was simply the fact that they were... Unit Tests.

Is white-box testing good for anything?

If you are sending humans to space, or developing any other high-criticality system, then fine: go ahead and do white-box testing, as well as inside-out testing, and upside-down testing, and anything else that you can think of, because in high-criticality software, there is no amount of testing that constitutes too much testing. However, the vast majority of software written in the world today is not high-criticality software; it is just plain, normal, garden-variety commercial software. Applying space-grade practices in the development of commercial software does not make business sense, because space-grade practices tend to be much more expensive than commercial practices.

In high criticality, it is all about safety; in commercial, it is all about cost effectiveness.

In high criticality, it is all about leaving nothing to chance; in commercial, it is all about meeting the requirements.

What about leaving nothing to chance?

It is true that if you do black-box testing you cannot be absolutely sure that absolutely everything goes absolutely as intended. For example, you may be testing a module to ensure that given a certain input, a certain record is written to the database. 

  • If you do white-box testing, you can ensure not only that the record has the correct content, but also that the record is written once and only once. 
  • If you do black-box testing, all you care about is that at the end of the day, a record with the correct content can be found in the database; there may be a bug which inadvertently causes the record to be written twice, and you would not know.
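Here is a sketch of that scenario in Java. Everything here — Database, OrderProcessor, the deliberate double write — is invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// A trivial stand-in for the real database.
interface Database {
    void write(String record);
    List<String> findAll();
}

class InMemoryDatabase implements Database {
    private final List<String> records = new ArrayList<>();
    public void write(String record) { records.add(record); }
    public List<String> findAll() { return List.copyOf(records); }
}

// The module under test, containing the hypothetical accidental double write.
class OrderProcessor {
    private final Database db;
    OrderProcessor(Database db) { this.db = db; }
    void process(String order) {
        db.write("processed:" + order);
        db.write("processed:" + order); // the bug: written twice
    }
}

public class BlackBoxVsWhiteBox {
    public static void main(String[] args) {
        Database db = new InMemoryDatabase();
        new OrderProcessor(db).process("order-1");

        // Black-box check: a record with the correct content exists. It passes.
        boolean contentOk = db.findAll().contains("processed:order-1");

        // White-box-style check: the record was written once and only once. It fails.
        long writes = db.findAll().stream()
                .filter("processed:order-1"::equals).count();

        System.out.println("content ok: " + contentOk + ", writes: " + writes);
        // prints "content ok: true, writes: 2"
    }
}
```

The black-box assertion passes despite the double write; whether that matters is exactly the requirements-versus-perfection question discussed next.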

So, at this point some might argue that in promoting black-box testing I am actually advocating imperfect software. Well, guess what: in the commercial sector, there is no such thing as perfect software; there is only software that meets its requirements, and software that does not. If the requirements are met, then some record being written twice is just a performance concern. Furthermore, it is a performance concern not only in the sense of the performance of the running software system, but also in the sense of the performance of your development process. By established practice, it is perfectly fine to knowingly allow a record to be written twice if eliminating the duplication would require more development work than it is worth; so how is that any different from following an efficient development methodology which might allow that record to be written twice?

This is in line with the observation that nobody aims to write software that is free from imperfections. Virtually every method that returns a collection, in all of the Java code written since the dawn of time, makes a safety copy of that collection. These safety copies are almost always unnecessary, and yet people keep making them, because they do not want to reason about what is safe and what is not on a case-by-case basis; case-by-case treatment of safety concerns is the stuff that bugs are made of. Software that is free of bugs is software that meets the requirements, and that's all that counts.

(Note: personally, I never make safety copies of collections; I use special unmodifiable collection interfaces instead; but that's a different story.)
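For contrast, here is what the two approaches look like in plain Java. Note that Collections.unmodifiableList is used below merely as a stand-in for the special unmodifiable collection interfaces mentioned above, which are a different mechanism:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SafetyCopies {
    private static final List<String> names =
            new ArrayList<>(List.of("alice", "bob"));

    // The ubiquitous defensive idiom: copy on every call, usually unnecessarily.
    static List<String> namesWithSafetyCopy() {
        return new ArrayList<>(names);
    }

    // The alternative: expose an unmodifiable view; no copying is performed.
    static List<String> namesUnmodifiable() {
        return Collections.unmodifiableList(names);
    }

    public static void main(String[] args) {
        List<String> view = namesUnmodifiable();
        try {
            view.add("mallory"); // callers cannot mutate through the view
        } catch (UnsupportedOperationException e) {
            System.out.println("mutation rejected");
        }
    }
}
```

The view costs nothing per call, whereas the safety copy allocates and copies every time; the trade-off is that the view throws only at run time if a caller attempts mutation.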


In the book Design Patterns: Elements of Reusable Object-Oriented Software (Addison-Wesley, 1994) by The Gang of Four (Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides) one of the principles listed is:

Program to an interface, not an implementation.

Virtually all software engineers agree with this self-evident maxim, and nobody in their right mind would take issue with it. To program against the implementation rather than the interface is universally considered a misguided practice.

In the context of testing, the corollary to this maxim is:

Test against the interface, not against the implementation.

In other words, do black-box testing, not white-box testing.
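As a trivial illustration of what testing against the interface buys you, here is a single black-box test written against Java's Queue interface. It exercises two entirely different implementations without knowing anything about their internals; the FIFO check is a made-up example, not an exhaustive contract test:

```java
import java.util.ArrayDeque;
import java.util.LinkedList;
import java.util.Queue;

public class QueueContractTest {
    // The test knows only the interface; it works on any implementation.
    static void testFifoOrder(Queue<Integer> queue) {
        queue.add(1);
        queue.add(2);
        if (queue.poll() != 1 || queue.poll() != 2)
            throw new AssertionError("queue is not FIFO");
    }

    public static void main(String[] args) {
        testFifoOrder(new LinkedList<>()); // one implementation
        testFifoOrder(new ArrayDeque<>()); // a completely different one, same test
        System.out.println("both implementations pass");
    }
}
```

This is exactly the re-usability that white-box tests forfeit: the same test verifies independently written implementations, and survives a complete rewrite of any one of them.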

This is not an idea unique to me; others have had it before, and have similar things to say. Ian Cooper, in his "TDD, where did it all go wrong" talk, states that in TDD a Unit Test is defined as a test that runs in isolation from other tests, not a test that isolates the unit under test from other units. In other words, the unit of isolation is the test, not the unit under test. Some excerpts from the talk are here: Build Stuff '13: Ian Cooper - TDD, where did it all go wrong, and the full talk is here: TDD, Where Did It All Go Wrong (Ian Cooper, 2017).

Other references:

If not Unit Testing, then what?

So, one might ask: if Unit Testing is wrong, then what should we be doing instead? The original impetus behind the invention of Unit Testing still remains: when we test a module, we want to make sure that the observed behavior is not affected by potential malfunctions in its dependencies. How can we achieve that?

The way I have been handling this in recent years is by means of a method that I call Incremental Integration Testing. You can read about it here: michael.gr - Incremental Integration Testing.
