2022-10-19

Incremental Integration Testing


Abstract (descriptive)

A new method for Automated Software Testing is presented as an alternative to Unit Testing. The new method retains the benefit of Unit Testing, which is Defect Localization, but eliminates the need for mocking, thus greatly lessening the effort of writing and maintaining tests.

(Useful pre-reading: About these papers)

Abstract (informative)

Unit Testing aims to achieve Defect Localization by replacing the dependencies of the Component Under Test with Mocks. The use of Mocks is laborious, complicated, over-specified, presumptuous, and constitutes testing against the implementation, not against the interface, thus leading to brittle tests that hinder refactoring rather than facilitating it. To avoid these problems, a new method is proposed, called Incremental Integration Testing. The new method allows each component to be tested in integration with its dependencies, (or with Fakes thereof,) thus completely abolishing Mocks. Defect Localization is achieved by arranging the order in which tests are executed so that the dependencies of a component get tested before the component gets tested, and by stopping as soon as the first defect is encountered. Thus, when a test discovers a defect, we can be sufficiently confident that the defect lies in the component being tested, and not in any of its dependencies, because by that time all of its dependencies have already passed their tests.

The problem: Dependency-Induced Uncertainty

The goal of software testing is to exercise software components under various usage scenarios to ensure that they meet their requirements. The goal of automated software testing is to do so using software.

The simplest and most straightforward way to test a software component is to set up some input, invoke the component to perform a certain job, and then examine the output to ensure that it is what it is expected to be.
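
As a concrete (and purely illustrative) example, a JUnit test of a hypothetical StringUtils.reverse() function could look as follows; the class and method names are made up and not specific to any real codebase:

    import org.junit.jupiter.api.Test;
    import static org.junit.jupiter.api.Assertions.assertEquals;

    public class StringUtilsTest
    {
        @Test
        public void reverse_reverses_a_string()
        {
            String input = "hello";                       // set up some input
            String output = StringUtils.reverse( input ); // invoke the component to perform a certain job
            assertEquals( "olleh", output );              // examine the output to ensure it is what we expect
        }
    }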

However, a software component rarely does its job all by itself; more often than not, it is part of a system, and it delegates parts of its job to other components in the system. When component A makes use of component B we say that A depends on B, or that B is a dependency of A. 

When testing a component that has dependencies, we are faced with a problem: the test may tell us that there is a defect somewhere, but it is unclear whether the defect lies in the component being tested, or in one or more of its dependencies. For the purposes of this paper I would like to give a name to this problem, so I will call it the problem of Dependency-Induced Uncertainty in Testing.

Ideally, we would like each software test to be conducted in such a way as to detect defects specifically in the component that is being tested, instead of extraneous defects in its dependencies; in other words, we would like to achieve Defect Localization, which is to say that we want to minimize Dependency-Induced Uncertainty in Testing.

The existing solution: Unit Testing

Unit Testing (wikipedia) was invented specifically in order to address the problem of Dependency-Induced Uncertainty in Testing. It takes an extremely drastic approach: if the use of dependencies introduces uncertainties, one way to eliminate those uncertainties is to eliminate the dependencies. Thus, Unit Testing aims to test each component in strict isolation. Hence, its name. 

To achieve this remarkably ambitious goal, Unit Testing refrains from supplying the component under test with the actual dependencies that it would normally receive in a production environment; instead, it supplies the component under test with specially crafted substitutes of its dependencies. There exist a few different kinds of substitutes, but by far the most widely used kind is Mocks.

Each Mock must be hand-written for every individual test that is performed; it exposes the same interface as the real dependency that it substitutes, and it expects specific methods of that interface to be invoked by the component-under-test, with specific argument values, sometimes even in a specific order of invocation. If anything goes wrong, such as an unexpected method being invoked, an expected method not being invoked, or a parameter having a wrong value, the Mock fails the test. When one of the expected methods is invoked, the Mock does nothing of the sort that the real dependency would do; instead, the Mock is hard-coded to yield a fabricated response which is intended to be exactly the same as the response that the real dependency would have produced if it were being used, and if it were working exactly according to its specification.
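
To make the above concrete, here is a minimal sketch of what such a test typically looks like in Java with the Mockito mocking framework; the PriceCalculator and TaxService names are made up purely for the sake of illustration:

    import org.junit.jupiter.api.Test;
    import static org.junit.jupiter.api.Assertions.assertEquals;
    import static org.mockito.Mockito.*;

    public class PriceCalculatorTest
    {
        @Test
        public void calculates_price_including_tax()
        {
            // hand-craft a substitute for the real TaxService dependency
            TaxService taxService = mock( TaxService.class );
            // hard-code the fabricated response that the real dependency is presumed to produce
            when( taxService.rateFor( "NL" ) ).thenReturn( 0.21 );

            PriceCalculator priceCalculator = new PriceCalculator( taxService );
            assertEquals( 121.0, priceCalculator.totalFor( "NL", 100.0 ), 0.001 );

            // fail the test unless the expected method was invoked with the expected arguments
            verify( taxService ).rateFor( "NL" );
        }
    }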

Or at least, that is the intention.

Drawbacks of Unit Testing

  • Complex and laborious
    • In each test it is not enough to simply set up the input, invoke the component, and examine the output; we also have to anticipate every single call that the component will make to its dependencies, and for each call we have to set up a mock, expecting specific parameter values, and producing a specific response aiming to emulate the real dependency under the same circumstances.

    • Luckily, mocking frameworks lessen the amount of code necessary to accomplish this, but no matter how terse the mocking code is, the fact still remains that it implements a substantial amount of functionality which represents considerable complexity.

    • One of the well-known caveats of software testing at large (regardless of what kind of testing it is) is that a test failure does not necessarily indicate a defect in the production code; it always indicates a defect either in the production code, or in the test itself. The only way to know is to troubleshoot.

    • Thus, the more code we put in tests, and the more complex this code is, the more time we end up wasting in chasing and fixing bugs in the tests themselves rather than in the code that they are meant to test.

  • Over-specified

    • Unit Testing is concerned not only with what a component accomplishes, but also with every little detail about how the component goes about accomplishing it. This means that when we engage in Unit Testing we are essentially expressing all of our application logic twice: once with production code expressing the logic in imperative mode, and once more with testing code expressing the same logic in expectational mode. In both cases, we write copious amounts of code describing precisely what should happen in excruciatingly meticulous detail.
    • During testing, if the slightest thing happens that deviates from the expectations, the test fails. However, the behavior of a component may legitimately change as software evolves, but even the legitimate changes will invariably break the tests. This means that every time we change the behavior of the production code, we have to go fix all the tests to expect the new behavior.
    • The original promise of Automated Software Testing was to enable us to continuously evolve software without fear of breaking it. The idea was that whenever you make a modification to the software, you can re-run the tests to ensure that you have not broken anything. With Unit Testing, this does not work, because every time you change the production code you have to also change the tests, even if the requirements of the production code have not changed.

    • Note that over-specification might not even be a goal in and of itself in some cases of Unit Testing, but in all cases it is unavoidable due to the fact that Unit Testing eliminates the dependencies: the requests that the component under test sends to its dependencies could conceivably be routed into a black hole and ignored, but in order for the component under test to continue working, (so as to be tested,) it still needs to receive a meaningful response to each request; thus, in order to be able to produce the needed responses, the test has to expect each request, even if the intention of the test was not to know how, or even whether, the request is made.
  • Presumptuous

    • Each Unit Test claims to have detailed knowledge of not only how the component-under-test invokes its dependencies, but also how each real dependency would respond to each invocation in a production environment, which is a highly presumptuous thing to do. 

    • Such presumptuousness might be okay if we are building high-criticality software, where each dependency is likely to have requirements and a specification that are well defined in official documents and highly unlikely to change; however, in all other software, which is regular, commercial, non-high-criticality software, things are a lot less strict: not only do the requirements and specifications change all the time, but also, quite often, the requirements, the specification, and even the documentation are the code itself, and the code changes every time a new commit is made to the source code repository. This might not be ideal, but it is pragmatic, and it is established practice. Thus, the only way to know exactly how a component behaves tends to be to actually invoke the latest version of that component and see how it responds, while the mechanism which ensures that these responses are what they are supposed to be is that component's own tests, which are unrelated to the tests of the components that depend on it.

    • As a result of this, Unit Testing often places us in the all too familiar situation where our Unit Tests all pass with flying colors, but our Integration Tests miserably fail because the behavior of the real dependencies turns out to be different from what the mocks assumed it would be.
  • Fragile
    • Non-reusable

          The above disadvantages of Unit Testing are direct consequences of the fact that it is White-Box Testing by nature, because it is intentionally testing against the implementation, not against the interface. What we need to be doing instead is Black-Box testing, which means that Unit Testing should be avoided, despite the entire Software Industry's addiction to it.

          Note that I am not the only one, nor the first one, to voice dissatisfaction with Unit Testing with Mocks. People have been noticing that although tests are intended to facilitate refactoring by ensuring that the code still works after refactoring, tests often end up hindering refactoring, because they are so tied to the implementation that you can't refactor anything without breaking the tests. This problem has been identified by renowned personalities such as Martin Fowler and Ian Cooper, and even by Kent Beck, the inventor of Test-Driven Development (TDD).

          In the video Thoughtworks - TW Hangouts: Is TDD dead? (youtube) at 21':10'' Kent Beck says "My personal practice is I mock almost nothing" and at 23':56'' Martin Fowler says "I'm with Kent, I hardly ever use mocks". In the Fragile Test section of his book xUnit Test Patterns: Refactoring Test Code (xunitpatterns.com) author Gerard Meszaros states that extensive use of Mock Objects causes overcoupled tests. In his presentation TDD, where did it all go wrong? (InfoQ, YouTube) at 49':32'' Ian Cooper says "I argue quite heavily against mocks because they are overspecified."

          Note that in an attempt to avoid sounding too blasphemous, none of these people call for the complete abolition of mocks; they only warn against the excessive use of mocks. Furthermore, they seem to have little to say about any alternative means of avoiding Dependency-Induced Uncertainty (achieving Defect Localization), and yet they continue to call what they do Unit Testing despite the fact that they do not seem to be isolating the components under test.

          Ian Cooper even goes as far as to suggest that in the context of Test Driven Development (TDD) the term Unit Testing does not refer to isolating the components under test from each other, but rather to isolating the tests from each other. With this feat of mental acrobatics he achieves the best of both worlds: he can disavow mocks while continuing to call what he practices Unit Testing. (Because apparently, to tell people that Unit Testing is wrong is way too blasphemous even for a Software Engineer with a 20 cm long beard, extensive tattoos, and large hollow earrings.) I do fully agree that it is the tests that should be kept isolated, but I consider this re-definition of the term to be arbitrary and unwarranted. Unit Testing has already been defined, its definition is quite unambiguous, and according to this definition, it is problematic; so, instead of trying to change the definition, we must abandon Unit Testing and start doing something else, which requires a new name.

          A new solution: Incremental Integration Testing

          If we were to abandon Unit Testing, then one might ask what we should be doing instead. Obviously, we must somehow continue testing our software, and we would like to also continue doing so without Dependency-Induced Uncertainty.

          As it turns out, eliminating the dependencies is just one way of dealing with Dependency-Induced Uncertainty; another, more pragmatic approach is as follows:

          Allow each component to be tested in integration with its dependencies, but only after each one of the dependencies has undergone its own testing, and has successfully passed it.
          Thus, any observed malfunction can be attributed with a high level of confidence to the component being tested, and not to any of its dependencies, because the dependencies have already been tested.

          I call this Incremental Integration Testing.

          An alternative way of arriving at the idea of Incremental Integration Testing begins with the philosophical observation that strictly speaking, there is no such thing as a Unit Test; there always exist dependencies which by established practice we never mock and invariably integrate in Unit Tests without blinking an eye; these are, for example:

          • Many of the external libraries that we use.
          • Most of the functionality provided by the Runtime Environment in which our system runs. 
          • Virtually all of the functionality provided by the Runtime Library of the language we are using.

          Nobody mocks standard collections such as array-lists, linked-lists, hash-sets, and hash-maps; very few people bother with mocking filesystems; nobody would mock an advanced math library, a serialization library, and the like; even if one was so paranoid as to mock those, at the extreme end, nobody mocks the MUL and DIV instructions of the CPU; so clearly, there are always some things that we take for granted.

          We allow ourselves the luxury of taking these things for granted because we believe that they have been sufficiently tested by their respective creators and can be reasonably assumed to be free of defects. So, why not also take our own creations for granted once we have tested them? Are we testing them sufficiently or not?

          Prior Art

          An internet search for "Incremental Integration Testing" does yield some results. An examination of those results reveals that they are referring to some strategy for integration testing which is meant to be performed manually by human testers, constitutes an alternative to big-bang integration testing, and requires full Unit Testing of the traditional kind to have already taken place. I am hereby appropriating this term, so from now on it shall mean what I intend it to mean. If a context ever arises where disambiguation is needed, the terms "automated" vs. "manual" can be used.

          Implementing the solution: the poor man's approach

          As explained earlier, Incremental Integration Testing requires that when we test a component, all of its dependencies must have already been tested. Thus, Incremental Integration Testing necessitates exercising control over the order in which tests are executed.

          Most testing frameworks execute tests in alphanumeric order, so if we want to change the order of execution all we have to do is to appropriately name the tests, and the directories in which they reside.

          For example:

          Let us suppose that we have the following modules:

          com.acme.alpha_depends_on_bravo
          com.acme.bravo_depends_on_nothing
          com.acme.charlie_depends_on_alpha

          Note how the modules are listed alphanumerically, but they are not listed in order of dependency.

          Let us also suppose that we have one test suite for each module. By default, the names of the test suites follow the names of the modules that they test, so again, a listing of the test suites in alphanumeric order does not match the order of dependency of the modules that they test:

          com.acme.alpha_depends_on_bravo_tests
          com.acme.bravo_depends_on_nothing_tests
          com.acme.charlie_depends_on_alpha_tests

          To achieve Incremental Integration Testing, we add a suitably chosen prefix to the name of each test suite, as follows:

          com.acme.T02_alpha_depends_on_bravo_tests
          com.acme.T01_bravo_depends_on_nothing_tests
          com.acme.T03_charlie_depends_on_alpha_tests

          Note how the prefixes have been chosen in such a way as to establish a new alphanumerical order for the tests. Thus, an alphanumeric listing of the test suites now lists them in order of dependency of the modules that they test: 

          com.acme.T01_bravo_depends_on_nothing_tests
          com.acme.T02_alpha_depends_on_bravo_tests
          com.acme.T03_charlie_depends_on_alpha_tests

          At this point Java programmers might object that this is impossible, because in Java, the tests always go in the same module as the production code, directory names must match package names, and test package names always match production package names. Well, I have news for you: they don't have to. The practice of doing things this way is very widespread in the Java world, but there are no rules that require it: the tests do not in fact have to be in the same module, nor in the same package as the production code. The only inviolable rule is that directory names must match package names, but you can call your test packages whatever you like, and your test directories accordingly. Java developers tend to place tests in the same module as the production code simply because the tools (maven) have a built-in provision for this, without ever questioning whether there is any actual benefit in doing so. (There isn't. As a matter of fact, in the DotNet world there is no such provision, and nobody complains.) Furthermore, Java developers tend to place tests in the same package as the production code for no purpose other than to make package-private entities of their production code accessible from their tests, but this is testing against the implementation, not against the interface, and therefore misguided. So, I know that this is a very hard thing to ask from most Java programmers, but trust me, if you would only dare to take a tiny step off the beaten path, if you would for just once do something for reasons other than "everyone else does it", you can very well do the renaming necessary to achieve Incremental Integration Testing.
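
          For example, nothing prevents a maven build from having a layout along the following lines; the module, directory, and package names here are of course illustrative, not prescriptive:

              alpha/                                          (production module)
                  src/main/java/com/acme/alpha/...
              alpha-tests/                                    (separate test module, depending on alpha)
                  src/test/java/com/acme/T02_alpha_tests/...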

          Now, admittedly, renaming tests in order to achieve a certain order of execution is not an ideal solution. It is awkward, it is thought-intensive since we have to figure out the right order of execution by ourselves, and it is error-prone because there is nothing to guarantee that we will get the order right. That's why I call it "the poor man's solution".  Let us now see how all of this could be automated.

          Implementing the solution: the automated approach

          Here is an algorithm to automate Incremental Integration Testing (a code sketch follows the list):

          1. Begin by building a model of the dependency graph of the entire software system.
            • This requires system-wide static analysis to discover all components in our system, and all dependencies of each component. I did not say it was going to be easy.
            • The graph should not include external dependencies.
          2. Test each leaf node in the model.
            • A leaf node in the dependency graph is a node which has no dependencies; at this level, a Unit Test is indistinguishable from an Integration Test, because there are no dependencies to either integrate or mock.
          3. If any malfunction is discovered during step 2, then stop as soon as step 2 is complete.
            • If a certain component fails to pass its test, it is counter-productive to proceed with the tests of components that depend on it. Unit Testing seems to be completely oblivious to this little fact; Incremental Integration Testing fixes this.
          4. Remove the leaf nodes from the model of the dependency graph.
            • Thus removing the nodes that were previously tested in step 2, and obtaining a new, smaller graph, where a different set of nodes are now the leaf nodes. 
            • The dependencies of the new set of leaf nodes have already been successfully tested, so they are of no interest anymore: they are as good as external dependencies now.
          5. Repeat starting from step 2, until there are no more nodes left in the model.
            • Allowing each component to be tested in integration with its dependencies, since they have already been tested.
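
          The following is a rough sketch, in Java, of the ordering logic described above. It assumes that the dependency graph has already been obtained somehow (for example via static analysis) and is given as a map from each component to the set of components that it depends on, and that a runTestsOf() callback runs the test suite of a single component and reports whether it passed; neither of these is part of any existing framework.

              import java.util.*;
              import java.util.function.Predicate;
              import java.util.stream.Collectors;

              public class IncrementalIntegrationTestRunner
              {
                  // graph: for each component, the set of (internal) components that it depends on.
                  // Returns true if all tests passed, false if testing was stopped because of a failure.
                  public static boolean runAll( Map<String,Set<String>> graph, Predicate<String> runTestsOf )
                  {
                      // work on a private copy of the graph, because we will be mutating it.
                      Map<String,Set<String>> remaining = new HashMap<>();
                      graph.forEach( ( component, dependencies ) -> remaining.put( component, new HashSet<>( dependencies ) ) );
                      while( !remaining.isEmpty() )
                      {
                          // step 2: the leaf nodes are the components with no untested dependencies left.
                          List<String> leaves = remaining.entrySet().stream()
                              .filter( entry -> entry.getValue().isEmpty() )
                              .map( Map.Entry::getKey )
                              .sorted()
                              .collect( Collectors.toList() );
                          if( leaves.isEmpty() )
                              throw new IllegalStateException( "cyclic dependencies among: " + remaining.keySet() );
                          boolean anyMalfunction = false;
                          for( String leaf : leaves )
                              anyMalfunction |= !runTestsOf.test( leaf );
                          // step 3: if anything failed, stop as soon as this round is complete.
                          if( anyMalfunction )
                              return false;
                          // step 4: remove the tested leaves; their dependents may now become leaves themselves.
                          leaves.forEach( remaining::remove );
                          remaining.values().forEach( dependencies -> dependencies.removeAll( leaves ) );
                          // step 5: repeat until there are no more nodes left.
                      }
                      return true;
                  }
              }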

          No testing framework that I know of (JUnit, NUnit, etc.) is capable of doing any of the above; for this reason, I have developed a utility called Testana which does exactly that.

          Testana will analyze a system to discover its structure, will analyze modules to discover dependencies and tests, and will run the tests in the right order so as to achieve Incremental Integration Testing. It will also do a few other nice things, like examine timestamps and refrain from running tests whose dependencies have not changed.

          Testana currently supports Java projects under maven, with JUnit-style tests. For more information, see michael.gr - GitHub project: mikenakis-testana.

          What if my dependencies are not discoverable?

          Some very trendy practices of our modern era include:

          • Using scripting languages, where there is no notion of types, and therefore no possibility of discovering dependencies via static analysis.
          • Breaking up systems into separate source code repositories, so there is no single system on which to perform system-wide static analysis to discover dependencies.
          • Incorporating multiple different programming languages in a single system, (following the polyglot craze,) thus hindering system-wide static analysis, since it now needs to be performed across different languages.
          • Making modules interoperate not via normal programmatic interfaces, but instead via various byzantine mechanisms such as REST, whose modus operandi is binding by name, thus making dependencies undiscoverable.

          If you are following any of the above trendy practices, then you cannot programmatically discover dependencies, so you have no way of automating Incremental Integration Testing. Thus, you will have to specify by hand the order in which your tests will run, and you will have to keep maintaining this order by hand.

          Sorry, but ill-advised architectural choices do come with consequences.

          What about performance?

          One might argue that Incremental Integration Testing does not address one very important issue which is very well taken care of by Unit Testing with Mocks, and that issue is performance:

          • When dependencies are replaced with Mocks, the tests tend to be fast.
          • When actual dependencies are integrated, such as file systems, relational database management systems, messaging queues, and what not, the tests can become very slow. 

          Is there anything we can do about this?

          To address the performance issue I recommend the use of Fakes, not Mocks.

          One book that names and describes Fakes, Mocks, etc. is "xUnit Test Patterns: Refactoring Test Code" by Gerard Meszaros, (xunitpatterns.com) though I have read about them from martinfowler.com - TestDouble. In short, a Fake is a module that offers the complete functionality of the real module that it substitutes, (or at any rate the subset of that functionality that we have a use for,) but is more suitable for testing than the real thing, usually by being much more lightweight and much faster than the real thing. Fakes usually achieve this by means of some severe compromise, such as:

          • Having limited capacity.
          • Not being scalable.
          • Not being distributed.
          • Not persisting anything to storage.

          For example:

          • Various in-memory file-system libraries exist for various platforms, which can be used in place of the actual file-systems of those platforms.
          • In Java, HSQLDB and H2 are in-memory databases that can be used in place of an actual RDBMS.
          • In DotNet, EntityFramework allows the creation of an in-memory DbContext.
          • EmbeddedKafka can be used in place of an actual pair of Kafka + Zookeeper instances.
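
          Taking the in-memory database example above, here is a minimal sketch of a test that integrates a real (albeit in-memory) database instead of mocking one; it merely assumes that the H2 driver and JUnit are on the class path:

              import java.sql.*;
              import org.junit.jupiter.api.Test;
              import static org.junit.jupiter.api.Assertions.assertEquals;

              public class InMemoryDatabaseTest
              {
                  @Test
                  public void an_in_memory_database_stands_in_for_the_real_rdbms() throws SQLException
                  {
                      // "jdbc:h2:mem:" yields a throw-away, in-memory database: real SQL behavior, nothing persisted to storage.
                      try( Connection connection = DriverManager.getConnection( "jdbc:h2:mem:testdb" );
                           Statement statement = connection.createStatement() )
                      {
                          statement.execute( "CREATE TABLE customer ( id INT PRIMARY KEY, name VARCHAR(100) )" );
                          statement.execute( "INSERT INTO customer VALUES ( 1, 'ACME' )" );
                          try( ResultSet resultSet = statement.executeQuery( "SELECT name FROM customer WHERE id = 1" ) )
                          {
                              resultSet.next();
                              assertEquals( "ACME", resultSet.getString( "name" ) );
                          }
                      }
                  }
              }

          In an actual Incremental Integration Test the SQL statements would of course be issued by the component under test, which would simply be handed a connection (or a DataSource) pointing at the in-memory database.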

          Note that the terminology is a bit unfortunate: Fakes are in fact a lot less fake than Mocks; Mocks are the ultimate in fakery; Fakes actually support the functionality of the real thing, while the compromises that they make in order to achieve this tend to be irrelevant when testing.

          By supplying a component under test with a Fake instead of a Mock we benefit from great performance, while utilizing a dependency which has already been tested by its creators and can be reasonably assumed to be free of defects. In doing so, we continue to avoid White-Box Testing and we keep Dependency-Induced Uncertainty at a minimum.

          Furthermore, nothing prevents us from having our CI/CD server run the test of each component twice:

          • Once in integration with Fakes
          • Once in integration with the actual dependencies

          This will be slow, but CI/CD servers generally do not mind. The benefit of doing this is that it gives further guarantees that everything works as intended.

          Developing Fakes

          In some cases we may want to create a Fake ourselves, as a substitute of one of our own modules. Not only will this allow dependent components to start their testing as early as possible without the need for Mocks, but also, a non-negligible part of the effort invested in the creation of the Fake will be reusable in the creation of the real thing, while the process of creating the Fake is likely to yield valuable lessons which can guide the creation of the real thing. Thus, any effort that goes into creating a Fake of a certain module represents a much better investment than the effort of creating a multitude of throw-away Mocks for various isolated operations on that module. 

          One might argue that keeping a Fake side-by-side with the real thing may represent a considerable additional maintenance overhead, but in my experience the overhead of doing so is nowhere near the overhead of maintaining a proliferation of mocks for the real thing. 

          • Each time the implementation of the real thing changes without any change to its specification, (such as, for example, when applying a bug fix,) some mocks must be modified, some must even be rewritten, while the Fake usually does not have to be touched at all.
          • When the specification of the real thing changes, both the Mocks have to be rewritten, and the Fake has to be modified, but the beauty of the Fake is that it is a self-contained module which implements a known abstraction, so it is easy to maintain, whereas every single snippet of mocking code is nothing but incidental complexity, and thus hard to maintain.
          • In either case, a single change in the real thing will require at most a single corresponding change in the Fake, whereas if we are using Mocks we invariably have to change a large number of mock snippets scattered throughout the test suites.

          Furthermore, once we start making use of Incremental Integration Testing to free ourselves from the burden of White-Box Testing, new possibilities open up which greatly ease the development of Fakes: it is now possible to write a test for a certain module, and then reuse that test in order to test its Fake. The test can be reused because it is a Black-Box Test: it does not care how the module works internally, so it can test the real thing just as well as the Fake of the real thing. Once we run the test on the real thing, we run the same test on the Fake, and if both pass, then from that moment on we can continue using the Fake in place of the real thing when testing anything that depends on it.
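
          One way of setting this up, sketched below with the Directory interface and the RealDirectory and FakeDirectory classes being purely illustrative, is an abstract, black-box test class that exercises nothing but the interface, plus one trivial subclass per implementation:

              import org.junit.jupiter.api.Test;
              import static org.junit.jupiter.api.Assertions.assertEquals;

              // The black-box test: it knows only the interface, not any implementation.
              public abstract class DirectoryTest
              {
                  protected abstract Directory newDirectory(); // each subclass supplies an implementation

                  @Test
                  public void a_stored_entry_can_be_looked_up()
                  {
                      Directory directory = newDirectory();
                      directory.store( "alice", "alice@example.com" );
                      assertEquals( "alice@example.com", directory.lookup( "alice" ) );
                  }
              }

              // The very same tests run against the real thing...
              class RealDirectoryTest extends DirectoryTest
              {
                  @Override protected Directory newDirectory() { return new RealDirectory(); }
              }

              // ...and against its Fake; once both pass, the Fake can stand in for the real thing.
              class FakeDirectoryTest extends DirectoryTest
              {
                  @Override protected Directory newDirectory() { return new FakeDirectory(); }
              }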

          Finally, if we are using an external component for which no Fake is available, we may wish to create a Fake for it ourselves. First, we write a test suite which exercises the external component, not really looking for defects in it, but instead using its behavior as a reference for writing the tests. Once we have built our test suite so that it passes against the behavior of the external component, we can reuse it against the Fake, and if it also passes, then we have sufficient reason to believe that the behavior of the Fake matches the behavior of the external component. In an ideal world where everyone would be practicing Black-Box Testing, we should even be able to obtain from the creators of the external component the test suite that they have already built for testing their creation, and use it to test our Fake. (And in an even more ideal world, anyone who develops a component for others to use would be shipping it together with its Fake, so that nobody needs to get dirty with its test suite.)

          Benefits

          Incremental Integration Testing has the following benefits:

          • It greatly reduces the effort of writing and maintaining tests, by eliminating the need for mocking code in each test.
          • It allows our tests to engage in Black-Box Testing instead of White-Box Testing. For an in-depth discussion of what is wrong with White-Box Testing, please read michael.gr - White-Box vs. Black-Box Testing.
          • It makes tests more effective and accurate, by eliminating assumptions about the behavior of the real dependencies.
          • It simplifies our testing operations by eliminating the need for two separate testing phases, one for Unit Testing and one for Integration Testing.
          • It is unobtrusive, since it does not dictate how to construct the tests, it only dictates the order in which the tests should be executed.

          Disadvantages (and counter-arguments)

          • It assumes that a component which has been tested is free of defects.
            • Argument: 
              • A well-known caveat of software testing is that it cannot actually prove that software is free from defects, because it necessarily only checks for defects that we have anticipated and tested for. As Edsger W. Dijkstra famously put it, "program testing can be used to show the presence of bugs, but never to show their absence!"
            • Counter-arguments:
              • I am not claiming that once a component has been tested, it has been proven to be free from defects; all I am saying is that it can reasonably be assumed to be free from defects. Incremental Integration Testing is not meant to be a perfect solution; it is meant to be a pragmatic solution.
              • The fact that testing cannot prove the absence of bugs does not mean that everything is futile in this vain world, and that we should abandon all hope in despair: testing might be imperfect, but it is what we can do, and it is in fact what we do, and practical, real-world observations show that it is quite effective.
              • Most importantly: Any defects in an insufficiently tested component will not magically disappear if we mock that component in the tests of its dependents.
                • In this sense, the practice of mocking dependencies can arguably be likened to Ostrich policy. (Ostrich Policy on Wikipedia).
                • On the contrary, continuing to integrate that component in subsequent tests gives us incrementally more opportunities to discover defects in it.
          • It does not completely eliminate Dependency-Induced Uncertainty in testing.
            • Argument:
              • If a certain component has defects which were not detected when it was tested, then these defects will cause Dependency-Induced Uncertainty when testing components that depend on it.
            • Counter-arguments:
              • It is true that Incremental Integration Testing suffers from Dependency-Induced Uncertainty when dependencies have defects despite having already been tested. It is also true that Unit Testing with Mocks does not suffer at all from Dependency-Induced Uncertainty when dependencies have defects; but then again, neither does it detect those defects. For that, it is necessary to always follow a round of Unit Testing with a round of Integration Testing. However, when the malfunction is finally observed during Integration Testing, we are facing the exact same problem that we would have faced if we had done a single round of Incremental Integration Testing instead: a malfunction is being observed which is not due to a defect in the root component of the integration, but instead due to a defect in some unknown dependency. The difference is that Incremental Integration Testing gets us there faster.
              • Let us not forget that the primary goal of software testing is to guarantee that software works as intended, and that the elimination of Dependency-Induced Uncertainty is an important but nonetheless secondary goal. Incremental Integration Testing goes a long way towards reducing Dependency-Induced Uncertainty, but it stops short of completely eliminating it, in exchange for other conveniences, such as making tests far easier to write and maintain. So, it all boils down to whether Unit Testing represents overall more or less convenience than Incremental Integration Testing. I assert that Incremental Integration Testing is unquestionably far more convenient than Unit Testing.
          • It only tests behavior; it does not check what is going on under the hood.
            • Argument:
              • With Unit Testing, you can ensure that a certain module not only produces the right results, but also that it follows an expected sequence of steps to produce those results. With Incremental Integration Testing you cannot observe the steps, you can only check the results. Thus, the internal workings of a component might be slightly wrong, or less than ideal, and you would never know.
            • Counter-argument:
              • This is true, and this is why Incremental Integration Testing might be unsuitable for high-criticality software, where White-Box Testing is the explicit intention, since it is necessary to ensure not only that the software  produces correct results, but also that its internals are working exactly according to plan. However, Incremental Integration Testing is not being proposed as a perfect solution, it is being proposed as a pragmatic solution: the vast majority of software being developed in the whole world is regular, commercial-grade, non-high-criticality software, where Black-Box Testing is appropriate and sufficient, since all that matters is that the requirements are met. Essentially, Incremental Integration Testing represents the realization that in the general case, tests which worry not only about the behavior, but also about the inner workings of a component, constitute over-engineering. For a more in-depth discussion about this, please read michael.gr - White-Box vs. Black-Box Testing.
              • In order to make sure that everything is happening as expected under the hood, you do not have to stipulate in excruciating detail what should be happening, and you do not have to fail the tests at the slightest sign of deviation from what was expected. Another way of ensuring the same thing is to simply:
                • Gain visibility into what is happening under the hood
                • Be notified when something different starts happening
                • Examine what is now different
                • Vouch for the differences being expected.
          For more details about this, see the appendix about Interaction Visibility.
          • It only tests behavior which is exposed by the public interface of a component.
            • Argument:
              • When the component under test invokes a dependency, we regard this to be part of the behavior of that component, therefore the dependency must be mocked so as to verify that the invocation is being made as expected. Another way of expressing this is by proposing that the interface of a component is not limited to the functionality that the component publicly exposes for other components to invoke, but it also includes interactions between the component and its dependencies.
            • Counter-argument:
              • This is a very twisted view of software architecture. The right view is as follows:
                • Interfaces must represent abstraction boundaries.
                • Abstractions must be complete. (Non-partial.)
                • Abstractions must be airtight. (Non-leaky.)
                • Therefore, the public interface of a component must be the only means necessary for eliciting any desired behavior from it, and also the only means necessary for surveying its behavior.
                • Thus, all dependencies must be thought of as internal dependencies; in other words, they must in fact always be strictly private, and interactions between a component and its dependencies may never be thought of as being part of the public behavior of the component.
                • When component A must invoke component B, not because A needs B in order to function, but because the system needs B to be invoked for the system to function, then this fact must be exposed in the public interface of A. This means that we have to pass B to A as a parameter, in the public interface of A. Parameters are not dependencies, so the rule which requires dependencies to be private is upheld.
          The benefit of the above with respect to testing is that all components can at all times be tested as Black-Boxes.
          • It prevents us from picking a single test and running it.
            • Argument:
              • With Unit Testing, we can pick any individual test and run it. With Incremental Integration Testing, running an individual test of a certain component is meaningless unless we first run the tests of the dependencies of that component.
            • Counter-argument:
              • This is only true if the dependencies have changed.
                • If the dependencies have not changed, then you do not have to re-run their tests, you can simply go ahead and run only the tests of the component that has changed.
                • If the dependencies have changed, then you must in fact run their tests first, otherwise the individual test that you picked to run is meaningless.
              • If you are unsure as to exactly what has changed, or what the dependencies are, then consider using a tool like Testana, which figures all this out for you. See michael.gr - GitHub project: mikenakis-testana.
          • It requires additional tools.
            • Argument:
              • Incremental Integration Testing is not supported by any of the popular testing frameworks, which means that in order to start practicing it, new tools are necessary.
              • Since Incremental Integration Testing is brand new, obtaining such tools might be very difficult, if not impossible.
              • Furthermore, such tooling is going to be non-trivial to build, because it has to do advanced stuff like system-wide static analysis.
            • Counter-argument:
              • My intention is to show the way; if people see the way, the tools will come.
              • If you are using Java with maven and JUnit, there is already a tool that you can use, see michael.gr - GitHub project: mikenakis-testana.
              • Even in the absence of tools, it is possible to start experimenting with Incremental Integration Testing by following the poor-man's approach, which consists of simply naming the tests, and the directories in which they reside, in such a way that your existing testing framework will run them in the right order. This approach is described in detail in the corresponding section of this paper.

          Is Unit Testing with Mocks good for anything?

          Unit Testing with Mocks is useful in a few scenarios that I can think of:

          • Unit Testing with Mocks is useful if we want to start testing a component while one or more of its dependencies are not ready yet for integration because they are still in development, and no Fakes of them are available either. Having said that, I must add that once the dependencies (or Fakes thereof) become available, it is best to start using them, and to unceremoniously throw away the Mocks.
          • Unit Testing with Mocks is useful in high-criticality software, where the specifications of software components are detailed in official documents that very rarely change, every round of Unit Testing is followed by a round of Integration Testing, and the goal usually is to ensure not only that a component exhibits the correct behavior, but also that it interacts with its dependencies exactly as expected. In such cases, Mocks are useful for simulating the behavior of the dependencies strictly as described in their specification documents, regardless of the possibility that the tests of those dependencies may have failed to detect defects in them. (This is somewhat paranoid, but when testing high-criticality software, paranoia is the order of the day.) Having said that, I must add that even in the case of high-criticality software, the goal of ensuring that a component interacts with its dependencies exactly as expected does not require stipulating these interactions in testing code; it can be achieved in a much more cost-effective way by means of Interaction Visibility; see the related appendix.
          • Unit Testing with Mocks is useful when the developers of a certain component do not want the quality and thoroughness of their work to depend on things that they have no control over, such as the time of delivery of dependencies, the quality of their implementation, and the quality of their testing. (In other words, when the developers of a certain component do not trust the developers of its dependencies.) With the use of Mocks we can claim that our module is complete and fully tested, based on nothing but the specification of its dependencies, and we can claim that it should work fine in integration with its dependencies when they happen to be delivered, and if they happen to work according to spec.

          Conclusion

          Unit Testing was invented in order to eliminate Dependency-Induced Uncertainty in Testing, but as we have shown, it is laborious, complicated, over-specified, presumptuous, and constitutes White-Box Testing. Incremental Integration Testing is a pragmatic approach for non-high-criticality software which minimizes Dependency-Induced Uncertainty without using mocks, and in so doing it greatly reduces the effort of developing and maintaining tests, it avoids presumptuousness, and it allows us to remain strictly within the realm of Black-Box Testing.

          Appendix: Interaction Visibility

          This appendix is work in progress. It gives a preliminary description of a mechanism which I am still in the process of perfecting and which aims to make Incremental Integration Testing also suitable (and preferable) for the development of high-criticality software.

          In high-criticality software we usually want to ensure not only that the Component Under Test behaves as expected, but also that it interacts with its dependencies as expected. For this reason, some people regard the use of Mocks as justified in the development of high-criticality software, because it allows us to stipulate with great precision what should be happening under the hood.

          However, in order to ensure that a component interacts with its dependencies as expected it is not in fact necessary to stipulate in code what these interactions should be, nor is it necessary to fail the tests at the slightest sign of deviation from what was expected; all we need is visibility into the interactions, so that we can visually examine them and tell whether they are what we expected them to be. 

          When we revise a component, and as a result of this revision the component now interacts with its dependencies in a slightly different way, it is extremely counter-productive to have the tests fail and to have to go fix all the mocks so that they stop expecting the old interactions and start expecting the new interactions. (Besides, what people tend to achieve this way is tests that "test around" bugs, meaning that they specifically pass if the bugs are in place.) All we need is the ability to detect the fact that the interactions have changed as a result of a revision in the production code, and to be able to see precisely what has changed, so that we can determine whether the observed changes are expected according to the revision. If they are not, then we have to keep working on our revision; but if they are, then we are actually done without the need to fix any tests!

          To gain visibility into the interactions between a component and its dependencies we can use a mechanism that I call Interaction Snooping.

          • In a traditional Unit Test with Mocks, the Component Under Test is wired to Mocks of its dependencies.
          • In an Incremental Integration Test, the Component Under Test is wired to its actual dependencies or fakes thereof.
          • In an Incremental Integration Test with Interaction Snooping, the Component Under Test is also wired to its actual dependencies or fakes thereof, but on each wire we now interject an Interaction Snooper.

          An Interaction Snooper is a decorator of the interface represented by the wire, so its existence is completely transparent both to the Component Under Test and to the dependency that it is wired to. (Provided that we have the luxury of ignoring some inevitable deviation in timing.) The Interaction Snooper intercepts every call made by the Component Under Test to the dependency, and records information about the call before forwarding the call to the dependency. Note that in languages like Java and C# which support reflection and intermediate code generation, Interaction Snoopers do not have to be written by programmers; they can be generated automatically.
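
          In Java, for instance, a minimal sketch of such an automatically generated Interaction Snooper can be built on java.lang.reflect.Proxy; the snoop() helper below is illustrative and not part of any existing library:

              import java.lang.reflect.InvocationHandler;
              import java.lang.reflect.Proxy;
              import java.util.Arrays;

              public final class InteractionSnooper
              {
                  // Wraps a dependency in a decorator of the given interface, which records each call
                  // (method name, arguments, result) into snoopLog before forwarding it to the real dependency.
                  @SuppressWarnings( "unchecked" )
                  public static <T> T snoop( Class<T> interfaceType, T dependency, StringBuilder snoopLog )
                  {
                      InvocationHandler handler = ( proxy, method, arguments ) ->
                      {
                          Object result = method.invoke( dependency, arguments ); // forward the call to the real dependency
                          snoopLog.append( method.getName() )
                              .append( ' ' ).append( Arrays.toString( arguments ) )
                              .append( " -> " ).append( result ).append( '\n' );
                          return result;
                      };
                      return (T)Proxy.newProxyInstance( interfaceType.getClassLoader(), new Class<?>[] { interfaceType }, handler );
                  }
              }

          A real implementation would also record thrown exceptions and would write the recorded information to the Snoop File described below, but the principle is the same: the Component Under Test is handed snoop( Dependency.class, realDependency, log ) instead of realDependency, and neither side is any the wiser.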

          The information recorded by an Interaction Snooper includes the name of the function that was called, a serialization of each parameter that was passed, and a serialization of the result that was returned. This information gets saved in a Snoop File. A Snoop file is a text file, and it is saved in the source code tree, right next to the test class that generated it. So, for example, if we have `SuchAndSuchTest.java`, right next to it there will be a `SuchAndSuchTest.snoop`. 
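
          Purely as a hypothetical illustration (the precise format is an implementation detail), the contents of such a Snoop File might look something like this:

              rateFor( "NL" ) -> 0.21
              store( "alice", "alice@example.com" ) -> void
              lookup( "alice" ) -> "alice@example.com"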

          By storing the Snoop Files in the source tree, we leverage our Version Control System and our Integrated Development Environment to take care of the rest of the workflow:

          • When a revision in the code causes a change in some interaction, we will take notice because our Version Control System will show the corresponding Snoop File as modified and in need of committing.
          • By asking our Integrated Development Environment to show us a "diff" between the current snoop file and the original version, we can see precisely what has changed without having to pore through the entire snoop file.
          • If the observed interactions are not exactly what we expected them to be according to the revisions we just made in the production code, we keep working on our revisions.
          • When we are confident that the changes in the interactions are expected according to the revisions that we made, we commit our revisions, along with the Snoop Files.
          • If our commit undergoes code review, then the reviewer will also be able to see both the changes in the production code, and the corresponding changes in the Snoop Files, and decide whether they are as expected.

          Thus, even high-criticality systems can gain complete visibility of what is going on under the hood without the need for Mocks.

