Intro

You should run your integration tests repeatedly every night to catch those bugs that only happen every once in a while.

That is a recommendation you hear now and again in software engineering circles. But to some it can simultaneously feel inadequate and like total overkill. One can reasonably ask, "Our tests already pass. Is it really worth the engineering effort to set them up to run repeatedly on every single commit? Even if we do, we still don't have any guarantee that we've caught every bug, so what exactly do we stand to gain here? And why the focus on integration tests over unit tests anyway?"

What are we even doing here?

Terminology

OK, first some definitions. I use the term unit test as clarified in Ian Cooper's talk "TDD, Where Did It All Go Wrong". A "unit" is essentially the smallest user-visible behavior, which in some contexts is a single API call. He argues, and I'm inclined to agree, that tests below this level are pointless. An integration test, then, is a test of multiple units, e.g. a series of API calls.

Not everyone agrees with Mr. Cooper that tests below the unit level are pointless. I have heard it argued that sometimes you should test below the level of the user-visible surface because some components are buried too deep in the system. Writing out the user interaction that would reveal such a component's behavior is so complicated that, in the interest of practicality, you should test the component directly, mocking out other internal interfaces if you have to. That sounds to me like it's probably a symptom of poor system design, but I'm sure there are some cases where it's unavoidable. The engineer who made this argument was a senior engineer at JPL, after all, so he's seen some ghoulish problems.

There is of course much debate about the terminology of unit and integration tests, but I will leave that for the footnotes. [1] [2]

Determinism

Generally speaking, as a test covers more of the system, the test will become less deterministic [3]. A single API call might have a handler that runs in a single thread, writes to a database, then returns a success status. And that test will run the same way 100% of the time. But if you then read that value back in the test and aren't syncing the data right in the application, you might sometimes get back a stale value. Or, a command to a single microcontroller in a distributed system might run totally deterministically. But if you simultaneously command a second microcontroller connected to the first, messages between the two microcontrollers could be dropped when tasks are scheduled just so.

To make this observation more precise: if a test exercises code running in multiple contexts (threads, processes, machines), then scheduling is a stochastic input and the test is not deterministic, which is to say that race conditions are possible. Syscalls can take different amounts of time, or a memory access can miss the cache and cause all the threads to be scheduled differently.
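
To make "scheduling is an input" concrete, here is a minimal sketch in Python. It is a toy, not a model of any real scheduler: the two "threads" are generators, and the names worker and run are made up for this illustration. The schedule is passed in explicitly, so the same test input gives different results under different schedules.

```python
from itertools import permutations


def worker(shared):
    """One toy 'thread': a non-atomic increment split into two steps."""
    local = shared["counter"]       # step 1: read
    yield                           # a context switch may happen here
    shared["counter"] = local + 1   # step 2: write back


def run(schedule):
    """Step two workers in the order given by `schedule`.

    `schedule` is a sequence of worker indices such as (0, 0, 1, 1).
    Each worker needs exactly two steps to finish.
    """
    shared = {"counter": 0}
    workers = [worker(shared), worker(shared)]
    for idx in schedule:
        next(workers[idx], None)    # advance that worker by one step
    return shared["counter"]


# Every distinct interleaving of two steps per worker.
all_schedules = sorted(set(permutations((0, 0, 1, 1))))
results = {schedule: run(schedule) for schedule in all_schedules}

for schedule, result in results.items():
    print(schedule, "->", result)
```

Of the six possible interleavings, only the two that run one worker to completion before starting the other end with a counter of 2. The other four lose an update. A real test run picks one of these schedules for you, and you don't get to see which.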

This is a characteristic difference between integration tests and unit tests: unit tests tend to be deterministic, and integration tests tend to be flaky.

Now, ideally your program should behave right under all possible schedules, so that the nondeterminism inside the application resolves deterministically to a correct outcome. And then all your integration tests would always pass. But of course, this is planet Earth, and they won't. So what is to be done?

Repeated Test Runs

Here's where the idea of repeatedly running all your integration tests overnight comes in. And maybe your unit tests too if you're not confident in their determinism. By doing that, you'll at least catch some bugs that happen randomly.

But anyone who has worked in a setting with nightly test runs knows that's not the whole story. It happens all the time that you have a test that flakes one time in a thousand, you have no idea why, and you can't seem to reproduce it on your machine where you could gather more information. This is why you run the tests nightly and on every commit: so that if a test starts flaking, you know exactly which change caused it.

Sometimes that's still not enough information to find the bug. A change in one part of the system can cause a totally unrelated test to start flaking. Then you have to decide what to do next, which is where the statistics produced by the nightly run are really nice. You don't just learn that there is an input that can reveal a bug, you get that input and an associated probability distribution of possible outcomes.

If the test is a simple pass/fail, you just get a Bernoulli distribution, and you can only know that it flakes e.g. 3% of the time. And if that's tolerable and you're in a rush, you can decide, on the basis of that evidence, to just leave the bug. Some testing frameworks have a way to mark a test as flaky. But if you have good error messages, e.g. "Expected 100 bytes, received 97", rather than "Received wrong number of bytes", you get a more informative probability distribution.
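
As a rough sketch of what those statistics could look like, the Python below runs a test many times and tallies the outcomes. Everything in it is an assumption for illustration: flaky_radio_test is a made-up stand-in with a hard-coded 3% flake rate, and nightly and wilson_interval are hypothetical helper names. The Wilson score interval is my addition; it puts error bars on the flake rate so you can tell how much to trust a number like "3%" after a given number of runs.

```python
import math
import random
from collections import Counter


def flaky_radio_test(rng):
    """Hypothetical flaky test: returns None on success, else a message."""
    expected = 100
    received = expected
    if rng.random() < 0.03:                      # assumed 3% flake rate
        received -= rng.choice([1, 2, 3, 16])    # assumed drop sizes
    if received != expected:
        return f"Expected {expected} bytes, received {received}"
    return None


def nightly(test, runs, seed=0):
    """Run `test` repeatedly and count each distinct outcome."""
    rng = random.Random(seed)
    outcomes = Counter()
    for _ in range(runs):
        outcomes[test(rng) or "pass"] += 1
    return outcomes


def wilson_interval(failures, runs, z=1.96):
    """Approximate 95% confidence interval for the true failure rate."""
    p = failures / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return center - half, center + half


outcomes = nightly(flaky_radio_test, runs=1000)
failures = sum(n for outcome, n in outcomes.items() if outcome != "pass")
low, high = wilson_interval(failures, 1000)

print(f"flake rate between {low:.1%} and {high:.1%}")
for outcome, count in outcomes.most_common():
    print(f"{count:5d}  {outcome}")
```

The pass/fail count alone gives you the Bernoulli parameter and its uncertainty. The per-message histogram is the more informative distribution: distinct clusters of messages are a hint that there may be more than one failure mode.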

In fact, I'm using this sort of information now as I'm dealing with a flaky integration test on my Master's thesis. The distribution of the number of bytes dropped over a radio link seems to be bimodal from my own observations, but after a nightly run I'll know for sure. And if it is, I'll be on the lookout for the possibility that there are two bugs here rather than one.

You can also have your program log the heck out of everything it does. Either manually or via a tracing framework, like an RTOS with SystemView support. That will help you get some of the bugs, at the cost of having to set up the tracing in the first place. But some bugs will be resistant to this information, and others will disappear as soon as you add logging that changes the timing of things.

The bottom line is, nightly runs are not a silver bullet. There's a tradeoff where you have to spend more engineering effort on tracing and good tests to find and fix more bugs. And it's always going to be a guessing game where you can't know for sure the true cause of a failed run because you can't know exactly what happened on that run.

Unless, you could...?

Deterministic Testing

Sounds crazy, right? But people find ways of turning nondeterministic tests into deterministic tests. One half-way solution, building on the idea of "log the heck out of everything", is to run your nondeterministic test and record all the stochastic inputs to it (e.g. scheduler and system call outcomes), so that you can replay the sequence of inputs later and reproduce the bug. You can also do that at the level of the virtual machine or, for embedded systems, with more ad-hoc solutions at the HAL or peripheral level.
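
Here is a toy version of record-and-replay in Python, under heavy assumptions: the "scheduler" is an in-process loop over generators rather than an OS scheduler, and increment, run, random_chooser, replay_chooser, and find_failure are names invented for this sketch. The only stochastic input is which worker runs next, so recording that list of choices is enough to reproduce a failure exactly.

```python
import random


def increment(shared):
    """Toy worker: a non-atomic increment with a preemption point."""
    local = shared["counter"]
    yield
    shared["counter"] = local + 1


def run(choose):
    """Run two workers; `choose(runnable)` picks which one steps next.

    Returns the final counter and the recorded trace of choices.
    """
    shared = {"counter": 0}
    workers = {0: increment(shared), 1: increment(shared)}
    trace = []
    while workers:
        idx = choose(sorted(workers))
        trace.append(idx)               # record the stochastic input
        try:
            next(workers[idx])
        except StopIteration:
            del workers[idx]            # this worker has finished
    return shared["counter"], trace


def random_chooser(rng):
    """Nondeterministic mode: pick any runnable worker at random."""
    return lambda runnable: rng.choice(runnable)


def replay_chooser(trace):
    """Deterministic mode: feed back a previously recorded trace."""
    recorded = iter(trace)
    return lambda runnable: next(recorded)


def find_failure(max_runs=1000):
    """Run with random schedules until the counter comes out wrong."""
    for seed in range(max_runs):
        result, trace = run(random_chooser(random.Random(seed)))
        if result != 2:
            return trace
    return None


failing_trace = find_failure()
print("recorded failing schedule:", failing_trace)
print("replayed result:", run(replay_chooser(failing_trace))[0])
```

Replaying the recorded trace gives the same wrong answer every time, which is the whole point: the flaky run has become a deterministic test case you can attach a debugger to.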

If you want to go all the way, you can mock those nondeterministic components out and have total control over your test. For certain cases with a small state space, you can manually enumerate all the interesting scenarios, for example flipping a few random bits in a transmission between two MCUs to see that your ECC is working right. There are open-source platforms that make this easier.
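
For the bit-flip example, the state space really is small enough to enumerate. The sketch below assumes a Hamming(7,4) code purely for illustration (I'm not claiming that is the ECC on any particular link), mocks the channel as "flip these bit positions", and checks every data word against every possible corruption. The function names are hypothetical.

```python
from itertools import combinations, product


def encode(data):
    """Encode 4 data bits as a 7-bit Hamming(7,4) codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4    # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4    # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4    # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def count_failures(num_flips):
    """Try every data word with every set of `num_flips` flipped bits."""
    failures = 0
    for data in product([0, 1], repeat=4):
        codeword = encode(data)
        for positions in combinations(range(7), num_flips):
            corrupted = list(codeword)
            for pos in positions:
                corrupted[pos] ^= 1   # the mocked-out channel flips this bit
            if decode(corrupted) != list(data):
                failures += 1
    return failures


print("failures with 1 flipped bit: ", count_failures(1))
print("failures with 2 flipped bits:", count_failures(2))
```

Because the channel is mocked, the test covers all 112 single-bit corruptions exhaustively instead of hoping the radio misbehaves during a nightly run. Running it with two flips also documents where the code stops helping, since this code can only correct one error per codeword.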

People have mocked out the nondeterministic components at the level of the pthreads API, and even the hardware itself via the hypervisor. Interestingly, as far as I know, no one has mocked out the whole Linux syscall set.

You'll notice in reading about deterministic testing that "fuzzing" and "state-space exploration" always factor in too. That's because once you have full control over the scheduler, the input required to fully define a schedule, I/O, and whatever else, is so complex and contingent that you can't get anywhere manually. You need machine-generated input, which is where fuzzing comes in.

Fuzzing and State-space Exploration

Fuzzing is just finding program inputs that make it crash. Or, in our case, for the test to fail. When we made the test deterministic, we really just added another set of program arguments determining the nondeterministic component. That is what we fuzz on.

There are all sorts of ways of choosing the inputs when we're fuzzing. We can do it in a black-box/open-loop fashion, where we choose totally random inputs, or choose randomly from among inputs that seem especially likely to trigger edge cases, like pseudo-randomly adding one-second sleeps to the schedule.

However, you can be smarter about what portion of the state space you explore if you close the loop. You can come up with certain metrics that are a "hint" to the fuzzer, so that you can say programmatically that input X is exercising the program under test in a substantially different way than the last 10 inputs, and the fuzzer should try more like it. This is called "gray-box fuzzing". You start with a "seed" and then produce mutations on it, guided by your metrics. AFL is one well-known fuzzer that works this way.
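
The feedback loop can be sketched in a few lines of Python. This is a toy and not how AFL or any real fuzzer is implemented: target is a made-up program under test that reports which branches it reached, and mutate and fuzz are hypothetical names. The guidance metric here is the simplest one, "did this input reach a branch nobody has reached before?"

```python
import random


def target(data, coverage):
    """Hypothetical program under test.

    Adds the ids of the branches it reaches to `coverage` and returns
    True if the input triggers the (pretend) bug.
    """
    if len(data) > 0 and data[0] == ord("r"):
        coverage.add(1)
        if len(data) > 1 and data[1] == ord("a"):
            coverage.add(2)
            if len(data) > 2 and data[2] == ord("c"):
                coverage.add(3)
                if len(data) > 3 and data[3] == ord("e"):
                    return True
    return False


def mutate(data, rng):
    """Overwrite one randomly chosen byte with a random value."""
    mutated = bytearray(data)
    mutated[rng.randrange(len(mutated))] = rng.randrange(256)
    return bytes(mutated)


def fuzz(seed_input, max_iters, rng):
    """Coverage-guided loop: keep inputs that reach new branches."""
    corpus = [seed_input]
    seen = set()
    for iteration in range(1, max_iters + 1):
        candidate = mutate(rng.choice(corpus), rng)
        coverage = set()
        if target(candidate, coverage):
            return candidate, iteration
        if not coverage <= seen:       # reached a branch we hadn't seen
            seen |= coverage
            corpus.append(candidate)   # mutate this one further later on
    return None, max_iters


found, iterations = fuzz(b"\x00" * 4, 500_000, random.Random(0))
print(f"found {found!r} after {iterations} iterations")
```

A black-box fuzzer would have to guess all four bytes at once, a one-in-256^4 shot per attempt. With the coverage hint, each byte is found separately and kept in the corpus, so the search typically finishes in thousands of iterations rather than billions.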

Many guidance metrics are based on code coverage or the call graph. One metric came up in this conversation between Yaron Minsky and Will Wilson which I thought was interesting. The fuzzer can prioritize cold paths through the callgraph. Meaning, if a function usually takes the true branch of an if statement, you should really try to see what happens in the false branch. Those paths are not being explored as much, and developers might not have thought them through. Or, you can look at function arguments. If a function normally takes an integer less than 100, but it's passed 1000 at some point, that's probably an interesting code path.

Rather than being clever about how you mutate the seed, you can be clever about what seed you start at. You can record the real path that the system takes in one run and use that as a starting point on which to generate mutations so that the inputs are more realistic. That's done for fuzzing on normal program inputs too.

Finally, you can constrain the acceptable path of your program explicitly by specifying function properties (e.g. arg x < 100). You can enforce them either by putting asserts everywhere like LLVM does, or via more specialized methods. It's even possible to prove them sometimes via refinement types or dependent types. If you're super crazy, you can bust out Lean or try to copy seL4. Minsky has this nice idea of "brittleness" as something desired, so that when your system is wrong by a little it's wrong by a lot and you can catch bugs easier. The more properties you specify, the more brittle your system. But we're already way in the weeds here.

Conclusion

We're in the weeds because hardly anyone uses totally deterministic testing, property-based testing, or formal methods. These techniques are still pretty academic. Why? They're a lot of work. So why do we care about them? Besides being really fun at parties?

It's good to understand deterministic testing so that we have a good mental picture of what we're doing and what we're not doing when we run integration tests, one-off or nightly. We're only sampling from a probability distribution. And if you are aware that there are particular sources of nondeterminism in your system (i.e. the stochastic input which is combined with the original test input to determine program output), you have a useful framework for describing your hypotheses about the problem. It helps you say, "I think the timing of the button press has nothing to do with the problem, so I can forget it for now". Or, "I think that the exact timing of when the workqueue is scheduled has something to do with the problem, so I should record every time it runs".

I got to thinking about the connection between nightly test runs and deterministic testing after listening to that conversation between Minsky and Wilson and then wrestling with that flaky integration test from my Master's thesis I mentioned earlier. I realized I should run this flaky test like 100 times and get a probability distribution before setting the AI loose on fixing it. Then I had an a-ha about precisely what nightly test runs are good for: offering a compromise between the practicality of nondeterministic integration tests and the informativeness of deterministic integration tests.

Footnotes

[1] This engineer actually called tests below the user-visible level unit tests and when he used the term integration test he lumped into the term both the tests I call unit tests and the tests I call integration tests.

I don't even really have a name for these tests of non-user visible behavior (e.g. testing a function which is a helper for the API request handler function). But I think I should. Maybe "ephemeral test"? The term should be something that communicates that the test could become not applicable after a refactor, that encourages its eventual removal and emphasizes that the test is a bit of a hack.

[2] In some frameworks, like Cargo for Rust, the distinction between integration test and unit test is made by what testing facility is used and where in the workspace the test code sits. User visibility doesn't even factor in.

[3] What do we mean by "deterministic" vs "non-deterministic" tests? The universe is deterministic (excepting miracles), so aren't all tests deterministic? Well, yes. But a flaky integration test is non-deterministic with respect to the arguments we pass in. You can look at it as a DFA with a starting state and then two input streams, one that we pass in, and one that is passed in by circumstance. The test is deterministic with respect to the Cartesian product of the two inputs, but if we don't know the circumstantial input, we have to consider all possible circumstantial inputs to fully describe the system. So we have to consider a bunch of DFAs in parallel. And what is that but an NFA?

Please send comments to blogger-jack@pearson.onl.