Musings on the impossibility of testing
I'm always a bit flabbergasted when I see how people test their code (especially with unit tests). Often I can't figure out a cohesive set of rules based on which the tests are written.
Testing seems to work from a pragmatic perspective; we have evidence that it works. But different people define "testing" in different ways, and I'm awfully afraid that what some people refer to as "testing" is completely unrelated to that unrigorous but pragmatic practice that saves us from production bugs.
I - A contrived example
To showcase my arbitrary formalism as it relates to testing, I think it's worthwhile to look at some cases:
1)
# Testing a multiplication function
func multiply(int a, int b) -> int

func test_1():
    assert 2*5 == multiply(2,5)
    assert 10*12 == multiply(12,10)

func test_2():
    assert 10 == multiply(2,5)
    assert 120 == multiply(12,10)

func test_3():
    assert type(multiply(2,5)) == int
    assert type(multiply) == func(int,int)->int
2)
# Testing an image classification algorithm
func classify(arr[int] image) -> str

func test_1():
    assert img_classification_nn(to_arr('cat.jpg')) == classify(to_arr('cat.jpg'))
    assert img_classification_nn(to_arr('dog.jpg')) == classify(to_arr('dog.jpg'))

func test_2():
    assert 'cat' == classify(to_arr('cat.jpg'))
    assert 'dog' == classify(to_arr('dog.jpg'))

func test_3():
    possible_targets = ['cat','cheetah','tiger','dog','wolf','coyote','other']
    assert classify(to_arr('rand_img_1.jpg')) in possible_targets
    assert classify(to_arr('rand_img_2.jpg')) in possible_targets
For both examples, test_1 obviously makes no sense. You're testing the functionality of a function using another function that does the same thing. If the function used to assert correct functionality is better (i.e. if * is better than multiply, or img_classification_nn is better than classify), then one should simply use that function instead. If they are equally flawed, then the tests are redundant.
These kinds of tests are a theatrical performance meant to build up an illusion of quality. They're usually spotted in enterprise-grade codebases and in teams that harp on about TDD a lot. I'll classify these as redundant checks.
Next, test_2 is more interesting: it compares the result of a function with what a human would expect that result to be. Here I would argue the test makes sense for the classify function but not for the multiply function.
Computers are good at multiplying; humans are horrible at it. So testing this function against an arbitrary number of human answers is much more likely to reveal flaws in the human's answers than in the computer's. Since the function is very simple, the time would probably be better spent reviewing the actual code to make sure it's error-free.
On the other hand, computers aren't very good at classifying images (though they are getting there), but humans are spot on. So using a human's answers to validate a vision algorithm makes perfect sense.
We'll call these types of tests human validations.
Finally, we get to test_3, which superficially seems less flawed than the other two. It's more of a sanity check than a test: we aren't checking the actual return value of the function in a given case, we are instead checking a higher-level behavior.
These tests are much more common in the codebases of dynamic languages, and for good reason. In our first example, we are just validating the signature of a function, something a compiler will implicitly do for us in any statically typed language.
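For illustration, here's a minimal sketch (in Python, assuming a static checker such as mypy runs as part of the build) of how that signature guarantee moves out of the test suite and into the toolchain:

# With type annotations, a checker like mypy enforces the int -> int
# contract at build time, making test_3's runtime type assertions redundant.
def multiply(a: int, b: int) -> int:
    return a * b

# mypy would reject this call at check time, no test needed:
# multiply("2", 5)  # error: argument has incompatible type "str"; expected "int"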
In the second example, we are instead working around a limitation of our type system. We are checking whether the result of our function (a string) falls within 7 possible values (the things our function should classify images as), something we could check implicitly by using a language that supports enum types and writing classify such that it returns an enum of those 7 classes.
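A rough sketch of that alternative in Python, using enum and a static checker; the Label members are just placeholders for the 7 classes above:

from enum import Enum

class Label(Enum):
    CAT = "cat"
    CHEETAH = "cheetah"
    TIGER = "tiger"
    DOG = "dog"
    WOLF = "wolf"
    COYOTE = "coyote"
    OTHER = "other"

def classify(image: list[int]) -> Label:
    # model inference would go here; placeholder return for the sketch
    return Label.OTHER

# Any code path returning something outside Label is rejected by the
# checker, so the "is the result one of the 7 classes?" test disappears.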
Basically, these kinds of tests can be useful, but they cover the kind of thing a compiler can handle automatically; when they are present, they usually indicate a mismatch between the project and the language the programmer is using for it. We'll call these compiler rules.
Obviously, this system is fairly arbitrary; it's just a conclusion I reached after looking at various codebases and talking with various people about their tests, but I find that it works quite well.
To re-iterate, we have:
- redundant checks
- human validation
- compiler rules
II - Compiler rules
Compiler rules are great, but it's a bad sign when they leak into your testing.
Some amount of compiler rules leaking into testing can't be helped. Until the advent of Rust and/or well-performing libraries for safe multi-threading, testing thread safety would have been a requirement in many codebases. Similarly, before modern type systems and allocation techniques, various now-pointless memory checks would have fared well within a codebase (and, I assume, still do on some embedded devices).
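As a concrete illustration, here's the kind of thread-safety check I have in mind, sketched in Python with a hypothetical Counter class; the guarantee it probes is exactly the one Rust's ownership rules enforce at compile time:

import threading

class Counter:
    # A lock-guarded counter; the test below probes the guarantee that
    # Rust's type system would enforce statically.
    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        with self._lock:
            self._value += 1

    @property
    def value(self) -> int:
        return self._value

def test_thread_safety():
    counter = Counter()

    def worker():
        for _ in range(10_000):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # A lost update would make this fail; the unsynchronized equivalent
    # simply wouldn't compile in Rust.
    assert counter.value == 8 * 10_000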
Even so, switching to more modern languages and compilers is often a much harder task than just writing some tests, but I think these tests should be viewed as a cautionary sign against the language currently in use. If they only cover a small but critical percentage of the codebase, everything is fine; if they cover most of it, that's an indicator of using the wrong compile-time tooling.
The one difficult thing about these tests might be spotting them: an inexperienced team might be writing loads of them without realizing there are better alternatives to the compile-time toolchain they are using, alternatives which would remove the need for them.
III - Redundant checks
Redundant checks are often pointless busywork, but they can serve a valuable role in some cases.
The two main scenarios that come to mind are as follows:
Given two versions of the same function, one well-proven and one "experimental", where the "experimental" function has some benefits in terms of performance or generalizability, it seems reasonable to test the "experimental" function using the well-proven approach. Still, the overhead here is so severe I find it difficult to think of a real-world example where this applies.
Given two versions of the same function, one in a "new" codebase and one in an "old" codebase, it makes sense to test the "new" version against the "old" version. In this sense, going back to compiler rules, redundant checks can be useful in the process of refactoring, especially for major changes like switching the language or the core framework upon which the project is built. However, this seems useful only as a "temporary" measure rather than a permanent fixture of a project.
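A sketch of what such a temporary cross-check might look like, with hypothetical legacy_parse_amount / new_parse_amount functions standing in for the "old" and "new" implementations:

import random

# Hypothetical pair: a battle-tested legacy implementation and a rewrite.
def legacy_parse_amount(s: str) -> int:
    return int(s.replace(",", ""))

def new_parse_amount(s: str) -> int:
    return int("".join(ch for ch in s if ch.isdigit()))

def test_new_matches_legacy():
    # Temporary cross-check for the migration; delete once the legacy
    # implementation is retired.
    for _ in range(1_000):
        amount = random.randint(0, 10**9)
        formatted = f"{amount:,}"  # e.g. "1,234,567"
        assert new_parse_amount(formatted) == legacy_parse_amount(formatted)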
That being said, I'd wager that most tests falling in this category are, as mentioned before, there only as "filler", indicative of a much bigger underlying problem: a team that engages in busywork. If I ever accepted such a test, I would require plenty of accompanying comments explaining why it exists and when it can be removed.
IV - Human validation
Human validation is the "sanest" type of testing one can perform, in that it can't be replaced by good practices or good languages. The problem with human validation comes from the fact that, in some cases, human intelligence can't solve the problems the software is designed to solve, or can't cover all the edge cases the software will encounter.
Take, for example, banking software. It's fairly easy to imagine what should happen with a piece of banking software that handles dozens of customers, hundreds of transactions, and a few regulatory restrictions. All the logic there can be written down into tests that validate our software. However, the software itself must scale to millions of customers, trillions of transactions, and thousands of regulatory restrictions, something a human mind can't comprehend coherently enough to write the tests.
Still, banking software is an easy example, because the various components might be testable and their limited logic might be fully comprehensible by a human, even accounting for all the edge cases. This is because the inputs to the software are very well defined: there are only so many things one can do with a banking API, a finite space which a human mind can explore exhaustively when it's properly broken down.
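To make that concrete, here's a sketch of the kind of human validation that works for such a component, using a hypothetical (and deliberately trivial) overdraft-fee rule:

# Hypothetical component: a flat overdraft fee, small enough for a human
# to enumerate the interesting cases by hand.
def overdraft_fee(balance: int, withdrawal: int) -> int:
    return 25 if withdrawal > balance else 0

def test_overdraft_fee():
    assert overdraft_fee(100, 50) == 0    # plenty of funds
    assert overdraft_fee(100, 100) == 0   # exact balance, no overdraft
    assert overdraft_fee(100, 101) == 25  # overdrawn by one unit
    assert overdraft_fee(0, 1) == 25      # empty account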
One real problem arises when we have software with a very broad and/or poorly defined input and/or output space. A few examples of these would be:
- Scrapers that distill information from a broad range of websites (e.g. the ones used by a search engine)
- Machine learning algorithms meant to work on a broad range of data (e.g. a decision tree or a gradient boosting classifier implementation in a library like sklearn)
- Any software that has to work well with user-provided code extensions (e.g. think of a video game like Skyrim that has to support mods)
- Simulation software (e.g. physics simulations)
- Creative software (e.g. multimedia editing & creation, game engines)
- Compilers
One can separate these into components with input-output spaces that can be easily comprehended by the human mind, but some components that suffer from the above problems are bound to remain. It also introduces the problem of separating the code in such a way that components can be easily tested, rather than with refactoring, extension, speed, or readability as the main concern.
V - Human validation and cost
Assuming that we've designed our software such that it has many human-comprehensible components, the issue of cost remains. "Human-comprehensible" is a vague term; there are many things which, given enough time, a person could write exhaustive tests for, but in practice this often takes more time than we can allocate. Furthermore, in certain situations, even with a limited input & output space, it might still be easier to write the "generative" logic (the one that maps inputs to outputs) than to come up with the mappings ourselves.
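A small sketch of that trap, using a leap-year function as a stand-in: the hand-picked table is genuine human validation, while the "exhaustive" version quietly re-derives the rule and collapses into a redundant check:

def is_leap(year: int) -> bool:
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# Human validation: a hand-picked table of cases we are sure about.
def test_leap_by_hand():
    assert is_leap(2000)      # divisible by 400
    assert not is_leap(1900)  # divisible by 100 but not 400
    assert is_leap(1996)
    assert not is_leap(2019)

# The tempting alternative: "exhaustive" coverage whose expected values
# come from re-deriving the same rule, i.e. a redundant check in disguise.
def test_leap_generatively():
    for year in range(1, 3000):
        expected = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
        assert is_leap(year) == expected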
In certain cases, exhaustive validation can be done, but it can be prohibitively expensive. Banks are once again a good example of this being done in practice. So much is riding on their (relatively simple) software that they can actually pay for this type of exhaustive validation for every component, but even so, it's fairly easy for catastrophic failures to happen (e.g. HSBC failures in 2019 & 2016).
At most, one can break down the software into components such that a fraction of them have an input & output space limited enough for a human to test. But for this to be worthwhile, the "testable" components ought to be among the most critical and/or failure-prone bits of the software, otherwise we aren't doing a whole lot to improve reliability. Furthermore, this puts us in a paradigm where we are constructing our code for our tests, rather than constructing tests for our code.
I suspect the answer lies somewhere in a combination of smoke tests, testing common user flows, testing edge cases in which we suspect the software will fail, and testing critical components the failure of which is unacceptable (e.g. would lead to a loss of human lives which aren't covered by our insurance).
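To sketch what that mix might look like in practice, reusing the classify and to_arr names from section I as stand-ins (the file names and expected fallbacks are purely illustrative):

# Smoke test over the most common flow: an ordinary photo should classify
# without crashing and return a known label.
def test_smoke_common_flow():
    labels = {'cat', 'cheetah', 'tiger', 'dog', 'wolf', 'coyote', 'other'}
    assert classify(to_arr('cat.jpg')) in labels

# Edge cases where we suspect the model misbehaves; the expectation here
# is "fail gracefully", not "be right".
def test_suspected_edge_cases():
    assert classify(to_arr('all_black.jpg')) == 'other'
    assert classify(to_arr('tiny_8x8.jpg')) == 'other'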
So in this paradigm, I'm left at an impasse.
Most testing I see happening is either redundant, indicative of a team using the wrong language, or so impossibly hard to design that it becomes a bigger challenge than writing the actual software (when it's possible at all).
Initially, I was planning on writing a much bigger article on the topic, where I delved more into what I see as potential solutions: good heuristics that could construct a coherent picture of what type of testing should be written for any given product. But the more I write, the harder the problem looks.
I ended up constructing a different formulation of the above enumeration to serve as followup-inducing inspiration:
- Canto I
- Canto II
- Canto III
- Canto IV
Published on: 2020-09-11