Musings on the impossibility of testing
I'm always a bit flabbergasted when I see how people test their code (especially with unit tests). Often I can't figure out a cohesive set of rules based on which the tests are written.
Testing seems to work from a pragmatic perspective; we have evidence that it works. But different people define "testing" in different ways, and I'm awfully afraid that what some people refer to as "testing" is completely unrelated to that unrigorous but pragmatic practice that saves us from production bugs.
I - A contrived example
To showcase my arbitrary formalism as it relates to testing, I think it's worthwhile to look at some cases:
1)
# Testing a multiplication function
func multiply(int a, int b) -> int

func test_1():
    assert 2*5 == multiply(2,5)
    assert 10*12 == multiply(12,10)

func test_2():
    assert 10 == multiply(2,5)
    assert 120 == multiply(12,10)

func test_3():
    assert type(multiply(2,5)) == int
    assert type(multiply) == func(int,int)->int
2)
# Testing an image classification algorithm
func classify(arr[int] image) -> str

func test_1():
    assert img_classification_nn(to_arr('cat.jpg')) == classify(to_arr('cat.jpg'))
    assert img_classification_nn(to_arr('dog.jpg')) == classify(to_arr('dog.jpg'))

func test_2():
    assert 'cat' == classify(to_arr('cat.jpg'))
    assert 'dog' == classify(to_arr('dog.jpg'))

func test_3():
    possible_targets = ['cat','cheetah','tiger','dog','wolf','coyote','other']
    assert classify(to_arr('rand_img_1.jpg')) in possible_targets
    assert classify(to_arr('rand_img_2.jpg')) in possible_targets
For both examples, test_1 obviously makes no sense. You're testing the functionality of a function using another function that does the same thing. If the function used to assert correct functionality is better (i.e. if * is better than multiply, or img_classification_nn is better than classify), then one should simply use that function instead. If they are equally flawed, then the tests are redundant.
These kinds of tests are a theatrical performance meant to build up an illusion of quality. They're usually spotted in enterprise-grade codebases and in teams that harp on about TDD a lot. I'll classify these as redundant checks.
Next, test_2 is more interesting: it compares the result of a function with what a human would expect that result to be. Here I would argue the test makes sense for the classify function but not for the multiply function.
Computers are good at multiplying; humans are horrible at it. So testing this function against an arbitrary number of human answers is much more likely to reveal flaws in the human's answers than in the computer's. Since the function is very simple, the time would probably be better spent reviewing the actual code to make sure it's error-free.
On the other hand, computers aren't very good at classifying images (though they are getting there), but humans are spot on. So using a human's answers to validate a vision algorithm makes perfect sense.
We'll call these types of tests human validations.
Finally, we get to test_3, which superficially seems less flawed than the other two. It's more of a sanity check than a test: we aren't checking the actual return value of the function in a given case, we are instead checking a higher-level behavior.
These tests are much more common in the codebases of dynamic languages, and for good reason. In our first example, we are just validating the signature of a function, something a compiler will implicitly do for us in any statically typed language.
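For illustration, here's a minimal sketch (in Python, assuming a static checker such as mypy runs as part of the build) of how that signature guarantee moves out of the test suite and into the toolchain:

# With type annotations, a checker like mypy enforces the int -> int
# contract at build time, making test_3's runtime type assertions redundant.
def multiply(a: int, b: int) -> int:
    return a * b

# mypy would reject this call at check time, no test needed:
# multiply("2", 5)  # error: argument has incompatible type "str"; expected "int"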
In the second example, we are instead working around a limitation of our type system. We are checking whether the result of our function (a string) falls within 7 possible values (the things our function should classify images as), something we could check implicitly by using a language that supports enum types and writing classify such that it returns an enum of those 7 classes.
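A rough sketch of that alternative in Python, using enum and a static checker; the Label members are just placeholders for the 7 classes above:

from enum import Enum

class Label(Enum):
    CAT = "cat"
    CHEETAH = "cheetah"
    TIGER = "tiger"
    DOG = "dog"
    WOLF = "wolf"
    COYOTE = "coyote"
    OTHER = "other"

def classify(image: list[int]) -> Label:
    # model inference would go here; placeholder return for the sketch
    return Label.OTHER

# Any code path returning something outside Label is rejected by the
# checker, so the "is the result one of the 7 classes?" test disappears.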
Basically, these kinds of tests can be useful, but they cover the kind of thing a compiler can handle automatically; when they are present, they usually indicate a mismatch between the project and the language the programmer is using for it. We'll call these compiler rules.
Obviously, this system is fairly arbitrary; it's just a conclusion I reached after looking at various codebases and talking with various people about their tests, but I find that it works quite well.
To re-iterate, we have:
- redundant checks
- human validation
- compiler rules
II - Compiler rules
Compiler rules are great, but it's a bad sign when they leak into your testing.
Some amount of compiler rules leaking into testing can't be helped. Until the advent of Rust and/or well-performing libraries for safe multi-threading, testing thread safety would have been a requirement in many codebases. Similarly, before modern type systems and allocation techniques, various now-pointless memory checks would have fared well within a codebase (and, I assume, still do on some embedded devices).
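As a concrete illustration, here's the kind of thread-safety check I have in mind, sketched in Python with a hypothetical Counter class; the guarantee it probes is exactly the one Rust's ownership rules enforce at compile time:

import threading

class Counter:
    # A lock-guarded counter; the test below probes the guarantee that
    # Rust's type system would enforce statically.
    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        with self._lock:
            self._value += 1

    @property
    def value(self) -> int:
        return self._value

def test_thread_safety():
    counter = Counter()

    def worker():
        for _ in range(10_000):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # A lost update would make this fail; the unsynchronized equivalent
    # simply wouldn't compile in Rust.
    assert counter.value == 8 * 10_000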
Even so, switching to more modern languages and compilers is often a much harder task than just writing some tests, but I think these tests should be viewed as a cautionary sign against the language currently in use. If they only cover a small but critical percentage of the codebase, everything is fine; if they cover most of it, that's an indicator of using the wrong compile-time tooling.
The one difficult thing about these tests might be spotting them: an inexperienced team might be writing loads of them without realizing there are better alternatives to the compile-time toolchain they are using, alternatives which would remove the need for them.
III - Redundant checks
Redundant checks are often pointless busywork, but they can serve a valuable role in some cases.
The two main scenarios that come to mind are as follows:
Given two versions of the same function, one well-proven and one "experimental", where the "experimental" function has some benefits in terms of performance or generalizability, it seems reasonable to test the "experimental" function using the well-proven approach. Still, the overhead here is so severe I find it difficult to think of a real-world example where this applies.
Given two versions of the same function, one in a "new" codebase and one in an "old" codebase, it makes sense to test the "new" version against the "old" version. In this sense, going back to compiler rules, redundant checks can be useful in the process of refactoring, especially for major changes like switching the language or the core framework upon which the project is built. However, this seems useful only as a "temporary" measure rather than a permanent fixture of a project.
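A sketch of what such a temporary cross-check might look like, with hypothetical legacy_parse_amount / new_parse_amount functions standing in for the "old" and "new" implementations:

import random

# Hypothetical pair: a battle-tested legacy implementation and a rewrite.
def legacy_parse_amount(s: str) -> int:
    return int(s.replace(",", ""))

def new_parse_amount(s: str) -> int:
    return int("".join(ch for ch in s if ch.isdigit()))

def test_new_matches_legacy():
    # Temporary cross-check for the migration; delete once the legacy
    # implementation is retired.
    for _ in range(1_000):
        amount = random.randint(0, 10**9)
        formatted = f"{amount:,}"  # e.g. "1,234,567"
        assert new_parse_amount(formatted) == legacy_parse_amount(formatted)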
That being said, I'd wager that most tests falling in this category are, as mentioned before, there only as "filler", indicative of a much bigger underlying problem: a team that engages in busywork. If I ever accepted such a test, I would require plenty of accompanying comments explaining why it exists and when it can be removed.
IV - Human validation
Human validation is the "sanest" type of testing one can perform, in that it can't be replaced by good practices or good languages. The problem with human validation comes from the fact that, in some cases, human intelligence can't solve the problems the software is designed to solve, or can't cover all the edge cases the software will encounter.
Take, for example, banking software. It's fairly easy to imagine what should happen with a piece of banking software that handles dozens of customers, hundreds of transactions, and a few regulatory restrictions. All the logic there can be written down into tests that validate our software. However, the software itself must scale to millions of customers, trillions of transactions, and thousands of regulatory restrictions, something a human mind can't comprehend coherently enough to write the tests.
Still, banking software is an easy example, because the various components might be testable and their limited logic might be fully comprehensible by a human, even accounting for all the edge cases. This is because the inputs to the software are very well defined: there are only so many things one can do with a banking API, a finite space which a human mind can explore exhaustively when it's properly broken down.
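To make that concrete, here's a sketch of the kind of human validation that works for such a component, using a hypothetical (and deliberately trivial) overdraft-fee rule:

# Hypothetical component: a flat overdraft fee, small enough for a human
# to enumerate the interesting cases by hand.
def overdraft_fee(balance: int, withdrawal: int) -> int:
    return 25 if withdrawal > balance else 0

def test_overdraft_fee():
    assert overdraft_fee(100, 50) == 0    # plenty of funds
    assert overdraft_fee(100, 100) == 0   # exact balance, no overdraft
    assert overdraft_fee(100, 101) == 25  # overdrawn by one unit
    assert overdraft_fee(0, 1) == 25      # empty account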
One real problem arises when we have software with a very broad and/or poorly defined input and/or output space. A few examples of these would be:
- Scrapers that distill information from a broad range of websites (e.g. the ones used by a search engine)
- Machine learning algorithms meant to work on a broad range of data (e.g. a decision tree or a gradient boosting classifier implementation in a library like sklearn)
- Any software that has to work well with user-provided code extensions (e.g. think of a video game like Skyrim that has to support mods)
- Simulation software (e.g. physics simulations)
- Creative software (e.g. multimedia editing & creation, game engines)
- Compilers
One can separate these into components with input-output spaces that can be easily comprehended by the human mind, but some components that suffer from the above problems are bound to remain. It also introduces the problem of separating the code in such a way that components can be easily tested, rather than with refactoring, extension, speed, or readability as the main concern.
V - Human validation and cost
Assuming that we've designed our software such that it has many human-comprehensible components, the issue of cost remains. "Human-comprehensible" is a vague term; there are many things which, given enough time, a person could write exhaustive tests for, but in practice this often takes more time than we can allocate. Furthermore, in certain situations, even with a limited input & output space, it might still be easier to write the "generative" logic (the one that maps inputs to outputs) than to come up with the mappings ourselves.
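A small sketch of that trap, using a leap-year function as a stand-in: the hand-picked table is genuine human validation, while the "exhaustive" version quietly re-derives the rule and collapses into a redundant check:

def is_leap(year: int) -> bool:
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# Human validation: a hand-picked table of cases we are sure about.
def test_leap_by_hand():
    assert is_leap(2000)      # divisible by 400
    assert not is_leap(1900)  # divisible by 100 but not 400
    assert is_leap(1996)
    assert not is_leap(2019)

# The tempting alternative: "exhaustive" coverage whose expected values
# come from re-deriving the same rule, i.e. a redundant check in disguise.
def test_leap_generatively():
    for year in range(1, 3000):
        expected = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
        assert is_leap(year) == expected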
In certain cases, exhaustive validation can be done, but it can be prohibitively expensive. Banks are once again a good example of this being done in practice. So much is riding on their (relatively simple) software that they can actually pay for this type of exhaustive validation for every component, but even so, it's fairly easy for catastrophic failures to happen (e.g. HSBC failures in 2019 & 2016).
At most, one can break down the software into components such that a fraction of them have an input & output space limited enough for a human to test. But for this to be worthwhile, the "testable" components ought to be among the most critical and/or failure-prone bits of the software, otherwise we aren't doing a whole lot to improve reliability. Furthermore, this puts us in a paradigm where we are constructing our code for our tests, rather than constructing tests for our code.
I suspect the answer lies somewhere in a combination of smoke tests, testing common user flows, testing edge cases in which we suspect the software will fail, and testing critical components the failure of which is unacceptable (e.g. would lead to a loss of human lives which aren't covered by our insurance).
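To sketch what that mix might look like in practice, reusing the classify and to_arr names from section I as stand-ins (the file names and expected fallbacks are purely illustrative):

# Smoke test over the most common flow: an ordinary photo should classify
# without crashing and return a known label.
def test_smoke_common_flow():
    labels = {'cat', 'cheetah', 'tiger', 'dog', 'wolf', 'coyote', 'other'}
    assert classify(to_arr('cat.jpg')) in labels

# Edge cases where we suspect the model misbehaves; the expectation here
# is "fail gracefully", not "be right".
def test_suspected_edge_cases():
    assert classify(to_arr('all_black.jpg')) == 'other'
    assert classify(to_arr('tiny_8x8.jpg')) == 'other'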
So in this paradigm, I'm left at an impasse.
Most testing I see happening is either redundant, indicative of a team using the wrong language, or so impossibly hard to design that it becomes a bigger challenge than writing the actual software (when it's possible at all).
Initially, I was planning on writing a much bigger article on the topic, where I delved more into what I see as potential solutions: good heuristics that could construct a coherent picture of what type of testing should be written for any given product. But the more I write, the harder the problem looks.
I ended up constructing a different formulation of the above enumeration to serve as followup-inducing inspiration:
- Canto I
- Canto II
- Canto III
- Canto IV
Published on: 2020-09-11