

Should theories have a control group

It is a commonly held belief that natural selection is a process by which random modifications to a type of organism accumulate over time, based on their ability to increase their chances of survival and reproduction.

It is also a commonly held belief that an omnipotent god created all life as we know it according to his grand design.

Given the circular nature of metaphysics, I think we can just as well postulate a 3rd hypothesis:

All of the seemingly “random” changes that sustain evolution were dictated by god’s grand design; however, god was such a mediocre designer that we can’t separate his “grand design” from what would be just “random mutations”.

I find this third theory for the emergence of organic life to be far more spiritually satisfying than the other two.


Recently I stumbled upon this paper, which basically shows that various state-of-the-art techniques for neural architecture search (NAS) do no better than random search.

It’s pretty experimentally thorough as far as ML papers go, which is to say not much better than most. They ran benchmarks on two toy datasets which are used in a lot of papers (including many of the NAS papers): PTB and CIFAR-10. Also, they seem to have at least attempted to control for some of the inherent randomness:

we run 10 experiments with a different initialization of the sampling policy.

Again, this might not sound like an ideal experiment from which one can draw conclusive statements, but this is the ML literature I’m working with… as far as I’m concerned these guys are trying too hard.

Also, the reddit discussion includes a bunch of people citing various papers and personal experiences that have similar claims, but I couldn’t be bothered to follow up on those links.

Regardless, I think there’s a good reason why they are trying so hard to use best experimental practices: the claims they are making, taken at face value, seem quite outrageous:

The most popular architecture search methods (DARTS, NAO, ENAS and Bayesian NAS) are no better than searching randomly. Not in the overall accuracy of the best model, not in the accuracy variation depending on the initial condition and (obviously) not in the time that it takes to actually train.

Is it quite the same as throwing the work of a few dozen Ph.D. theses down the drain? Maybe not; they do seem to suggest some ways of addressing the issue, or at least they identify some root causes. But I’m unsure, I’m not an expert in this area… and apparently, most expert-built algorithms are outperformed by coin tossing, so are there even experts in this area?

Maybe there are certain datasets where NAS algorithms do perform better than random, but regardless, I want to use this as an entry point into a broader question: “Should we and can we use random controls to confirm theory validity?”

Theory as a search tool

It seems that one of the chief roles of scientific theory, when applied to the real world, is very similar to the role of NAS algorithms: it’s a tool for limiting the search space.

Indeed, this can be said of almost any theory, other than those modeling things like mathematics and formal logic. In those cases, the theory is basically everything there is to it; there’s no underlying reality.

But for most theories, instead of using them to predict the world without checking, we often predict the results of experiments, then run those experiments to validate that the results are true.

For example, we might predict theoretically that, given 2cm extra of fuselage on our plane, its top speed will go down by 50km/h, its resistance to impacts with swallows will increase to 4,788J/m^2, its behavior when encountering crosswinds of speed x will now be y… etc.

But usually, at least as far as I’ve been led to believe, all of these theoretically sound conclusions are thoroughly tested. First by experimenting on individual components, then on the fuselage as a whole in a wind tunnel, then by building a few test planes and actually flying them.

Granted, I think there are ways of doing research that don’t fall under this paradigm, but I think that many do. To put it in more abstract terms, you have something like:

  1. Given solution search space S, use theory A in order to pick an ideal solution
  2. Run experiments to validate the ideal solution.
  3. If the solution fails, try theory B, or some variation of theory A, and go back to step 1… rinse and repeat until step 2 succeeds, OR
  4. If all theory fails, try using your experimental data and previous experiment data to build a new theory, then go back to step 1.
  5. Give up when you’ve run out of funding.

We have the theory acting upon the search space to find a solution and implicitly an experiment that will help validate said solution.

What would a control for this process look like?

Well, something like:

  1. Given solution search space S, sample a solution randomly.
  2. Run experiments to validate the selected solution.
  3. If the solution fails, go back to step 1 (make sure to sample a different solution this time).
  4. Give up when you’ve run out of funding.
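The two procedures above can be sketched side by side as one propose-and-test loop with two different proposers. This is only a toy illustration under loud assumptions: the "theory" is reduced to a hypothetical scoring function that ranks candidates, the "experiment" is a hypothetical pass/fail check, and "funding" is just a cap on the number of trials. None of these names come from the paper.

```python
import random

def run_search(propose, experiment, budget):
    """Propose-and-test loop. Returns (solution, trials_used); solution is None on failure."""
    tried = set()  # candidates must be hashable
    for trial in range(1, budget + 1):
        candidate = propose(tried)
        if candidate is None:          # nothing left in S to try
            return None, trial - 1
        tried.add(candidate)
        if experiment(candidate):      # step 2: validate the candidate
            return candidate, trial
    return None, budget                # ran out of funding

def make_random_proposer(space, rng):
    """Control arm: sample uniformly from S, never repeating a solution."""
    def propose(tried):
        remaining = [s for s in space if s not in tried]
        return rng.choice(remaining) if remaining else None
    return propose

def make_theory_proposer(space, score):
    """Theory arm: a hypothetical scoring function ranks S; try the best-ranked first."""
    ranked = sorted(space, key=score, reverse=True)
    def propose(tried):
        for s in ranked:
            if s not in tried:
                return s
        return None
    return propose

# Toy setup (all hypothetical): S is 0..99, only solution 37 "works",
# and the theory believes good solutions sit near 40.
space = list(range(100))
works = lambda s: s == 37
theory_score = lambda s: -abs(s - 40)

theory_result = run_search(make_theory_proposer(space, theory_score), works, budget=20)
control_result = run_search(make_random_proposer(space, random.Random(0)), works, budget=20)
print("theory: ", theory_result)
print("control:", control_result)
```

The only thing that differs between the two arms is the proposer, which is the point: the budget, the experiment and the search space S are shared, so any difference in outcome can be attributed to the theory.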

The one problem here is how you go about defining S. In our airplane example, suppose we are looking for a solution to “how do we make this more resistant to impacts with birds”. If S contains all possible ideas, including “Paint it with tiger stripes” or “Add a cotton candy machine”, then we will be sampling arbitrary nonsense forever.

In a realistic scenario S would have to be defined as something like “The search space of all easy-to-implement changes which are based on common sense and have some relationship with the desired outcome”.

But that might still allow for a solution like “Make the fuselage thicker by 50cm”. So I think it’s warranted to add something like: “And where the magnitude and range for any given parameter are closely matched to the magnitude and range of that parameter, or a very similar parameter, as used in other related experiments”.

Or, less formally: “The search space S contains all common sense solutions which can be implemented and tested quickly and easily given the current budget”, where “common sense” is defined by “similar to things tried in other experiments aiming at similar topics” or “implying a very small variation from the current setup” or “the obvious possibility of a causal relationship between the parameter being tweaked and the outcome”.

Regardless of which of the definitions you prefer, I believe that, at least from a high vantage point, this search space can be defined using simple heuristics.
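As a minimal sketch of one such heuristic, the snippet below bounds each parameter by the range used in earlier related experiments, widened by a small slack, and then samples a random solution from inside those bounds. The parameter names, the historical values and the `slack` knob are all made-up placeholders for illustration, not real aeronautical data.

```python
import random

# Hypothetical values "used in other related experiments" (placeholders only).
previous_experiments = {
    "fuselage_thickness_cm": [1.8, 2.0, 2.2, 2.5],
    "wing_sweep_deg": [25.0, 27.5, 30.0],
}

def build_search_space(history, slack=0.1):
    """Bound each parameter by its previously used range, widened by `slack`."""
    space = {}
    for name, values in history.items():
        lo, hi = min(values), max(values)
        margin = slack * (hi - lo)
        space[name] = (lo - margin, hi + margin)
    return space

def sample_solution(space, rng):
    """Draw one random solution from S, uniformly within each parameter's bounds."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}

S = build_search_space(previous_experiments)
candidate = sample_solution(S, random.Random(0))
print(S)
print(candidate)
```

Under this definition a change like “Make the fuselage thicker by 50cm” can never be sampled, because it falls far outside anything tried before. How wide `slack` should be is exactly the judgment call discussed in the text.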

Would this actually work?

So, for the sake of argument, let’s assume that whenever a given theory tells us to run a specific experiment, we also run a random control experiment. Can we gain any insight out of this?

Well… maybe. It’s hard to tell, since a lot of experiments have foregone conclusions: there are cases where a theory has close to ideal predictive power.

For this method to shine through you’d need to specifically select for “theories with a lot of edge cases”. But if someone were to actually apply this kind of method, that’s the kind of theory they would try to target anyway. I don’t think anybody is contesting the accuracy of inorganic chemistry or statics in the niches of the world they’ve claimed as their own.

Also, assuming some degree of “common sense”, or at least knowledge of all previous related experiments, is required to determine S, who would be tasked with determining it? Wouldn’t the experimenters be biased towards making S as unreasonably wide as possible, in order to defend against the possibility of their theory being just a statistical artifact?

Again, maybe, but this kind of argument could be used to invalidate most scientific research; we have to assume that on average the vast majority of work is done with “pure” intentions.

Finally, what should you do assuming that this method actually works, that is, assuming that you do end up randomly sampling solutions and randomness does about as well as theory?

Would anyone act on this? I wouldn’t think so. After all, many fields are unable to make good predictions about the systems they pretend to be modeling (see economics, sociology, psychology) and that has proved to be no detraction.

Though it might be that physicists would act with a bit more care than economists if they heard “Random people on the street are able to make forecasts about your object of study about as well, if not better, than your theories”.
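As for what “randomness does about as well as theory” could mean in practice, one hedged option is a simple permutation test on the pass/fail outcomes of the theory-picked experiments versus the randomly-picked controls. This is a sketch under my own assumptions (binary outcomes, a one-sided test, a fixed seed), not a method taken from the paper, and the function name is hypothetical.

```python
import random

def permutation_p_value(theory_outcomes, control_outcomes,
                        n_permutations=10_000, seed=0):
    """One-sided p-value for 'theory-picked experiments succeed more often than random ones'.

    Outcomes are sequences of 0/1 (failure/success), one entry per experiment.
    """
    rng = random.Random(seed)
    n_theory = len(theory_outcomes)
    n_control = len(control_outcomes)
    observed = sum(theory_outcomes) / n_theory - sum(control_outcomes) / n_control

    pooled = list(theory_outcomes) + list(control_outcomes)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffle the labels: if theory adds nothing, any split is equally likely.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_theory]) / n_theory - sum(pooled[n_theory:]) / n_control
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)
```

A large p-value here would be the uncomfortable result: the theory's picks are indistinguishable from the control's. A small one says the theory is at least doing more than limiting the search space at random.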

Another interesting idea that stemmed from playing with this is whether or not certain theories could be “immune” to being controlled for.

Intuitively it seems that only a “bad” theory would offer no possibility of a random control. That is, the control assumes there is some search space and some problem; however, you could have a theory where the problem is the theory itself.

In other words, you could be looking for data purely to validate or expand the very theory that’s being used to dictate the experiments being done. In this case, randomly searching would be obviously worse. But also, this kind of endeavor seems very flawed to begin with.

Overall I’m pretty uncertain about this idea. It’s one of those things that seems so obvious that I’m surprised I haven’t heard it mentioned by anyone before, at least not in a broader context; in statistics and ML, controlling against random is a pretty familiar concept. It might be that it only works with an oversimplified definition of “theory” and “experiment” and it couldn’t be applied to a real research environment.


Published on: 2020-04-24


