This blog is no longer active, you can find my new stuff at:,,, and

Training our humans on the wrong dataset

I really don't want to say that I've figured out the majority of what's wrong with modern education and how to fix it, BUT

1. We train ML models on the tasks they are meant to solve

When we train (fit) any given ML model for a specific problem, on which we have a training dataset, there are several ways we go about it, but all of them involve using that dataset.

Say we’re training a model that takes a 2d image of some glassware and turn it into a 3d rendering. We have images of 2000 glasses from different angles and in different lighting conditions and an associated 3d model.

How do we go about training the model? Well, arguable, we could start small then feed the whole dataset, we could use different sizes for test/train/validation, we could use cv to determine the overall accuracy of our method or decide it would take to long... etc

But I'm fairly sure that nobody will ever say:

I know, let's take a dataset of 2d images of cars and their 3d rendering and train the model on that first.

If you already have a trained model that does some other 2d image processing or predicts 3d structure from 2d images, you might try doing some weight transfer or using part of the model as a backbone. But that's just because the hard part, the training, is already done.

To have a very charitable example, maybe our 3d rendering is not accurate enough and we've tried everything but getting more data is too expensive. At that point, we could decide to bring in other 2d to 3d datasets and also train the model on that and hope there's enough similarity between the two datasets that the model will get better at the glassware task.

One way or another, we'd try to use the most relevant dataset first.

2. We don't do this with humans

I'm certain some % of the people studying how to implement basic building blocks (e.g. allocators, decision trees, and vectors) in C during a 4-year CS degree end up becoming language designers or kernel developers and are glad they took the time to learn those things.

But the vast majority of CS students go on to become frontend developer or full-stack developers where all the "backend" knowledge they require is how to write SQL and how to read/write from file and high-level abstractions over TCP or UDP sockets.

At which point I ask something like:

Well, why not just teach those people how to make UIs and how to write a basic backend in flask?

And I get a mumbled answer about something-something having to learn the fundamentals. To which I reply:

I mean, I don't think you're getting it, the things you are teaching haven't been fundamental for 20+ years, they are hard as fuck. I know how to write a half-decent allocator or an almost std-level vector implementation and I know how to make a basic CRMS and I can't imagine what sort of insane mind would find the former two easier to learn. I also can't imagine the former two being useful in any way to building a CRMS, other than by virtue of the fact that learning them will transfer some of the skills needed to build a CRMS.

At which point I get into arguments about how education seems only marginally related to salary and job performance in most programming jobs. The whole thing boils down to half-arsed statistics because evaluating things like salary, job performance, and education levels is, who would have guessed, really hard.

So for the moment, I will just assume that your run of the mill angular developer doesn't need a 4 year CS degree to do his job and a 6-month online program that teaches the direct skills required is sufficient.

3. Based on purely empirical evidence, I think we should

Going even further, let's get into topics like memory ordering. I'm fairly sure these would be considered fairly advanced subjects as far as programming is concerned, to know how to properly use memory ordering is to basically write ASM code.

I learned about the subject by just deciding one day that I will write a fixed size, lock-free, wait-free, thread-safe queue that allows more multi-reader, multi-writer, or both... then to make it a bit harder I went ahead and also developed a Rust version in parallel.

Note: I'm fairly proud of the above implementations since I was basically a kid 4 years ago when I wrote them. I don't think they are well tested enough to use in production and they likely have flaws that basic testing on an x86 and raspberry PI ARM processor didn't catch. Nor are they anywhere close to the most efficient implementations possible.

I'm certain that I don't have anywhere near a perfect grasp of memory ordering and "low level" parallelism in general. However, I do think the above learning experience was a good way to get an understanding equivalent to an advanced university course in ~2 weeks of work.

Now, this is an n=1 example, but I don't think that I'm alone in liking to learn this way.

The few people I know that can write an above half-arsed compiler didn't finish the Dragon book and then started making their first one, they did the two in parallel or even the other way around.

I know a lot of people who swear by Elm, Clojure, and Haskell, to my knowledge none of them bothered to learn Category theory in-depth or ever read Hilbert, Gödel or Russel. This didn't seem to stop them from learning Haskell or becoming good at it.

Most ML practitioners I know don't have a perfect grasp of linear algebra, they can compute the gradients for a simple neural network by hand or even speculate on the gradients resulting from using a specific loss function, but it's not a skill they are very confident in. On the other hand, most ML practitioners I know are fairly confident in writing their own loss functions when it suits the problem.

That's because most people learning ML don't start with a Ph.D. in applied mathematics, they start playing around with models and learn just enough LA to understand(?) what they are doing.

Conversely, most people that want to get a Ph.D. in applied mathematics working on automatic differentiation might not know how to use TensorFlow very well, but I don't think that's something which is at all harmful to their progress. Even though the final results will find practical applications in libraries over which Tensorflow might serve as a wrapper.

Indeed, in any area of computer-based research or engineering people seem comfortable tackling hard problems even if they don't have all the relevant context for understanding those problems, they have to be, one lifetime is enough to learn all the relevant context.

That's not to say you never have to learn anything other than the thing you are working on, I'd be the last person to make that claim. But usually, if you have enough context to understand a problem, the relevant context will inevitably come up as you are trying to solve it.

4. Why learning the contextual skills independently is useful

But say that you are a medieval knight and are trying to learn how to be the most efficient killing machine in a mounted charge with lances.

There are many ways to do it, but hopping on a horse, strapping you spurs and charging with a heavy lance attached your warhorse (the 1,000kg beast charging at 25km/h into a sea of sharp metal) is 100% not the way to do it.

You'd probably learn how to ride first, then learn how to ride fast, then learn how to ride in formation, then learn how to do it in heavy armor, then how to do it with a lance... etc

In parallel, you'd probably be practicing with a lance on foot, or by stabbing hey sacks with a long blunt stick while riding a poney.

You might also do quite a lot of adjacent physical training, learn to fight with and against a dozen or so weapons and types of shield and armor.

Then you'd spend years helping on the battlefield as a squire, a role that didn't involve being in the vanguard of a mounted charge.

Maybe you'd participate in a tourney or two, maybe you'd participate a dozen mock tourneys where the goal is slightly patting your opponent with a blunt stick.


Because the cost of using the "real training data" is high. If I would be teleported inside a knight's armor strapped to horse galloping towards an enemy formation with a lance in my hand I would almost certainly die.

If someone with only 20% of a knight's training did so, the chance of death or debilitating injury might go down to half.

But at the end of the day, even for the best knight out there, the cost of learning in battle is still a, say, 1/20th chance of death or debilitating injury.

Even partially realistic training examples like tourneys would still be very dangerous, with a small chance of death, a small but significant chance of a debilitating injury, and an almost certainty of suffering a minor trauma and damage to your expensive equipment.

I've never tried fighting with blunt weapons in a mock battle, but I have friends who did and they inform me you get injured all the time and it's tiresome and painful and not a thing one could do for a long time. I tend to believe them, even if I was wearing a steel helmet, the thought of being hit over the head full strength with a 1kg piece of steel is not a pleasant one.

On the other hand, practicing combat stances or riding around on a horse involve almost zero risk of injury, death, or damage to expensive equipment.

The knight example might be a bit extreme, but even for something as simple as baking bread, the cost of failure might be the difference between being able to feed yourself or starving/begging for food for the next few weeks.

The same idea applied and to some extent still applies to any trade. When the stakes involve the physical world and your own body "training on real data" is prohibitively expensive if you're not already 99% fit for the task.

5. Why the idea stuck

We are very irrational and for good reason, most of the time we try to be rational we fail.

If for 10,000 years of written history "learning" something was 1% doing and 99% learning contextual skills we might assume this is a good pattern and stick to it without questioning it much.

Maybe we slowly observe that in e.g. CS people that code more and learn theory less do better, so our CS courses go from 99% theory and 1% coding to 25/75. Maybe we observe that frontend devs that learn their craft in 10 weeks are just as good interns as people with a CS degree, so we start replacing CS degrees for frontend developers with 6-month "boot camps".

But what we should instead be thinking is whether or not we should throw away the explicit learning of contextual skills altogether in these kinds of fields.

I will, however, go even further and say that we sometimes learn contextual skills when we'd be better of using technology to play with fake but realistic "training data".

Most kids learn science as if it's a collection of theories and thus end up thinking of it as a sort of math-based religion, rather than as an incomplete and always shifting body of knowledge that's made easier to understand with layers of mathematical abstraction and gets constantly changed and refactored.

This is really bad, since many people end up rejection is the same way they'd reject a religion, refusing to bow down before the tenured priesthood. It's even worst because the people that "learn" science usually learn it as if it were a religion or some kind of logic-based exercise where all conclusions are absolute and all theories perfectly correct.

But why do we teach kids about theory instead of letting them experiment and analyze data, thus teach them the core idea of scientific understanding? Because experiments are expensive and potentially dangerous.

Sure, there's plenty of token chemistry, physics, and even biology experiments that you can do in a lab. However, telling an 8th-grade class:

Alright kids, today we are going to take nicotine and try to design an offshoot that it's particularly harmful to creatures with this particular quick in their dopaminergic pathways.

It sounds like a joke or something done by genius kids inside the walls of an ivory tower. Not something that can be done in a random school serving a pop-2,000 village in Bavaria.

But why?

After all, if you make enough tradeoffs, it doesn't seem that hard to make a simulation for this experiment. Not one that can be used to design industrial insecticides mind you. But you can take a 1,000 play-neurons model of an insect brain then create 500 different populations, some of which have a wired quickly that causes arrhythmia when certain dopamine pathways are overly stimulated.

You can equally well make a toy simulation for mimicking the dopaminergic effects of nicotine derivative based on 100 simple to model parameters (e.g. expression of certain enzymes that can destroy nicotine, affinity to certain sites, ease of passing the blood-brain barrier).

You're not going to come up with anything useful, but you might even stumble close to an already known design for a neonicotinoid insecticide.

The kids or teachers needn't understand the whole simulation, after all that would defeat the point. They need only be allowed to look at parts of it and use their existing chemistry and biology knowledge to speculate on what change might work.

Maybe that's too high of a bar for a kid? Fine, then they need only use their chemistry knowledge to figure out if a specific molecule could even exist in theory, then fire away as many molecules as possible and see how the simulation reacts. After all, that sounds like 99% of applied biochemistry in the past and potentially even now.

This analogy breaks down at some point, but what we have currently is so skewed towards the "training on real data is a bad approach" extreme that I think any shift towards the "training on real data is the correct approach" would be good.

The shift from "training of real data" being expensive and dangerous is a very recent one, to be charitable it might have happened in the last 30 years or so with the advent of reasonably priced computers.

Thus I think it's perfectly reasonable to assume most people haven't seen this potential paradigm shift. However, it seems like the kind of swim or sink distinction that will make or break many education systems in the 21st century. In many ways, I think this process has already started, but we are just not comprehending it fully as of yet.

Other potentially tangential things I wrote on this topic:

Published on: 2020-06-21



twitter logo
Share this article on twitter
 linkedin logo
Share this article on linkedin
Fb logo
Share this article on facebook