


The second hardest thing in programming - Part 1

There are only two hard things in Computer Science: cache invalidation and naming things.

I don't remember when I heard that, but it stuck. When a quip "clicks" for me, I can't rest until I find a framing in which it's disproved. But it's been 6 years now, and I'm still unable to muster a reply to the above other than "Yeah, yeah, it seems so".

I'm pretty sure cache invalidation is the harder of the two, because of some underlying OCD most of us have about picking two arbitrary constants that dictate sweep frequency and TTL. I'd even go as far as to speculate most of the evolution of programming has been a weird spiritual exercise around trying not to pick those two arbitrary numbers. Functional programming as a whole can even be understood as a set of taboos around this, because what good is shared OCD if it doesn't lead to religion.

I digress. I'm neither old, wise nor crazy enough to talk about cache invalidation, so I'll talk about the second hardest thing in programming instead: naming things.

i - Naming necessity

Maybe I'm being vague here, but please bear with me; I am trying to introduce the practice that leads to the creation of civilization and distinguishes us from apes.

Naming helps one make sense of code.

That's not quite right... naming gives sense to code.

In theory, names could be arbitrary; at the ASM level it's more than reasonable to replace names with addresses. Yet names are so useful that they persist even when code is compiled, presumably because they help preserve sanity when one is forced to look at the generated ASM.

In so far as names help make sense of code, they operate at different levels.

Names help connect the developer with the end goal of the software. I think the main reason OO became popular as a paradigm for teaching was a culture of example programs that used names referencing the "real world". Using words like "Shop", "Transaction", "buy" and "transfer_to" gives the student's bored brain some bit of reality to hang on to.

Names help in building a mental map of software. In this regard, names function much like any of the other arbitrary categories we instil upon the world, be they the ones we use to refer to animals, other people, the night sky or, to make the similarity to code more obvious, the content of one's bowel movements. In this sense, names should optimize for a map of the code that makes sense to the one writing it, you.

I'm just kidding though, the "you" here is really "a team" and the "a team" here is really "a team with people coming and going, people who have imperfect memory and ever-changing minds with which they read that map".

We can already see some tradeoffs here. Names that will help you now are not necessarily ones that will help the team 5 years from now. Maybe let IHateThisFingLoop is a useful outlet for one's anger, or val steves_moms = FAT32() provides some much-needed comedic relief. But naming for the moment can often backfire, and not only when naming for catharsis. We have such a good map of the code we're working on now that an "obvious" name can be a horrible choice in hindsight. Depending on our mental context, the name trx might be a much-welcomed shorthand for write_new_credentials_transaction, or it might be the cause of a security error that makes the news.

Names aren't there just to express the concepts we already have; once a name is chosen, if encountered often enough, it becomes its own concept. One need only look at the old sciences to see arbitrary variable names that are now rooted in people's minds as describing the fundamental nature of reality in an irreducible way: pi and e as representations of the circle, x as describing the concept of the unknown, c as describing the maximum speed an object can travel at if Maxwell's equations are to hold... etc.

If a perfectly-named codebase were to exist, it would provide the reader with an amazing understanding of the things it's used for, in addition to being easy to grok. The sad thing is that "perfectly-named" is something that varies between people and even within people (over time).

I also think that the "experience" people have with names might vary greatly based on the codebases they worked on. Working in a large codebase with loads of existing names and naming standards provides a completely different naming-experience from building something from scratch.

Indeed, understanding a codebase or even a language can probably be boiled down to being familiar with all of its names and naming conventions.

There might be a type of person that can remember all the names of the functions in a stdlib and yet know nothing about a language. But, in spite of our education system trying to optimize for the psychiatric illness which would allow this, for the vast majority of people understanding a language still boils down to knowing the vocabulary.

ii - Naming convention

The foreplay to naming things is coming up with conventions about naming things. A convention dictates the boundaries of what names one can give in various situations. For example, the conventions I usually impose are:

None of it is written down in our coding guidelines; people just sort of "catch onto it", even first-time contributors. I find this fascinating.

There are many things new people seem to miss that I have to re-iterate time and time again, but naming is never one of them. Nor was it ever a problem for me when joining a new team to pick up on their conventions.

Though maybe this ease of adoption would disappear if the conventions were too niche or too many?

Conventions are useful for two major reasons:

  1. They reduce the thought space when searching for a good name.
  2. They add meaning to existing names without making them longer.

If you want to understand naming, go digging for conventions; but due to the above issue (people catch onto them instinctively), good conventions are hard to find. Some conventions were so good they got codified into language syntax.

Did you know const (i.e. immutability) wasn't a thing in any popular programming language until the early 80s when C++ came along? People (presumably) used to write it as part of variable names and hope that it would be respected by convention.
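To make that concrete, here's a minimal sketch of my own, in Python, which still has no const for variables: immutability is signalled purely by an ALL_CAPS naming convention that nothing actually enforces.

# Python has no `const` for variables; by convention, ALL_CAPS names
# are treated as constants and readers are trusted to respect that.
MAX_RETRIES = 3          # "please don't reassign me" - guarded by convention only
retry_delay_seconds = 5  # an ordinary variable, fair game for reassignment

MAX_RETRIES = 10         # nothing stops this; the convention is the only safeguard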

The idea of objects and classes is essentially a blend of naming and file-placement conventions that got codified, at least in imperative land.

Even more so, one suspects that "types" were originally a mere naming convention, though the asteroid destroyed most of the evidence that could be used to conclude that with certainty. But nowadays "types" seem to be used as part of names in languages lacking a type system.
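Here's a hypothetical Python sketch of what I mean: Hungarian-style prefixes smuggle the "type" into the name, because the language won't track it for you.

# Hungarian-style names: the "type" lives in the name itself.
str_username = "alice"
int_retry_count = 0
lst_pending_orders = []

# The modern alternative pushes the same information into (optional) annotations:
username: str = "alice"
retry_count: int = 0
pending_orders: list = []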

iii - Naming history

But, asks the reader of 2050, I heard that back in your days there was a field called "mathematics", a thing humans did before computers, where they tried (and often failed miserably) to use their brains to execute formal logic.

You are quite perceptive in remarking that, and I agree we can't understand naming in programming while ignoring 3000 years of naming in mathematics. The most basic names in programming, those shared between most languages, those of the operators (+,-,*,/,^,& ...etc), are pulled from or inspired by math.

In my arbitrarily chosen view of the world, mathematics was an imperfect tool with imperfect creators, built in a time before modern brains and modern machines, and thus it's riddled with flaws and limitations. One of the most obvious flaws is the way things were named.

When using "math notation" people tended to prefer very short names, namely 1 symbol long. A programmer might write something akin to:

function calc_quarterly_interest(principal, rate, quarters):
    return principal*rate*quarters

Though the most obsessive might go all the way to writing:

function calculate_quarterly_interest(principal, rate, quarters):
    quarterly_returns = multiply(principal, rate)
    return multiply(quarterly_returns, quarters)

However, in math notation, it would be considered bad form to write anything longer or more expressive than

i(p,r,q)=p*r*q

The reason for this, I presume, boils down to two things:

  1. Saving paper, which could often be quite expensive and impossible to erase.
  2. Reducing the amount of writing in materials like sand or clay, which are cheap and easy to erase, but difficult to write in.

This didn't cause many issues because our brains are not very good at executing formal logic. So a given mathematical construct might have included 2, 3, 5, maybe 10 entities playing around. But to postulate an equation with thousands of variables probably seemed like madness even to a genius like Euler or a grand curator like Euclid.

Of course, we live in an age where a mildly talented 8-year-old can pick up a toy language like Scratch and construct such an equation incidentally while writing a web app. Nowadays we need only write the equations; historically our brains were also responsible for executing them. This restricted the realm of possibilities to one so tiny I shudder to think about the lamentable condition of the poor souls that helped us get to where we could build computers.

Still, the reason why the previously mentioned interest computing function would work is that the writer could simply specify beforehand: "p stands for principal, r for rate, q for number of quarters".

This is a practice that remained with us, in a weird way, until the 80s: the names of variables used to be declared at the beginning of a file (or block) before they were initialized. Though it may seem crazy to you, C and C++ allow you to compile the following code:

int a;

a = 5;

though in literally all cases we nowadays use:

int a = 5;

This used to be the preferred way to initialize things in ancient times, or so I'm told, and one has to think it might have, in a twisted way, arisen from the way mathematics separated "definition" from "usage" for its variables.

Maybe I'm being a bit droll here, so let's move the example to a function that computes savings with compounded interest:

function calc_quarterly_compounded_savings(principal, rate, quarters):
    for i in 0..quarters:
        principal += principal*rate
    return principal

Here we see an interesting quirk that greatly influences how we name things: reassigning values to a variable.

I must confess that I'm unsure why this isn't done in mathematics. Avoiding it seems like a potentially good practice, but I'm surprised it's so good that people stumbled upon it 3000 years ago and that it stuck.

Either way, this does seem to have some remarkable effects upon the way we name things. Consider the above function, but written as:

function calc_quarterly_compounded_savings(principal, rate, quarters):
    let gains = 0
    for i in 0..quarters:
        gains += (principal+gains)*rate
    return principal+gains

One could argue that adding gains to our logic is pointless and cumbersome.

Another could argue that by adding gains to the above we moved from an abstract function to an intuitive explanation of one of the fundamental concepts behind modern monetary systems.

All of this because we sacrificed some simplicity (added a variable) but gained a concept (gains). We may also postulate some middle ground such as:

function calc_quarterly_compounded_savings(principal, rate, quarters):
    let accumulated = principal
    for i in 0..quarters:
        accumulated += accumulated*rate
    return accumulated

The above maintains a distinction between the idea of "principal" and "gains" presented in the second implementation while preserving the elegant logic of the first.

But that "pointless" assignment let accumulated = principal, interestingly enough, seems closely related to what I described before as the staple of mathematical notation, separation of definition and usage, but with a twist.

As an aside, mathematics would, of course, elegantly solve this problem by introducing another layer of abstraction (powers) and saying:

i(p,r,q)=p*(1+r)^q

Though in practice this approach seems to often come up against limits (see early 20th century).
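As a quick sanity check, here's a small sketch of my own (in Python, with made-up numbers) showing that the loop from calc_quarterly_compounded_savings and the closed form p*(1+r)^q land on the same value:

def savings_by_loop(principal, rate, quarters):
    # the `accumulated` variant from above, translated to Python
    accumulated = principal
    for _ in range(quarters):
        accumulated += accumulated * rate
    return accumulated

def savings_by_formula(principal, rate, quarters):
    # the "powers" shortcut: p * (1 + r)^q
    return principal * (1 + rate) ** quarters

print(savings_by_loop(1000, 0.02, 8))     # ~1171.66
print(savings_by_formula(1000, 0.02, 8))  # same value, up to float rounding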

Once again, I digress. Can we attribute the distinction between mathematics and programming, at least in part, to programming growing up with "free" paper while math's infancy came at a time when paper was expensive? Not fully, I think, but it's a factor we shouldn't overlook.

I find it interesting that, now that Rust and Scala have become a thing, functional languages have (mainly) migrated from motivating their choices with safety to motivating them purely on an aesthetic basis.

iv - Naming context

I don't want to dig too deep into the whole math/functional aesthetic, but in terms of naming, I think they are quite different. Mathematics ended up tabooing names longer than one symbol, and thus found itself with hundreds of weird signs, each of which might have dozens or even hundreds of different definitions depending on the field it was used in.

If I am to judge functional programming by the standard libraries and software written in Haskell, Clojure and Elixir, I could easily claim their names are often quite expressive and fairly long.

Surprisingly, short names are more often than not found in imperative code; look up a few C or C++ codebases at random and you're bound to see a slew of 1-3 character entities heralding your confusion.

Might there be a reason for this? Probably not, but allow me to speculate on one:

Short names require us to have a "mental map" of each variable inside a "working cache" of sorts: you have to be able to instantly map p to principal in one context, to prediction in another and to partition in yet another.

So, in that sense, short names might serve as a safeguard limiting the size of a given block of logic; after all, this temporary mental map can only hold a few symbols and still be efficient.
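To make the "mental map" point concrete, here's a toy sketch (in Python, with names I made up) where the same one-letter name means something different in each block, and the reader is expected to reload that mapping every time:

# Here, p is a principal and r is a rate...
def interest(p, r, q):
    return p * (1 + r) ** q

# ...two screens later, p is a prediction and r is a residual.
def squared_error(p, y):
    r = y - p
    return r * r

# The long-name version carries its own map, at the cost of heavier lines.
def squared_error_verbose(prediction, observed):
    residual = observed - prediction
    return residual * residual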

On the other hand, especially when in a flow, we may be tempted to use expressive names in order to write monstrously sized blocks of logic. Functional programming seems to have other ways of safeguarding against this style, so that may have caused short names to become more of a staple of imperative languages?

Another interesting thing about short names is that, due to the aforementioned mental map required to work with them, they may serve as an interesting way of switching between mental modes of operations.

Personally, when some piece of logic becomes too complex, the steps I go through are:

I think there might be some strength to the idea that short notations are better for thinking about algorithmic problem-solving.

It might also be that they so often get used in more "traditional" and "mathematically related" areas of CS (as well as physics) due to habit or math-envy.

The answer may also lie at the border of habit and intrinsic quality; most people might be taught to separate verbal communication and logic at a young age (for obvious reasons), and certain cues (e.g. seeing 1-letter variables) may incline one towards thinking logically/mathematically, while others (e.g. seeing distinguishable words) might invoke a more prosaic mood.

v - Naming people

But, assume for a second that, for whatever reason, shorthand names allowed for more clever code. Should we use them?

I don't think so. Whoever has to read that code afterwards will have the extra-difficult job of getting into a special "state of mind" to make sense of it, a state of mind that might be difficult or annoying to attain.

That's not to mention that you yourself might be forced to read that code, and the you of the future might be dumber, more tired or just lack the time or excitement.

Actually, it's rather curious to see how names differ depending on the kind of team working on a codebase. This might be a sweeping generalization, but I tend to see longer and more detailed names in more "enterprisey" environments and shorter ones in more "hobbyistic" or "startupy" environments.

Someone writing: purchase_customer_transaction_instantiation.CollectAllInventoryFromCustomerShoppingCart versus buy_trx.get_cart_content is a fairly reliable indicator as to whether they are working in a tiny company or a publicly traded behemoth.

Larger teams and larger companies usually operate within harder constraints. The codebases tend to be more complex and older, there's a lot of people working on them and they've likely passed through many pairs of hands to reach this point. The employees usually cover a more diverse spectrum in terms of backgrounds and skills. Refactoring becomes harder to do, regardless of complexity, because of within-company dependencies and the need to update the mental model of the codebase for a larger team.

Long names help with all of those things. As I hinted at before, "knowing" a codebase is akin to knowing all the names. More verbose names seem like a wonderfully obvious way of making sure everyone is "on the same page" in regards to what something is doing.

But is there a downside to this practice?

I can't point to one, but it rubs me the wrong way. On a surface level, it feels patronizing. The names of things are so "obvious" that it in part takes away the fun of "getting" the context I'm in or the challenge of coming up with expressive (i.e. small size, high information) names.

Shorter names allow for more efficient operation, provided that you have a context-appropriate mapping cached in working memory.

Even so, shorter names have the downside of requiring one to be familiar with the potential "contexts" inside a codebase and how they interact in order to build this map.

Should I start using very long names that instantly provide all relevant context?

From a felt and aesthetic point of view, I want the answer to be no. From a reductionist point of view, I can see a lot of good straightforward arguments for longer names and just a bunch of fuzzy reasons for preferring short ones.

So maybe I should bias myself a bit more towards longer names?

vi - Naming synthesis

I've explored a few things regarding names here, but overall I'm more confused than when I started writing this. The only thing I can say with certainty is:

There are only two hard things in Computer Science: cache invalidation and naming things.

But I can summarize a few potentially useful ideas, namely:

That's it really, I'm ashamed at how short this list is.

I think there are four topics that I'd be interested in exploring more in a followup to this post.

One is the link between IDE-like tooling and names. For example, could emojis and colours be used when naming? I'm thinking, for example, having short yet expressive names that don't trigger a "verbal reasoning mode" via emojis. Or demarcating context or certain properties via colour (in part IDEs already do this with their highlighting). Is anyone experimenting with this? If so, please let me know.

Secondly, I'd be curious to run a name-analysis on popular open-source codebases and try to gather some aggregate statistics as well as look for patterns... maybe even find a fun way to use some Hugging Face models to extract some insight out of them.
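In case anyone wants to beat me to it, here's a rough sketch of what I have in mind (Python-only, built on the standard ast module; the repository path is a placeholder):

import ast
import pathlib
import statistics

def identifier_names(source):
    # Collect function, class, argument and variable names from one file.
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.append(node.name)
        elif isinstance(node, ast.arg):
            names.append(node.arg)
        elif isinstance(node, ast.Name):
            names.append(node.id)
    return names

def name_stats(repo_root):
    # Aggregate name-length statistics over every .py file in a checkout.
    lengths = []
    for path in pathlib.Path(repo_root).rglob("*.py"):
        try:
            lengths += [len(n) for n in identifier_names(path.read_text())]
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that don't parse cleanly
    return {
        "identifiers": len(lengths),
        "mean_length": statistics.mean(lengths),
        "median_length": statistics.median(lengths),
    }

print(name_stats("./some-open-source-repo"))  # placeholder path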

Third, I'd like to run a temporal change analysis on various codebases with a long git history to see how names progress as the codebase and number of maintainers grow.
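And a similarly rough sketch of the temporal version (again Python; it shells out to plain git commands and uses a crude regex instead of a real parser, so take the numbers with a grain of salt):

import re
import statistics
import subprocess

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def git(repo, *args):
    # Run a git command inside `repo` and return its stdout.
    out = subprocess.run(["git", "-C", repo, *args],
                         capture_output=True, text=True, check=True)
    return out.stdout

def mean_name_length_at(repo, rev):
    # Crude proxy: average identifier length across .py files at one commit.
    files = [f for f in git(repo, "ls-tree", "-r", "--name-only", rev).splitlines()
             if f.endswith(".py")]
    lengths = []
    for f in files:
        lengths += [len(tok) for tok in IDENT.findall(git(repo, "show", f"{rev}:{f}"))]
    return statistics.mean(lengths) if lengths else 0.0

def history_trend(repo, every_nth=50):
    # Sample every n-th commit (oldest first) and print the trend.
    commits = git(repo, "log", "--reverse", "--pretty=format:%H").splitlines()
    for rev in commits[::every_nth]:
        print(rev[:8], round(mean_name_length_at(repo, rev), 2))

history_trend("./some-long-lived-repo")  # placeholder path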

Fourth, I've been curious for a while to try an experiment where I provide a "math" and a "code" version of the same function to a bunch of people that are used to working with both representations and ask a few basic questions about it (how would you make it more efficient/shorter, how would you modify it to do x... etc). But I'm afraid selecting the demographic here is difficult and expensive (math undergrads that went into a master's/PhD program in CS or ML might work...).

I'm awfully sorry for naming some of the elephants in the room without providing proper closure, but I'm afraid doing so is nigh on impossible for a topic as global as naming.

Published on: 2021-03-01









