Replacing statistics with modern predictive models
Most fields of scientific inquiry rely on classical statistics to make inferences about the world based on experimental data.
What I mean by "classical statistics" differs from modern machine learning methods (modern predictive models) in the following ways:
- It makes assumptions about the world (other than homogeneity) without any observations to back up those assumptions. A good example of this is the central limit theorem.
- It uses models that were designed in the absence of computers, artificially limited by what people could calculate by hand or with primitive machinery, and thus often failing to extract most of the information available in the data. For more on that issue, see this post.
- In part due to the previous 2 points, it doesn't account for data contamination. It fails to make a distinction between fitting/training and validation/testing data.
- As a lesser failure mode of point 3, it fails to cross-validate its models, i.e. it fails to re-fit the models on a different sample of the data and check whether they hold on the remaining data.
I believe this is rather problematic. It has led to a replication crisis, where the vast majority of research in "social science" seems to fail replication and can sometimes be falsified by simply taking a better look at the original data.
In the more epistemic and less politicized scientific fields, where the incentives are presumably better, this seems to be somewhat checked by stronger data and better models. However, a huge amount of time may still be wasted on manual modelling, interesting correlations might be missed, and spurious correlations lead to time wasted on experiments headed towards a dead end.
Furthermore, the complexity of the statistical apparatus being used leaves the general public unable to understand scientific research. I think this is the rational seed that then develops into the idiotic superstition-based distrust of science which leads to things such as over 30% of the US population not wanting to get vaccinated against COVID-19.
I hope I can provide an easy to understand example of what a prediction-based approach to scientific inference would look like. The main reasons I champion this approach, which I hope to make evident throughout, are:
- Simplicity; It can be explained to an average 14-year-old (without obfuscating its shortcomings)
- Inferential power; It allows for more insight to be extracted out of data
- Safety against data manipulation; It disallows techniques that permit clever uses of "classical statistics" to generate fallacious conclusions
Furthermore, what I'm proposing here is not at all novel. Many of the studies I happen to read seem to have caught on and do use machine learning models to make predictive power the basis of their findings. But what I would argue for here is that this predictive methodology should be more standardized and made a requirement for passing peer review.
i - The basic predictive model
Suppose we have some observations (X), which I'll also call features, and we want to correlate them with another set of observations (Y), also called the target. How would we go about finding this correlation?
In a perfect world, we would find the most accurate function (m) that generates the value of Y based on X. Our predictive power score would be Y - m(X), the difference between the "real" observations and those inferred by our function based on X.
This invites 3 questions:
- How do we find this function?
- How do we make sure this function generalizes to future observations?
- How do we compute the difference between the predicted values and the real values?
In answering these 3 questions we can observe why predictive modelling is superior to classical approaches. Other issues will remain (e.g. how do we deconfound), but I will address these afterwards.
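In code terms, a naive version of this idea (before we worry about generalization) might look like the following sketch, where error_function is the subject of section iv:

```python
# A naive predictive power score: aggregate the differences between the real
# targets and the model's inferences. No cross-validation yet (see section iii).
def naive_predictive_power(m, X, Y, error_function):
    return sum(error_function(m(x), y) for x, y in zip(X, Y)) / len(Y)
```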
ii - Finding a model
In order to find this function (m) that best approximates Y given X we simply have to train a very powerful machine learning model as exhaustively as possible.
The "idealized" form of this is searching the space of all possible functions for the best m. The "down to earth" form of this is using techniques that have generated good predictive models in the past.
A linear or logistic regression, the method of choice for "classical statistics", is one such technique, but we ought to extend this space to techniques that have proven to often (or even always) yield better results. To me, the obvious candidates here would be:
- Various types of gradient boosting (such as XGBoost, CatBoost and LightGBM)
- Various "meta" algorithms that efficiently train many quick underlying models and select the best or ensemble the top (based on the openML benchmarks, I think h2oautoml is currently the best, but many alternatives exist, for example, auto-sklearn and pycaret)
While gradient boosters could be viewed as a very easy-to-train subset of neural networks, using a broader set of neural networks (or other automatic-differentiation-based techniques) might yield better results in theory, since they support much more complex operations, including those that would be needed for models in chemistry and physics, as well as processing very large sub-features (e.g. images, video, text and data with strong temporal relationships). I'm unaware of a technique that generalizes well enough to not require a team of specialists, though. A library I work on (lightwood) attempts to achieve this, but thus far it's still too limited and underperforms in too many edge cases for me to whole-heartedly recommend it.
In any case, all of the techniques listed above allow for searching a wider space of equations than what is permitted by "classical" statistical models, and are thus bound to lead to better or, at worst, equivalently good insights.
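To make this concrete, here's a minimal sketch of "try several model families and keep the best", with a few scikit-learn estimators as stand-ins; real AutoML libraries search far more exhaustively. It's a concrete stand-in for the search_best_model routine used in the pseudocode below (there, m is treated as a plain function; here it's a fitted model with a .predict method):

```python
# A toy model search: score each candidate family by cross-validated error
# (lower is better) and return the best one, refit on all the data.
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def search_best_model(X, Y):
    candidates = [LinearRegression(), RandomForestRegressor(), GradientBoostingRegressor()]
    errors = [-cross_val_score(c, X, Y, scoring="neg_mean_absolute_error", cv=5).mean()
              for c in candidates]
    best = candidates[errors.index(min(errors))]
    return best.fit(X, Y)
```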
iii - Making sure it generalizes
But, says anyone who's been following along, how do we ensure this model generalizes? After all, we might well be overfitting the data, finding correlations that don't generalize.
Enter cross-validation. The "strongest" form of cross-validation can be defined in pseudocode as follows:
from statistics import mean

errors = []
# Iterate over all observations, leaving a single one out each time
for i in range(len(X)):
    # Remove the single observation (X[i], Y[i]) from the data
    Xs = X[:i] + X[i+1:]
    Ys = Y[:i] + Y[i+1:]
    # Based on the remaining observations (all but the one we removed), find the best possible model `m`
    m = search_best_model(Xs, Ys)
    predicted_y = m(X[i])
    error = error_function(predicted_y, Y[i])
    errors.append(error)
# The resulting "mean_error" stands in for the predictive power, where "0" is the best predictive power possible
mean_error = mean(errors)
In plain English: we take out a single observation and find the function (m) based on all remaining observations, then we check how well m performs in predicting the observation we've taken out; in other words, we compute the error. Then we repeat this process with every single observation; the resulting mean error is how badly our model does.
In practice, this can work for small datasets, ones that only have tens of thousands of observations, and to be fair that should cover the vast majority of datasets. If we want to apply this to bigger datasets, we can simply split our data into a few subsets (also known as "folds") and find m using all but one of these subsets, then compute the error on that subset. This can result in a worse predictive power score, but if the experiment proves important enough to warrant finding the best possible power score, a supercomputer can be used to rerun the modelling. However, speaking from practical experience, this shouldn't make much of a difference once we get to a number like 20 or 30 folds.
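As a sketch, the k-fold variant could look like this, assuming the search_best_model routine from the earlier sketch returns a fitted scikit-learn-style model with a .predict method:

```python
# K-fold cross-validated predictive power score (20 folds by default);
# 0 remains the best possible score.
import numpy as np
from sklearn.model_selection import KFold

def predictive_power_score(X, Y, error_function, n_folds=20):
    errors = []
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
        m = search_best_model(X[train_idx], Y[train_idx])
        for prediction, real in zip(m.predict(X[test_idx]), Y[test_idx]):
            errors.append(error_function(prediction, real))
    return np.mean(errors)
```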
Again, the only downside here is that we might miss out on some findings, however we aren't detecting any "fake" insights.
This method is limited by the homogeneity assumption: that future data will be "similar" to our observations, i.e. that we didn't collect a biased sample. But this is a fundamental problem of science and it can't be avoided under any paradigm.
However, it doesn't require other often-faulty assumptions needed by classical models, such as independence, linearity, independence of errors and assumptions about the "expected shape" of the distribution of data or distribution of errors.
iv - The error function
The one missing element here is the error function, i.e. answering the question "How different is our inference from reality".
In the case of an unbiased categorical target (e.g. when the thing we are trying to infer boils down to a set of distinct categories, for example "true" or "false") this is very easy. We have an error of 1 when the prediction doesn't match reality and of 0 when it does.
In the case of a biased categorical target, this becomes harder, since we must figure out how to address that bias.
If we want to "evenly" address the bias, this is easy: we must use a balanced accuracy function. To illustrate how this works:
Assume you have 105 observations for your target: 85 are A, 15 are B and 5 are C. But in reality, we might see a 1/1/1 split, or, equivalently, we might care 17 times as much about being correct about C as about A, and 5.6 times as much about being correct about B as about A.
Then the error function we use can be written in pseudocode as:
if reality == A:
    error = 0 if prediction == A else (1 - 85/105)
if reality == B:
    error = 0 if prediction == B else (1 - 15/105)
if reality == C:
    error = 0 if prediction == C else (1 - 5/105)
Note: Our "maximum" mean error will now be < 1, but we can normalize the above function to avoid this, however that would make the code less readable.
More broadly, the above error function and functions like it can serve to bias (or unbias) our model towards any observation we want. For example, if we are trying to predict a very common, curable and horrible disease, we might want to bias our model in an unbalanced way, since the downside is very small (further investigations to confirm the diagnosis) and the upside is very large (getting a cure for a horrible disease).
The function gets more complex when we care about the relationships between variables (e.g. predicting A when reality is C is really, really bad, but predicting B when reality is C doesn't cause much of an issue).
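That kind of requirement can be encoded in a pairwise cost table; the numbers in this sketch are purely illustrative:

```python
# An error function that also encodes how bad each specific confusion is.
PAIR_COST = {
    ("C", "A"): 1.0,   # predicting A when reality is C is really, really bad
    ("C", "B"): 0.1,   # predicting B when reality is C is barely an issue
    ("A", "B"): 0.3, ("A", "C"): 0.3,
    ("B", "A"): 0.2, ("B", "C"): 0.2,
}

def pairwise_error(prediction, reality):
    return 0.0 if prediction == reality else PAIR_COST[(reality, prediction)]
```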
For numerical values, the problem is equally complex. In practice, classical methods end up relying on a small set of functions, such as RMS error, so we can always just use those.
However, an important thing to note is that currently employed statistical methods already use such an error function, either implicitly or explicitly. So, again, the worst-case scenario is doing just as badly as current methods, not worse.
When people look at a correlation based on a linear regression (i.e. in most cases) they are implicitly using an RMS (root mean squared) error function.
So if well-established error functions for studying certain phenomena exist, we can keep using them.
This does, however, invite a dreadful question for any would-be scientific enterprise:
What do we actually care about in trying to build this theory?
Making the error function explicit, rather than leaving it implicit in the models, removes a bunch of "hidden" bias from the results. This is very good.
An example:
Assume a hypothetically racist scientist is trying to modify an existing arrest-making algorithm to justify arrests that take into account the suspect's skin colour. This is very much a real problem, and currently it can be masked by the researcher saying:
This racial-profiling model improves the correlation between arrests made and thefts prevented (p<0.05), thus it can be viewed as a net improvement over previous methods. <A lot of blabbering and somewhere in the methodology section we find out the correlation is improved from 0.1 to 0.104
But the above statement could mean something like:
This updated version of the model that takes into account skin pigmentation results in 1200 arrests compared to the baseline of 1000. Out of those 1200, 125 are justified, while out of those 1000, 100 are justified
By having to describe his loss function, the researcher would have to lay a few biases bare:
- Arresting 100 people unjustly for every 1.04 thief caught is a worthwhile goal.
- Catching 25% more thefts justifies arresting 20% more people.
- Improving just-arrests by 4% is worth the potential runoff consequences of segregating citizens' rights by their phenotype.
If any of those assumptions didn't hold, then the resulting loss function might well determine that the racially biased model is worse. For example, the loss function might dictate that catching a smaller % of thefts is justified if fewer people are arrested, which might be a real goal (after all, policing theft may well be viewed as being about introducing a "risk" to prevent runoff thefts, not about catching thieves, who do minimal damage in the grand scheme of things).
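A toy version of making that loss explicit, where the two cost parameters are hypothetical and choosing them is precisely the bias that has to be laid bare:

```python
# Lower loss is "better" under the chosen costs; the verdict flips with them.
def policy_loss(arrests, justified, cost_per_unjust_arrest, value_per_justified_arrest):
    unjust = arrests - justified
    return cost_per_unjust_arrest * unjust - value_per_justified_arrest * justified

for cost in (0.5, 2.0):
    baseline = policy_loss(1000, 100, cost, value_per_justified_arrest=5.0)
    modified = policy_loss(1200, 125, cost, value_per_justified_arrest=5.0)
    print(cost, baseline, modified)  # the modified model only "wins" under the lower cost
```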
This is all hypothetical; the point is that these biases MUST be laid bare, especially in research that affects policy, and that they can currently be hidden away inside "peer-validated statistical models".
v - P-values
One instrument that is lost here, you may notice, is the p-value. Predictive models don't have p-values.
This is a feature, not a bug. P-values are an artefact of nonsensical statistical models (e.g. a t-test) which the vast majority of scientists don't understand and which are impossible to explain truthfully to the general public.
The "standard" explanation that most people seem to use is that they represent the chance of a correlation being spurious, e.g. due to a small sample. In technical terms, this is utter bullshit; it relies on assumptions about the distribution of errors that have no equivalence in the real world.
The useful things which p-values are trying to express can however be extracted from predictive models. These are:
- Sample size, i.e. the number of observations.
- The (discrete) error distribution based on the cross-validation.
The sample size is something that "normal" people that don't have an inclination towards mathematics can relate to. This is doubly important if most researchers don't have an inclination towards mathematics, but rather maintain a facade of "understanding" out of a fear of being derided by their (equally clueless and scared) peers.
Compare the statement:
We've figured out that a vaccine has no life-threatening side effects by testing it on 100,000 people.
with:
We've figured out that a vaccine has no life-threatening side effects, p<0.01
The former is something that people can actually "get". The latter is something that even a mildly erudite individual might misinterpret, since the same p-value could be present in a quacky study about panpsychism on 80 people, given enough luck, subtle data manipulation, and ignoring similar studies that failed to find an effect.
Sample sizes speak volumes, and they also "guard" against the phenomenon where 20 studies might fail to find an effect but the 21st does, since fast-fire studies only work when small sample sizes are allowed. If we cared more about sample sizes than about p-values, we'd probably set rules around sample sizes for various experiment types and have a hierarchy of experimental evidence, where only very large samples are considered conclusive and small samples are considered indicative.
Note: this sample-size > p-value approach is already starting to become the status quo, for example, most modern medical studies, and to my knowledge, most of physics, never really took p-value seriously.
The error distribution is a more complex metric to look at, but it could be very valuable, since it would allow spotting "grey swan" type errors: errors that are rare and significant (but nonetheless present in our observations) which might cast doubt on the risk of using the model.
An obvious example here is the difference between the error distributions [1,0,0,0,0] and [0.2,0.2,0.2,0.2,0.2]. Depending on our underlying target, one or the other might be greatly more desirable, even though they average out to the same predictive power score.
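A two-line illustration of why the mean alone hides this:

```python
import numpy as np

errors_a = np.array([1, 0, 0, 0, 0])
errors_b = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
print(errors_a.mean(), errors_b.mean())  # same predictive power score: 0.2 and 0.2
print(errors_a.max(), errors_b.max())    # very different worst case: 1.0 vs 0.2
```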
Figuring out the implications of the error distribution is a difficult task that requires domain expertise. But p-values don't take away the need for this task, they just induce the false belief that it's no longer needed.
Assumptions about degrees of freedom and tests that use them are pointless, and using them complicates the math without accomplishing anything. I invite any reader that's curious about the topic to read, for example, the state of the art in avoiding p-hacking in psychology (a paper that proposes a list of, and explains the use of, degrees of freedom in computing the p-value) to understand exactly how nonsensical the whole idea is.
vi - Deconfounding
Another thing that the predictive methodology described here doesn't account for is deconfounding.
I won't go into how (usually ANOVA-based) deconfounding is currently done, since it's complex, convoluted, varies on a field-by-field basis and, as before, relies on untrue assumptions about how the world conforms to straight lines.
But a predictive methodology has a much simpler and stronger way for deconfounding.
Assume you have two variables A and B and want to predict C based on B; however, you're worried that A influences B, and thus the effect of B is actually that of A. Is there an easy way to find this out?
Obviously:
- Find a predictive power score for C with models that use A and B (let's call it e(AB))
- Find a predictive power score for C with models that use only A (let's call it e(A))
We can then say that the predictive power of B, deconfounded, is equivalent to e(A) - e(AB). This is not quite true if the error doesn't scale linearly, in which case the formula gets a bit more complicated and case-specific. But the gist of it remains the same: build a model that uses both A and B, and one that uses only A; the difference between the two is the influence of B.
Note: If this is a negative number, it means B is random in relation to C and is thus harming the model's accuracy. Also, if e(AB) > e(B), which would mean A is actually sometimes harming the model's accuracy, we could instead compute e(A) - e(B).
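A minimal sketch of this e(A) vs e(AB) comparison on synthetic data, with scikit-learn's gradient boosting standing in for "the best model we can find":

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
A = rng.normal(size=1000)
B = A + rng.normal(scale=0.5, size=1000)                # B is partly driven by A
C = 2 * A + 0.5 * B + rng.normal(scale=0.1, size=1000)

def cv_error(features):
    # Cross-validated mean absolute error (lower is better)
    scores = cross_val_score(GradientBoostingRegressor(), features, C,
                             scoring="neg_mean_absolute_error", cv=20)
    return -scores.mean()

e_A = cv_error(A.reshape(-1, 1))
e_AB = cv_error(np.column_stack([A, B]))
print("Deconfounded predictive power of B:", e_A - e_AB)
```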
But what if we want to extract an A-independent predictive component out of B? This is a bit outside the scope of this post, but my intuition is that it can be better done by constructing an autoencoder that generates an embedding for B, then figuring out a sub-space of that embedding with a maximally high error in a model where we try to predict A as a function of B. Furthermore, this seems to be an approach that encompasses ANOVA-based methods for this but potentially allows for extracting more information. However, it's a bit outside of my area of expertise to speak on this.
Accounting for confounders outside of the data is still impossible, but that is simply the limit of correlation. Either we confine our experiments to very strictly controlled environments (e.g. physics, chemistry), or (barring usage of chi-square tests with a whimsical list of a magical number of degrees of freedom) we accept that our experiments are biased by their environments and try to account for this by experimenting in as many different settings as possible, using large samples, and not claiming to have "found" something until there are several replications.
vii - Rephrasing
I assume some people might be confused by how to rephrase various problems in terms of a predictive power score.
This is a matter of "intuition" that I can't quite explain. But if a problem can't be reduced to an inferential issue then it's not a matter for science at all. Science ultimately concerns itself with describing reality in terms of cause and effect (or, better put, in terms of probabilities and event sequences) and thus all of its conclusions are predictions.
For a very simple example, gravity "predicts" that a feather and an apple will fall with almost equal acceleration. More broadly it predicts the acceleration of any object in a vacuum based on the mass and shape of all other objects in the universe (though realistically we usually reduce this to the 1 or 2 objects that comprise 99.9..9% of the effect).
For a more complex example, how do we go about finding a lack of relationship between two variables?
Easy: we take a "random" model and see what its predictive power score is, then we compute our predictive power score and look at the difference between the two. If this is ~0, then we are doing no better than random. The problem that arises here, obviously, is defining what exactly "random" is, which is closely related to defining the null hypothesis. However, by defining an error function as described above, we are closer to defining what random means for our problem.
For example, we might look at the incidence of death, or of a list of all serious diagnosable conditions and their incidence in the general population, or in a control group, or both.
Then we might generate a predictive score for each of them based on various factors (gender, age, preconditions).
Now, to see if a drug increases the risk of death or of a dangerous condition, we compute the predictive power score in our trial group (people taking the drug) and look at the differences.
We could also generate a predictive model on the trial and control group (that doesn't differentiate between the two) for those conditions. Then generate one that does differentiate between the two (i.e. "has taken drug" becomes a variable), and again, look at the difference between them. Ideally, this approach and the one above yield a similar result, otherwise it's likely we picked a biased trial and/or control group.
Overall, if you're having a hard time thinking about your findings as predictions you probably need a more solid epistemological foundation. But I suspect most people will quickly find that any finding can be rephrased as such.
viii - Inferring out of distribution
The other big problem here is making out of sample inferences.
Gradient boosting using decision trees, for example, is probably unideal here. Given the data X = [1,2,3,4,5] and Y = [1,4,9,16,25], it would fail to find the simple "square the input" rule.
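A quick illustration, assuming scikit-learn: a tree-based booster fit on x in [1, 5] cannot extrapolate the squaring rule outside the range it has seen.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

X = np.array([[1], [2], [3], [4], [5]])
Y = np.array([1, 4, 9, 16, 25])
model = GradientBoostingRegressor().fit(X, Y)
print(model.predict([[6]]))  # nowhere near 36; tree ensembles predict within the target range they've seen
```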
To this I would give a few observations:
- Finding easy-to-fit models that include much more complex operations. As it stands, this requires a team of qualified machine learning practitioners and domain experts working together. It's not impossible, but it's much too difficult and expensive for a grad student to run on their laptop overnight.
- Getting better samples. Out-of-sample generalization seems to be more of a fluke than a rule. It seems that in the vast majority of cases the rules we find on biased samples don't generalize. Some do (e.g. gravity), but they seem to be the exception rather than the rule. If you look around you and try to establish rules in a "commonsensical" way with only the data available to you, you will get ideas such as "all couches are blue" and "desks are between 50cm and 80cm tall", and it's mostly chance that might make a few of these observations generalize. So I think the problem can be washed under the rug as "selection bias".
- Don't apply predictive modelling to fields that have a proven track record of being able to infer out of the sample distribution. The model I'm describing here works much better for "soft" science, ending at most branches of biology and some branches of chemistry and physics. The only place where we have sometimes succeeded in predicting out-of-sample data seems to be physics and chemistry, and the models there are much more sophisticated, thus making the gains from this paradigm smaller.
- Model pruning. The reason ML models don't currently generalize out of sample is that they don't really respect Ockham's Razor; they don't look for the simplest possible explanation. In part, this is because ML is being used in fields where "simple" models don't work to begin with, and where one simply can't generalize from a few examples via straight lines. Many techniques for simplifying models do exist, mainly to save computing power, and it seems feasible that, if the goal were shifted to reducing complexity in a fashion that increases generalizability, better ones could be invented. At any rate, model pruning is still relatively new. I believe PyTorch introduced functionality for it less than a year ago (and it might still be unavailable in TensorFlow); it's an old concept but one that died down until recently, so I hope I'll be able to come back to this in 2 or 3 years with more prescriptive claims.
Alas, out-of-sample inference, it seems to me, remains the main reason why whole swaths of science remain the domain of hand-made models, although this area is being reduced. I think it's an interesting problem, one I'm considering dedicating myself to, and I'd be very curious to hear what the latest research in this area is, since I'm not that up to date on it.
ix - Nothing new under the sun
I'd like to say that some of the ideas here are "mine", and in a sense I can, since I came to some of them independently a long time ago because they are obvious and intuitive; but nothing that I'm prescribing here is new.
The cross-validation technique for generating a predictive power score is basically described in this paper from 1973, and earlier occurrences might exist. They don't make mention of different error functions, but that's about it, and the problem of choosing an error function is as old as time; to some extent even Bacon describes it in the books that arguably originated the whole thing we call science.
So why is faulty statistics still being used then?
In part, I'm sure it boils down to some edge cases where predictive models are impossible or too hard to use. I'd love to hear about them, since I operate at a "very" meta level vis-a-vis the whole scientific enterprise (I read papers, I don't run experiments), and I'm sure some of my arguments would grow more tempered and useful if I were able to see those edge cases.
In part, I'm afraid, it's due to malicious reasons. The errors are mainly found in social sciences, and given that those take root from psychoanalysis and phrenology one can trace the genealogy of "political and status signalling motivations to adopt the mantra of science while not having to adhere to scientific rigour".
But I do think there is a middle ground, mainly around research in the broader sphere of biology, where statistical artefacts are bogging things down due to lack of knowledge among practitioners.
It amazes me, for example, that Horvath is hailed as a genius for applying ML techniques from 3rd-year Stanford courses to the subject of epigenetics (which is not to say Horvath is not a genius, he most definitely is, but his methods of finding correlations are what should be viewed as bog-standard, not novel and adventurous).
I've been incidentally reading a few neuroscience papers recently, and some (most) are so horribly flawed in the way they try to apply machine learning techniques that it just boggles the mind. The example above is particularly egregious, but in principle these seem like people with altruistic motivations, or at least non-malicious people that got into a field they hate and now must roll with the punches. So why are they 48 years behind on the "latest" approaches?
I think there might be a middle ground of scientific study that contains:
- Data that doesn't generalize well out of sample (or where "out of sample" is not a thing)
- Researchers that are motivated to do good work, but can't dig through all the literature and lack the common sense and from-first-principles approaches to just demolish all the bullshit.
- Researchers that lack a rigorous framework for doing predictive modelling, and so fall back on flawed statistical methods, around which there is more of a consensus.
It's the kind of demographic I'm hoping to reach. Not alone, of course, but I hope this is one of many articles that serve to slowly raise the sanity waterline in terms of methodology and to drag more of research (be it kicking and screaming) into the computer era.
This work of popularization obviously started a long time ago. People like Taleb have introduced me and many other confused kids to the idea that science is overreaching via bad models. But a lot of these popular writers seem to fall into the other side of the fallacy, where they discard too much of science and build their quack theories on top of the rubble.
I am on the side of the scientific method; it works wonders even in the hands of apes, and, alas, it's the best we have, so we might as well try. Throwing out "classical statistics" seems like an obvious next step, more people seem to be taking that step, and if I help 2 or 3 people further along in doing that I will consider the time writing this article well spent.
x - FAQ
I'm sure there will be a lot of objections and confusion around my prescriptions here. I tried running this article by a few people and I will try to summarize the discussions here via a Q&A format.
I'm paraphrasing for the sake of not copypasting pages of messages with reduced substance.
- Q: "Bayesian statistics" don't have the flaws you assign to "classical statistics", the definition of which seems to mainly refer to "frequentists statistics"?
A: The word "bayesian statistic" is so vague, to the point that when someone uses the term I have a better prior on the fact they eat paleo and hold crypto, than on whether or not they think a probability can be absolute 1 or 0.
But there does seem to be a resurgence of saner approaches under the term "Bayesian statistics", and in part I agree with most of the claims under that umbrella. Indeed, nothing is stopping you from modelling your priors into the function you fit and cross-validate, or from using the predictive power score as input to a prior-updating function.
In practice, I think what I see under this umbrella often misses the critical points I'm making, namely: cross-validate, name your loss/error function explicitly, use the best possible model. On top of that, most applications of it I've seen model their priors with a narrow set of "common" distributions and use simplistic techniques to update them, which to me seems just as misguided as the classical approach.
So I would say that no, what is covered under Bayesian statistics does not address the issues of "classical/frequentist" approaches, but a certain niche of it might; however, that niche basically converges onto roughly the same techniques I'm describing here.
However, naming one's priors is an important thing that I'm glossing over here, and how to model those in a predictive framework would be interesting.
- Q: Aren't neural networks, gradient boosters and other high-parameter-count models that are easy to fit horrible at generalization, hence why simpler models are often used?
A: Sometimes, yes, but at most that yields a bad predictive power score, since the way we generate the score (cross-validation) guards against overfitting. Also, I'm not suggesting using those models for everything; my most general take is that you should use meta-models that determine the best model to use (e.g. auto-sklearn).
- Q: Won't machine learning models fail to generalize as much as linear models?
A: Do you believe that the world is made up of lines, or that people gathered data using linear concepts since for a long time they could only think mathematically with lines? Depending on your answer here, this is not an issue.
If you do think that the world is made up of lines, then have no fear, the technique I'm describing above still holds, you just need to make sure to always optimize a linear equation instead of searching for the best possible model.
- Q: Aren't you kind of glossing over 100+ years of applied statistics without addressing some of the finer details?
A: What I'm trying to do here is not explain why classical statistics is wrong in every single instance, but rather showcase a model which is obviously better by virtue of it being easy to understand and more intuitively correct.
My other argument is that predictive models cut down on the biases present in classical statistics (see first few headings), so by definition they are better, they paint a clearer picture of the world by having better assumptions.
Trying to defeat 100+ years of models based on fallacious assumptions that were made because we lacked better hardware is pointless. It's enough to do away with the fallacious assumptions, now that hardware has improved, and assume that every model based on said assumptions is mistaken until proven otherwise in a fair comparison.
- Q: The more I dig into the article the less sense it makes.
A: Sorry, but the first 4 headings should be enough to address the general technique, why it can do away with various assumptions that statistics needs, why it's bound to generate better or equal inferences and why it's easier to understand and guards against hidden biases. So hopefully those bits should be sufficient for most people.
- Q: Your approach to deconfounding is limited by the fact that the "mixed" model can fail to fit properly while the single-variable/take-one-out model ends up close to perfectly fit or vice versa.
A: Yes, and this is a thing that pesters me to no end, since I've found no perfect way to do deconfounding thus far.
That being said, it seems to me like the linear decomposition methods currently used can also run into "random" issues where a correlation is overlooked or over-emphasized via an artefact of the model.
Unlike the other situations, where I can claim that the "classical" approach is inferior since the predictive approach "includes" it, in this case, I can't make such a claim, and I'm unqualified to do the comparison. I'd really like some further thoughts around this one.
- Q: This approach seems kind of hard for people that can't code.
A: It is, and one of the things I am personally working on is bringing these kinds of models to people that are computer-impaired via graphical or SQL-based interfaces, so I agree.
That being said, code is systematized rigour; it's mathematics/logic that can be computer-validated rather than subjected to the shaky parsing of the human mind. So, ahm, just take a few weeks and learn some basic Python?
- Q: In my subfield I need <domain-specific variable[s] derived from classical statistical approaches> in order to publish. How would this work then?
A: It wouldn't, but you can always run the "classical" analysis to get irrelevant numbers required for the ritual of publishing (e.g. p-values).
If, however, the goal of your paper is to compute a coefficient for another equation, i.e. if your paper has no real-world target, it just targets a different model, then you're either doing physics and don't need this article, or you should reconsider your area of research.
- Q: What about feature/variable importance?
A: You can extract it out of most predictive models, and you can figure it out individually using take-one-out / leave-one-in models similar to the approach used for deconfounding.
But in non-linear-land, the concept of feature importance is often fuzzier by definition; the relationships between features are also non-linear, so you can't just assign each feature a number between 0 and 1 that holds in all cases and call it a day.
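For what it's worth, a model-agnostic sketch of the "extract it out of most predictive models" route, using scikit-learn's permutation importance on synthetic data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=500)

model = GradientBoostingRegressor().fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # feature 0 dominates, feature 2 is ~0
```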
- Q: Your term "classical statistics" is a broad and vague umbrella that covers some very useful models?
A: I'm trying to disparage the statistical paradigm that claims assumptions other than homogeneity are required to perform scientific analysis and/or that they are "inherent to the world".
A lot of the models in classical statistics are the same as the ones used in the machine learning approaches that would probably work for simple problems. And I'm not saying we should do away with them, I'm saying we should treat their assumptions with caution.
For example, fitting a naive Bayesian estimator with a Gaussian prior is 100% fine, as long as the resulting predictive power is good, but you can't fit it and not validate the results, nor can you assume that the Gaussian prior must be correct and call it "data gathering error" if the results fail to fit.
Parts of the world are modelable by straight lines and that's fine, but we should use models that allow for the existence of arbitrary complexity.
- Q: Why prioritize the homogeneity assumption over everything else?
A: Because unless that assumption holds the whole scientific enterprise is moot, and we all might as well convert to Last Thursdayism. Of course, this means we have to contextualize findings in terms of the experimental setup for any results to hold and start generalizing more and more based on further experiments and real-world applications (which serve as experiment, ultimately), but this is already something that has to happen for any scientific finding outside of completely imaginary/ludic fields.
The alternative to "context-dependent homogeneity" is "homogeneity with exceptions", e.g. "homogeneity but everything must fit a straight line", "homogeneity outside of 45 magically dictated degrees of freedom", which are unfounded assumptions that served as crutches for very simple experiments before the advent of modern mathematics and sufficient funding to run replications, and held poorly and only in very specific fields (e.g. statics).
- Q: What if probability theory is not the correct way to describe the world, and e.g. complex probabilities are more suitable for modelling reality?
A: If this turns out to be true then I think the general idea would still hold, but defining error functions and models might become significantly more complex (pun not intended), but I don't understand enough about complex probabilities to have a high certainty about this.