
Against private clinical data

Note: This article is specifically about the public healthcare systems as implemented by most European countries. It does not apply to the US healthcare system.

A few months ago covid-19 became a thing, and I hoped I could get access to some raw data to check how bad things were. I saw a lot of studies coming out with low n and underpowered statistical methods, and I naively thought, gee, I'd better take a look at this myself.

After an hour or so of digging, the best I found was the DXY spreadsheet.

It consists of random cases compiled by doctors and data aggregators manually filling in a spreadsheet, containing age (usually), gender (usually), and status (i.e. recovered/dead/case still in progress... but very rarely, in maybe 1/10 records). It's missing a lot of important things such as:

This seems odd, but whatever; by this point de-facto martial law has been declared where I live and mandatory quarantine has been instated, so I don't have to worry much about infection and fatality risk one way or another.

Then, 2 weeks later, the company I work at, a company doing ML-related stuff, decides that analyzing the covid-19 data in order to make predictions about at-risk cases is something we should be doing... who knows, maybe we can help, and we can probably get some publicity out of it.

I say fine, and I go searching for data for what ought to have been at least 4 hours. I find nothing other than the spreadsheet linked above, now with more cases but with worse-quality data for those (which is an impressive feat, considering how little data it had to begin with).

I also find a few other data sources, the best one being data for cases in South Korea. But they all have the same minimal information I mentioned above, information that is insufficient to make any worthwhile inference.

So I start posting questions on various forums and subreddits, I start google translating Chinese and Korean medical discussion boards, and I once again dig through 10+ pages of search engine results.

Patient Privacy

As far as I can tell, this lack of data is related to patient privacy. If you are diagnosed with, or even tested for, any disease, including covid-19, you are now a "patient". So all data relating to your case is now protected under a set of arcane laws that make it more expensive to release to the general public than it is to treat the patient.

In theory, a very altruistic hospital might go through the effort of anonymizing the data and releasing it. In practice, that means months of bureaucracy, millions of dollars spent filling in the paperwork, and greasing whatever wheels need some lard. For that, you are awarded a high risk of getting sued for many more millions and potential negative publicity for being reckless with patient data.

As far as I'm concerned, this generalizes to data about any disease or condition someone might have. For most diseases the problem is less obvious, because plenty of data from previous studies exists, as does data from before our modern privacy laws.

Bioethical murder

Let's for a second analyze the cost of not releasing this data.

In general, the way any novel disease will be handled is by associating it with some set of symptoms that we have a treatment for, thus prescribing a treatment protocol to serve as the basis for taking care of the patient.

In the case of covid-19, the main candidate was ARDS (Acute Respiratory Distress Syndrome), but some people were saying it usually more closely matches HAPE (High Altitude Pulmonary Edema), and I'm sure there are 10 more alternative ideas out there.

Let's say that, by looking at the data, we could find a better symptom -> treatment-protocol mapping. Not anything very glamorous, maybe something that reduces the number of cases that die in hospitals by ~5%.

That's ~11,000 people who are now dead and who would otherwise be alive.
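The arithmetic behind that figure can be sketched roughly as follows. The text does not state the baseline number of hospital deaths; the 220,000 below is only the value implied by a ~5% reduction yielding ~11,000, so treat it as an assumption rather than a sourced statistic.

```python
# Back-of-the-envelope check of the "~5% fewer hospital deaths" figure.
# ASSUMPTION: the baseline is not given in the text; 220,000 is just the
# value implied by 11,000 / 0.05, not a sourced statistic.

def deaths_averted(baseline_deaths: int, relative_reduction: float) -> float:
    """Deaths avoided for a given relative reduction in hospital mortality."""
    return baseline_deaths * relative_reduction

implied_baseline = round(11_000 / 0.05)
averted = deaths_averted(implied_baseline, 0.05)
print(implied_baseline, round(averted))
```

Changing `relative_reduction` shows how sensitive the headline number is to that single guess.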

But consider that, by analyzing the data, we could probably get information about many other things, from how to triage at-risk people, to how various drugs affect the disease (thus helping aim research), to pinpointing very narrow risk demographics that we can divert efforts towards keeping in isolation.

Or, consider that we see something like "95% of patients said they didn't wear masks, 5% said they did, but in the general population of their city, 25% of people report wearing masks when they go outside". While not conclusive, and somewhat simplified (e.g. people are biased towards thinking they didn't wear a mask when infected, even if they sometimes wore one), this could still provide guidance on mask-wearing. Or rather, it could have provided that guidance 2 months ago.
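As a rough illustration of what such numbers would suggest, here is a minimal sketch that turns the hypothetical percentages above into a relative risk using Bayes' rule. The function name and parameters are made up for this example, and the calculation ignores the recall bias and confounding mentioned above.

```python
# Rough relative-risk estimate from the hypothetical mask figures above.
# By Bayes' rule, P(infected | mask) is proportional to
# P(mask | infected) / P(mask), and likewise for people without masks.

def relative_risk(exposed_among_cases: float, exposed_in_population: float) -> float:
    """Infection risk of the exposed group relative to the unexposed group."""
    risk_exposed = exposed_among_cases / exposed_in_population
    risk_unexposed = (1 - exposed_among_cases) / (1 - exposed_in_population)
    return risk_exposed / risk_unexposed

# 5% of patients report wearing masks, versus 25% of the general population.
mask_rr = relative_risk(exposed_among_cases=0.05, exposed_in_population=0.25)
print(f"{mask_rr:.2f}")
```

A value well below 1 would hint that mask-wearers are under-represented among patients, which is suggestive rather than conclusive.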

Also, keep in mind that it wouldn't be just me or you looking at this data; in the real world you could probably gather hundreds of top-notch experts in epidemiology, general medicine, ICU medicine, statistics, machine learning, applied mathematical modeling, etc.

Some of these people have access to some of this data now, but not to all of it. I think that, by this point, I've seen studies with numbers in the high hundreds. On the other hand, there should be 1 to 4 million data points to work with, maybe lower if you want higher quality data, 500,000, or even 100,000 or even just 1,000 in the very beginning. However, that's still much more than what we have now, where people can at most get data from a few hospitals.
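To give a feel for why the sample size matters, here is a small sketch of how the uncertainty around an estimated rate shrinks as n grows, using the standard normal-approximation interval for a proportion. The 5% rate is an arbitrary illustrative value, and 800 is just one reading of "high hundreds".

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# ASSUMPTION: p = 0.05 is an arbitrary illustrative rate; the sample sizes
# echo the ones in the text, with "high hundreds" taken as 800.
for n in (800, 1_000, 100_000, 1_000_000):
    print(n, round(ci_half_width(0.05, n), 4))
```

The interval narrows with the square root of n, so going from hundreds of records to a million tightens the estimate by a factor of roughly 35.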

To think that this data wouldn't have led to models that would have helped save lives and halt infection is to throw away the assumption that the scientific method works to cure disease and guide medicine. Our default assumption should be that observing a phenomenon will help us understand it and model its flow. I don't see why this would be different with covid-19.

Maybe it would have been only 1,000 deaths averted, maybe 100,000, but lives would have been saved, and many people who now have chronic damage might have avoided it.

Not to mention, this is still ongoing: every day a few thousand people die, and every day some hundreds or thousands could be saved by models based on the unreleased data.

But releasing medical data is bad

Going into why we think releasing anonymized medical data is bad is beyond the scope of this article.

However, the most fundamental argument against releasing private medical data, with or without the patients' consent, centers itself around:

a) Releasing medical data that is not properly anonymized might be harmful, and the patient might not be aware of the harm when signing the consent form

b) The patient has a right to refuse the release of their data

So basically it boils down to avoiding harm and to maintaining a niche of our right to privacy.

This argument holds some water in the normal world, but currently, most European countries are:

We do all of this, and we think it's worthwhile because it saves lives.

But we can't release data from infected people's last visit to the hospital, because obviously medical privacy trumps the fundamental rights enshrined in most democratic constitutions.

I don't claim to know exact numbers here, or even inexact numbers. Maybe releasing the data saves tens of thousands, maybe hundreds of thousands, or maybe almost everyone avoids death because someone smart notices an unlikely pattern which holds the key to it all.

This would be hard in e.g. the US, but in the EU most hospitals in most countries belong to a national healthcare system, subordinate to the government. The governments could, at any time, demand that each hospital deliver a spreadsheet of covid-19 patient data in a standardized format. Most doctors would happily comply, because I'm pretty sure most of them are also in the life-saving business.

This might be against the law, yes, but so is taking away half the fundamental rights enshrined in the constitution, and that's why emergency-related laws and martial law exist. I think that, given our current sacrifices, nobody would see it as an abuse of power to clamor for a bit of extra clinical data without passing it through a years-long approval process.

Maybe clinical data would serve to only reduce fatalities by 1%, maybe by 99%. But, at any rate, we currently live in a world where taking away the rights most fundamental to a free-market democracy and plunging our nations into economic recessions that could last for years is seen as a reasonable cost. Releasing a few weeks of hospitalization records for the same purpose is seen as going too far.

This is so mind-bogglingly stupid that I will mark it as a new height of regulatory failure during my lifetime.

Published on: 2020-05-12


