Your NHS data is completely anonymous – until it isn’t

Can you ever be anonymous in your doctor’s surgery if your doctor isn’t? MTSOfan

The debate about uploading nearly all data from UK medical practices to a national database continues to cause concern. Responding to fears that the information held in the database will put patient privacy and security at risk, Tim Kelsey, NHS England’s national director for patients and information, told the BBC: “Can I be categorical? No one who uses this data will know who you are.”

But a look at the information available in the database shows that Kelsey either knows something we don’t, or doesn’t fully understand anonymisation.

Definitive information about the contents of the database, and about its potential use and abuse, is still missing. Not everyone has received the information leaflet that introduces the system into which they will be automatically enrolled, and not all GP practices are fully aware of how people can opt out.

Contradictory information is being circulated about just who will get access to your health data through the project. The position on insurance companies, for example, is ambiguous. The NHS has recently said no data will go to insurance companies at all, yet Bupa, a private healthcare company that sells insurance, has been approved as one of the organisations that can access it.

And whatever the current situation actually is, there is a sense that later legislation or privatisation may change it all anyway.

But what the NHS means by anonymisation is possibly the biggest question of all in this debate. There are three levels of data in the system: red data contains highly personal information and will be tightly regulated, while green data contains no personal information and will be shared. The category in the middle, “orange” data, is the main battlefield and source of confusion. This data will be shared outside the NHS, and sold (at cost) to “suitable” third parties.

In order to protect the patient and make sure the data is still useful to whoever is accessing it, the best option is “pseudonymisation”. The data will contain personal information like diagnoses and prescriptions but will be attached to a single meaningless pseudonym rather than a name. To produce it, postcode, NHS number, gender and date of birth are removed from red data.

However, even pseudonymised health data can be re-identified. The story of Latanya Sweeney, who picked out her state governor’s records from a supposedly anonymised health dataset to prove to him that they were not secure, shows that this is not just a far-fetched theory punted around by privacy advocates.

But some in the NHS appear to be oblivious to this. Kelsey’s emphatic comment that no one would be able to tell who you are, made in the specific context of pseudonymised data, was particularly surprising. If he looked a little more closely at the kind of information that will be stored in the database, he might not be so confident.

Zoning in

A fully pseudonymised database will have records consisting of a meaningless patient identifier, a clinician, an event date, a “Read Code”, and some data associated with the code. Read Codes are four- or five-character shorthands for “patient phenomena”, and an official database of nearly 300,000 of them exists. Some 50,000 of these represent different prescriptions. The full database is available from HSCIC with a licence, but many of the codes can be found online.

This information is sometimes enough to rediscover missing information about an individual patient and, more often, can at least be used to progressively narrow down a set of possible people until only one is left. We are in a context of “big data” here, so this kind of activity can be performed systematically and automatically.
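To make the narrowing-down idea concrete, here is a minimal Python sketch. It assumes an attacker holds some list of possible people and a series of clues inferred from the health records; the population, the field names and the `narrow` helper are all invented for illustration and do not describe any real NHS dataset or tool.

```python
from typing import Callable, Dict, List

Person = Dict[str, object]


def narrow(candidates: List[Person],
           clues: List[Callable[[Person], bool]]) -> List[int]:
    """Apply each clue in turn; return the candidate count after each step."""
    sizes = [len(candidates)]
    remaining = candidates
    for clue in clues:
        remaining = [p for p in remaining if clue(p)]
        sizes.append(len(remaining))
    return sizes


# Entirely made-up population, for illustration only.
people = [
    {"birth_year": 1980, "gender": "F", "gp": "A"},
    {"birth_year": 1980, "gender": "M", "gp": "A"},
    {"birth_year": 1980, "gender": "F", "gp": "B"},
    {"birth_year": 1975, "gender": "F", "gp": "A"},
]

# Each clue stands for something inferred from a pseudonymised record.
clues = [
    lambda p: p["birth_year"] == 1980,  # e.g. from vaccination timing
    lambda p: p["gender"] == "F",       # e.g. from a pregnancy entry
    lambda p: p["gp"] == "A",           # from the clinician field
]

sizes = narrow(people, clues)
print(sizes)
```

No single clue identifies anyone here, but applied one after another they shrink the set of candidates at every step.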

There are many pieces of information in your health data that can indicate your age, for a start. The timing of vaccinations, including the ones that happen just after birth, is one. Birth itself is likely to create a fair few entries in any case. Then there are more esoteric codes, such as one that indicates that someone has had a “fall after 75”, which add more pieces to the puzzle.
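As a rough sketch of how vaccination timing gives away a date of birth, the Python below subtracts a typical age at vaccination from each event date. The event labels and the ages in `TYPICAL_AGE_DAYS` are placeholders chosen for illustration; they are not real Read Codes and are not taken from any official schedule.

```python
from datetime import date, timedelta

# Hypothetical labels and typical ages in days; NOT real Read Codes.
TYPICAL_AGE_DAYS = {
    "VACC_8_WEEKS": 56,
    "VACC_12_WEEKS": 84,
    "VACC_16_WEEKS": 112,
}


def estimate_birth_dates(events):
    """Return one birth-date estimate per recognised (label, date) event."""
    return [
        when - timedelta(days=TYPICAL_AGE_DAYS[label])
        for label, when in events
        if label in TYPICAL_AGE_DAYS
    ]


# Made-up events attached to a single pseudonym.
events = [
    ("VACC_8_WEEKS", date(2010, 3, 1)),
    ("VACC_12_WEEKS", date(2010, 3, 30)),
    ("UNRELATED_EVENT", date(2012, 1, 1)),  # ignored: says nothing about age
]

estimates = estimate_birth_dates(events)
print(min(estimates), max(estimates))
```

In this made-up example the two estimates fall a day apart, which is the point: a couple of routine entries can pin a removed date of birth down to a narrow window.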

Your address might not be readily available in the database but a good idea about location can be extracted from the GP information. There is no indication that GP information will be hidden. It’s extremely likely that the GP you visited most recently is near your home. If you tracked the data over time, you could create a “location history” that maps on to an individual’s house-moving pattern.
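The “location history” idea can be sketched in a few lines of Python. This assumes each record can be reduced to an event date plus the area of the practice involved; the towns and dates are invented, and `location_history` is a hypothetical helper rather than part of any real system.

```python
from datetime import date
from itertools import groupby


def location_history(events):
    """events: iterable of (event_date, practice_area) pairs.

    Returns the areas in date order with consecutive repeats collapsed,
    which approximates the sequence of places a patient has lived.
    """
    ordered = sorted(events)  # tuples sort by date first
    areas = (area for _, area in ordered)
    return [area for area, _ in groupby(areas)]


# Made-up events for one pseudonym, deliberately out of order.
events = [
    (date(2005, 6, 1), "Leeds"),
    (date(2003, 2, 14), "Leeds"),
    (date(2009, 9, 30), "Bristol"),
    (date(2012, 4, 2), "Bristol"),
    (date(2013, 11, 5), "York"),
]

history = location_history(events)
print(history)
```

Anyone who knows where a person has lived, and roughly when, can compare that knowledge against sequences like this one.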

From location histories you can extract some idea about family relationships: families tend to use the same GP and move home together. If the timing of one family member’s pregnancy matches the timing of another’s birth, you may have found a mum.

And working out someone’s gender from a pseudonymised health database is often even easier, especially for women. Pregnancies, contraceptive prescriptions and HPV vaccinations are all pretty obvious signs, as might be a prostate examination for a man.

Very few people are secretive about the ages of their children, or about where they have lived. Thus, shared location histories matching the known ages of children are likely to identify a large number of people. For example, someone with four children who moved towns twice is very likely to be uniquely identifiable in any pseudonymised NHS database. It could be you, it could be me.
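One way to check how exposed a dataset is would be to count how many pseudonyms share each combination of children’s birth years and location history. The Python sketch below does this for a tiny invented dataset; the pseudonyms, the “fingerprint” layout and the `uniquely_identifiable` helper are assumptions made for illustration.

```python
from collections import Counter


def uniquely_identifiable(records):
    """records: list of (pseudonym, fingerprint) pairs.

    Returns the pseudonyms whose fingerprint is shared with no one else.
    """
    counts = Counter(fingerprint for _, fingerprint in records)
    return [pid for pid, fingerprint in records if counts[fingerprint] == 1]


# Fingerprint = (children's birth years, towns lived in). All made up.
records = [
    ("p001", ((2001, 2003), ("Leeds",))),
    ("p002", ((2001, 2003), ("Leeds",))),
    ("p003", ((1998, 2001, 2004, 2007), ("Leeds", "Bristol", "York"))),
    ("p004", ((2005,), ("York",))),
    ("p005", ((2005,), ("York",))),
]

exposed = uniquely_identifiable(records)
print(exposed)
```

Here the only pseudonym with four children and two moves stands alone, so anyone who knows those facts about a neighbour or colleague can find that person’s full record.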

The NHS and Tim Kelsey really need to take a look at how much the data reveals before they make any bold claims about anonymity.