NHS must think like Google to make data project work

As the UK government wrangles with the sticky problem of how to make health records useful for research without compromising privacy, it might look to how Google has evolved for inspiration.

Google was once little more than a database. But with time, it has had to morph into something much more complex as a result of various pressures. This is exactly what care.data needs to do now.

Evolution of a giant

Google’s search engine has become a way for people to access sensitive and personal information – and as a consequence it has become more than just a resource: it has had to evolve to address the legal and ethical consequences of its potential. This is what is sorely lacking with the proposed overarching medical database in the UK, care.data.

Google gained market dominance as a search engine in 2001, providing a simple service. It didn’t know who you were (you couldn’t even log in) or what you had searched for previously. It just seemed to be finding more search results than previous market leader Altavista, and it presented them in a useful order.

In software engineering, an interface is a simplified view of software which concentrates on what values are put in and what outputs they produce. The essence of the Google interface consists of only three operations: entering a search term, navigating through the outcomes and following a link.

Behind the screens is a large database of information about web pages, which changes over time as Google spots changes in the web. It keeps full control of all copies of this, not least because it is too big to share and contains clever ideas that the company would rather keep under wraps.

Originally, the results of a search query would not depend on your previous Google interactions. Now they do, with Google aiming to build up a precise image of you, what you are searching for, and why. This also allows the company to fine-tune what it serves you. The care.data project lacks this kind of sophistication: data is viewed as just a resource.

From a database to a controlled interface

A basic database is not controlled or monitored, and the results of queries are not dependent on past queries or other external information. The database and interface are almost inseparable, forming just a data resource.

This is a perfectly sensible approach when the database contains no sensitive information and is not expected to support more abstract, higher-level functionality beyond answering queries.

This is precisely where Google has changed since its early days. The old Google was just a service to find out which web pages contain a certain bit of text. Nowadays, Google is a way of finding out about topics and, from a legal perspective, it is an entry point to sensitive information. It has had to modify its service, in particular its control of the interface, as a result.

Some information, such as illegal pornography or copyrighted material, exists in the database because it is on the web but cannot be allowed to show up in search results. Personal information is proving more of a headache. Google faces questions about whether it should remove personal information about people from search results if they ask it to.

There has been widespread speculation that implementing this right to be forgotten is a logistical impossibility, but given the variety of mechanisms already available and the scale of copyright-related filtering that already happens, this may not be so hard for Google after all.
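The idea of content that stays in the database but is kept out of results can be sketched in a few lines of Python. This is an illustration only, not a description of how Google works: the `search` function, the toy `index` and the `suppressed` set are all hypothetical names invented for the example.

```python
def search(index: dict[str, list[str]], term: str, suppressed: set[str]) -> list[str]:
    """Return the URLs indexed under `term`, minus any on the suppression list.

    The index itself is never modified: filtering happens only at the
    interface, at the moment results are handed back to the caller.
    """
    return [url for url in index.get(term, []) if url not in suppressed]


# Toy data (hypothetical): two pages match the term, one is suppressed.
index = {"jane doe": ["example.org/profile", "example.org/old-story"]}
suppressed = {"example.org/old-story"}

results = search(index, "jane doe", suppressed)
```

The suppressed page is still in `index`; it simply never reaches the person searching. Adding or removing an entry in `suppressed` changes what the interface shows without touching the underlying data.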

Health databases

Unfortunately, the public debate about a unified treatment of healthcare data has so far failed to move beyond the basic database perspective. The tone was set by David Cameron in 2011, when he announced his intention to exploit the mountains of data produced by the NHS.

But a barrage of negative publicity ensued. The sensitive nature of medical data was waved away with the reassurance that all the information would be anonymous. Questions still remain about who gets access to the data and what the overall purpose would be.

The narrative on anonymity may now finally have been fatally undermined. Researchers had long ago established that the usefulness of these databases lies in their rich and longitudinal character. Long and detailed stories about people’s health and treatments give deeper insight into their medical histories and a better chance of explaining them. But, by much the same process, they also give insight into who each person is.

One of the biggest problems is that the Health & Social Care Information Centre (HSCIC), the arm’s-length government organisation in charge of care.data, habitually treats medical information as a commodity to be shared freely.

Insurance and pharmaceutical companies have had extensive access to the hospital data it holds. There is an industry of data analytics companies that sell NHS data back to the NHS in a digested or more accessible form. Kingsley Manning, Chair of HSCIC, had to admit in parliament last month that he could not even say who the end users of the data were.

So far, care.data has been presented as yet another database resource for HSCIC to share in full or in part. Given how sensitive the data is, and how sensitive the public is to the uses made of it, this approach is insecure and unethical.

Privacy and security risks need to be managed rather than ignored. In other words, the interface needs to be controlled.

A controlled interface

In the care.data advisory group established recently, there has been discussion of a “fume cupboard” model for access to care.data. This would put into practice some of the lessons from Google’s history.

HSCIC would not share the database, but give others controlled and monitored access to the interface. Established security mechanisms such as Role Based Access Control could play a part in ensuring that queries match a defined purpose or policy for each type of user. Existing mechanisms, including automated ones, for detecting insider attacks could monitor and dynamically change access policies.

This would put a wealth of modern security engineering technology at the disposal of the protection of one of the most valuable data sets ever established. Better late than never.
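A controlled interface of this kind can be sketched in Python. This is a minimal illustration under stated assumptions, not HSCIC's design: the role names, the purposes, the `ControlledInterface` class and the "suspend after repeated denials" rule are all hypothetical, chosen only to show role-based checks, monitoring and a dynamically changing policy working together.

```python
from dataclasses import dataclass, field

# Hypothetical roles and the query purposes each role may declare.
ROLE_PURPOSES = {
    "nhs_commissioner": {"service_planning", "audit"},
    "academic_researcher": {"approved_study"},
}


@dataclass
class ControlledInterface:
    """Gatekeeper in front of the data: callers never hold the database itself."""

    max_denials: int = 3  # assumed threshold before a user is suspended
    audit_log: list = field(default_factory=list)
    denials: dict = field(default_factory=dict)
    suspended: set = field(default_factory=set)

    def query(self, user: str, role: str, purpose: str, run_query):
        # Role-based check: the declared purpose must be permitted for the role.
        allowed = (
            user not in self.suspended
            and purpose in ROLE_PURPOSES.get(role, set())
        )
        # Monitoring: every attempt is recorded, whether or not it succeeds.
        self.audit_log.append((user, role, purpose, allowed))
        if not allowed:
            # Dynamic policy: repeated denials lead to suspension.
            self.denials[user] = self.denials.get(user, 0) + 1
            if self.denials[user] >= self.max_denials:
                self.suspended.add(user)
            raise PermissionError(f"access denied for {user!r} ({role}, {purpose})")
        return run_query()
```

A caller passes in the query as a function, so the data is only ever read inside the gatekeeper. For example, `ControlledInterface().query("alice", "academic_researcher", "approved_study", lambda: 42)` runs the query, while the same user declaring an unlisted purpose gets a `PermissionError` and an entry in the audit log.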

An extended version of this article is available on the author’s blog.
