Big data is what we have when the data sets we have accumulated become too large and complex to analyse in traditional ways. The challenge then is to find ways of collecting, storing and linking these data so that researchers can gain new insights, such as determining the causes of a disease or spotting health associations we hadn’t seen before, and then to find practical applications for them.
The first phase of the Big Data Network, funded by the ESRC, is now underway: the Administrative Data Research Network (ADRN), which will provide researchers throughout the UK with access to anonymised data routinely collected by government departments. Each nation in the UK (England, Wales, Scotland and Northern Ireland) will have its own dedicated administrative data research centre, with a fifth administrative centre in Essex.
Linking data to find new insights in health isn’t a new idea. In the 19th century, William Farr, a British epidemiologist and early pioneer of medical statistics, was able to demonstrate which areas experienced the highest death rates by linking numbers of deaths with population estimates.
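The calculation behind this kind of comparison is simple and can be sketched in a few lines of Python. The area names and figures below are invented purely for illustration and are not Farr’s data; the function name is likewise hypothetical.

```python
# Illustrative only: area names and counts are made up, not historical data.
deaths = {"Area A": 420, "Area B": 150, "Area C": 95}
population = {"Area A": 20000, "Area B": 10000, "Area C": 9500}


def crude_death_rates(deaths, population, per=1000):
    """Link death counts to population estimates by area and
    return deaths per `per` residents for each area present in both."""
    rates = {}
    for area, count in deaths.items():
        # Only areas that appear in both sources can be linked.
        if area in population and population[area] > 0:
            rates[area] = count / population[area] * per
    return rates


rates = crude_death_rates(deaths, population)
# Rank areas from highest to lowest death rate.
ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
for area, rate in ranked:
    print(f"{area}: {rate:.1f} deaths per 1,000")
```

Neither source is informative on its own: a raw count of deaths says little without the size of the population it came from, which is why the link between the two matters.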
More recently, linking information about deaths and hospital admissions with area of residence and modelled information on flight paths showed an increased risk of heart disease and stroke associated with exposure to high levels of aircraft noise.
Linking records is an efficient and cost-effective way of finding patterns, without the need to recruit more participants or conduct new studies. The method has been used to convert cross-sectional health surveys into longitudinal studies, such as a recent example which found that even modest elevations of psychological distress were associated with an increased risk of death from all causes and from cardiovascular disease. It has also been used to follow up clinical trial participants. We have a wealth of information we can use, so the big data push is to make the most of it.
The launch of the network follows and complements the establishment of the Farr Institute, a health informatics centre that will research how healthcare professionals can make best use of all the data we have to deliver better care for patients. It is funded by the Medical Research Council as part of a wider strategy to integrate clinical, genetic and other biomedical data, and the aim is to attract investment from pharmaceutical and IT companies.
We also now have honest brokerage services (bodies that run data warehouses where sensitive data are centrally collated and where trusted linkage services can accurately join disparate datasets) and safe havens (secure settings where de-identified information can be analysed).
Opening up even more
The new centres will open up access to many more datasets, although further scrutiny of existing legislation, or the enactment of new legislation, is likely to be needed to allow some government departments to link their information. These might include the Department for Work and Pensions (which handles social security benefits data) and HM Revenue and Customs (which handles data related to taxable income).
Being able to compare more datasets from wider sources will mean more robust evidence informing policies and better evaluation of current ones. We’ll have a greater understanding of what influences social mobility and inter-generational patterns of poverty and ill-health, for example; be able to study the “life-course” of people through the benefits system, and the wider effects of changes in eligibility and perhaps income on individuals and families; and find greater clarity about the relationship between poor mental health and the criminal justice system.
For all of this to work, it’s important that the public (and those looking after the data) have the maximum confidence that their data are being used appropriately, and that safety and security are paramount. The proposed system and the administrative centres will ensure that trusted third parties who link records together never see identifiable administrative information and that accredited researchers and administrative data research centre (ADRC) support staff never have access to, or sight of, personal identifying information.
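To make the idea of linking records without exposing identities more concrete, here is a minimal Python sketch of one common approach: replacing each personal identifier with a keyed hash before datasets are joined. This is an assumption for illustration only and not a description of the network’s actual procedure; the key, field names, function names and records are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held only by the linkage service, never by researchers.
LINKAGE_KEY = b"example-secret-key"


def pseudonym(identifier: str) -> str:
    """Turn a personal identifier into a keyed hash (HMAC-SHA256)."""
    normalised = identifier.strip().upper()
    return hmac.new(LINKAGE_KEY, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(records, id_field):
    """Replace the identifying field with a pseudonymous link key."""
    out = []
    for record in records:
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["link_key"] = pseudonym(record[id_field])
        out.append(cleaned)
    return out


def link(left, right):
    """Join two de-identified datasets on the shared link key."""
    index = {row["link_key"]: row for row in right}
    return [{**row, **index[row["link_key"]]} for row in left if row["link_key"] in index]


# Made-up records from two imaginary departments.
health = [{"person_id": "ab123", "admissions": 2}, {"person_id": "cd456", "admissions": 0}]
benefits = [{"person_id": "AB123 ", "claimant": True}, {"person_id": "ZZ999", "claimant": False}]

linked = link(deidentify(health, "person_id"), deidentify(benefits, "person_id"))
print(linked)  # one linked row, containing no person_id field
```

In this sketch the researcher would only ever receive the output of `link`, which carries the link key and the attributes of interest but not the original identifier. A real service would need far more than this, including secure key management and checks against re-identification from the attributes themselves.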
Significant investment will also need to be made in metadata from the respective governmental agencies if the scope is to be big enough. Understanding the structures and caveats of these new and complex datasets will involve a steep learning curve for prospective researchers. The diversity of the data will almost certainly require a whole array of disciplines to work together across the social and health sciences, including geography, economics and finance, and the sheer magnitude of the data to be analysed will require new approaches and techniques.
It may even encourage novel collaborations with computational and information scientists, such as those from the physical sciences and astronomy.
The big data idea is just the start. More exciting still will be making it happen and seeing what new and maybe surprising insights come from these new collaborations.