Google Flu Trends – The Conversation

Digital epidemiology: tracking diseases in the mobile age

2015-03-04T10:54:01Z

It must be the flu, right? Woman via Shutterstock

Being stuck in bed, waiting for the flu to run its course, is pretty unpleasant. And it’s also really boring. What else is there to do but search for symptoms online, and read entries about the flu on Wikipedia or WebMD or post messages on Facebook and Twitter about how sick you are?

A lot of people get the flu every year and many of them do exactly that: they search for relevant information, and share their misery with the rest of us. The consequence is remarkable: a description of their symptoms, time-stamped and perhaps even geo-tagged, is online. Which means that the internet has a rather detailed picture of the health of the population, coming from digital sources, through all of our connected devices, including smartphones.

This is digital epidemiology: the idea that the health of a population can be assessed through digital traces, in real time.

It has the potential to be a powerful boon for traditional epidemiology. Researchers have already started to develop methods and strategies for using digital epidemiology to support infectious disease monitoring and surveillance or understand attitudes and concerns about infectious diseases. But much more needs to be done to integrate digital epidemiology with existing practices, and to address ethical concerns about privacy. By 2020, there will be 6.1 billion smartphone users, so it is high time to get serious about digital epidemiology.

Projected smartphone subscriptions 2014 to 2020. Ericsson, CC BY-NC-ND

Digital epidemiology goes mainstream: Google Flu Trends

Google Flu Trends was one of the first popular examples of digital epidemiology. Launched in 2008 to help predict flu epidemics, it was based on a very simple idea: when people come down with the flu, they will often turn to the internet and search for information about their symptoms.

In 2009, researchers from Google and the US Centers for Disease Control and Prevention (CDC) published a paper with the apt title “Detecting influenza epidemics using search engine query data,” outlining a method for using search queries to recognize flu outbreaks.

For many years, Google Flu Trends has served as a prime example of digital epidemiology. It embodies both the opportunities and the challenges the field faces. While it has undoubtedly popularized the idea of using digital data to derive epidemiological insights, Google Flu Trends has also demonstrated that this is no easy task.

For starters, its estimates were not always very accurate. Indeed, during the 2012-2013 flu season in the northern hemisphere, it overestimated the flu prevalence by up to 100% (relative to CDC numbers). And the estimates cannot be reproduced easily – Google controls access to Google data, of course.
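To make the "100%" figure concrete, here is a minimal sketch of how a relative overestimate can be computed. The function name and the two percentages are invented for illustration; they are not actual Google or CDC values.

```python
def relative_overestimate(estimate: float, reference: float) -> float:
    """Return how far `estimate` exceeds `reference`, as a fraction of `reference`."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return (estimate - reference) / reference


# Hypothetical figures: share of doctor visits attributed to flu-like illness.
cdc_percent = 4.0           # assumed reference value
search_based_percent = 8.0  # assumed search-based estimate

overestimate = relative_overestimate(search_based_percent, cdc_percent)
print(f"{overestimate:.0%}")  # prints 100%
```

An overestimate of 100% therefore means the search-based estimate was double the reference figure.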

For this reason alone, many researchers have in the past few years turned to alternative data sources. Twitter has been a particularly popular source, because tweets are public by default, and because Twitter data can be accessed by anyone.

Twitter and Wikipedia are becoming data sources for digital epidemiology

For instance, a study from 2011 used data from Twitter to measure public interest and concern about the influenza H1N1 virus and to track disease activity. Another study from 2014 showed that incorporating data from Twitter into CDC influenza-like illness models can reduce forecasting errors. Twitter has also been used to assess health sentiments such as those about vaccination, and to monitor drug safety.

And Wikipedia access logs – openly accessible data about how often certain Wikipedia pages were accessed on the web – have recently provided a rich data source for disease monitoring and forecasting. Research suggests that examining Wikipedia access logs could support traditional disease surveillance for influenza.

A doctor uses a smartphone to conduct an eye exam in Kenya, October 29, 2013. Noor Khamis/Reuters

The doctor is in your pocket: epidemiology goes mobile

But it’s not just publicly accessible data from Twitter and Wikipedia that have been harnessed for epidemiology. Anonymized mobile phone data have provided unparalleled insights into how the movement of people affects disease dynamics.

For example, cell phone data have been used to measure how human travel patterns spread malaria and to rapidly estimate population movements during disasters and outbreaks, such as the earthquake and subsequent cholera outbreak in Haiti in 2010.

Apps that allow the self-diagnosing of diseases are not too far away. With the help of a small attachment, a smartphone can already be turned into a mobile clinic able to diagnose multiple infectious diseases in minutes.

Map generated by more than 250 million public tweets (collected from Twitter.com) with high-resolution location information, broadcast between March 2011 and January 2012. Salathé et al., CC BY

Traditional + digital = a better picture

Public health is traditionally based on data collected from health-care providers, who collect data from sick patients. This produces a very patchy picture. It only includes those populations who have access to health care or who decide to go to the doctor in the first place. And it mostly includes information about reportable diseases, missing out on a huge array of other illnesses. Last but not least, it largely misses out on information about health behaviors, sentiments and opinions.

Digital epidemiology can add more information to that picture and fill in some of the blanks. Of course, digital epidemiology won’t capture the entire population. But neither do traditional ways of gathering epidemiological data. With the vast majority of the world getting online, populations who slipped under the radar of public health will become more visible, which is crucial in a world where diseases anywhere today are diseases everywhere tomorrow. And it will also enable us to fulfill the mantra of “early detection, early response” by building digital warning systems designed to stop pandemics in their tracks.

Don’t forget privacy and surveillance

Digital epidemiology faces ethical challenges about surveillance and privacy as well. Ill health is stigmatized – socially and economically – in all societies. And people are more and more concerned about surveillance and information privacy. As digital epidemiology grows, we need to keep these ethical considerations at the forefront.

Marcel Salathé does not work for, consult, own shares in or receive funding from any company or organization that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.

Google’s Larry Page wants to save 100,000 lives but big data isn’t a cure all

2014-06-27T14:24:32Z

Machines lack the human touch needed in healthcare. Shutterstock robot surgery

Talking up the power of big data is a real trend at the moment and Google founder Larry Page took it to new levels this week by proclaiming that 100,000 lives could be saved next year alone if we did more to open up healthcare information.

Google, likely the biggest data owner outside the NSA, is evidently carving a place for itself in the big data vs life and death debate but Page might have been a little more modest, given that Google’s massive Flu Trends programme ultimately proved unreliable. Big data isn’t some magic weapon that can solve all our problems and whether Page wants to admit it or not, it won’t save thousands of lives in the near future.

Big promises

Saving lives by analysing healthcare data has become a major human ambition, but to say this is a tricky task would be an enormous understatement.

In the UK, the government has just produced a consultation on introducing regulations for protecting this kind of information alongside care.data, a huge scheme aiming to make health records available to researchers and others who could work with it.

Given the ongoing care.data debacle, this is a broadly sensible document and a promising start for consultation. In particular, it identifies different levels of data. Data that could be used to identify an individual person should not be shared in the same way as other types of data.

But, like Page, the UK government is also presenting a false vision for big data. It has said that review after review has found that a failure to share information between healthcare workers has led to child deaths. It’s an emotive admission but rather beside the point in the big data perspective.

It is indeed entirely credible that many tragic failures within the NHS might have been prevented by someone sharing the right information with the right person. Sharing is essential, but when the NHS talks about sharing, it means linking and sharing large medical databases between organisations. Surely no case review has ever claimed that the mere existence of a larger database of information would have got the right knowledge to the right person.

Medical data sharing may be a good thing in many ways, but unfortunately there is no clear case yet that automated analysis of data prevents child deaths and other tragedies. It is only big data, not magic. Preventing child deaths appears to be brought in as emotional blackmail, expected to trump the valid concerns over the NHS’ big data plans.

Big disappointments

The fact is, we are not as advanced as we would like to believe. This month, 60 years after Alan Turing died, his test for recognising “true” artificial intelligence made the news again. One in three human test subjects mistook a computer programme called Eugene Goostman for a 13-year-old Ukrainian boy. But Eugene didn’t really pass the test. The programme was simply good at playing the game and relied heavily on the fact that a 13-year-old probably wouldn’t know the answers to many of the questions.

The programme fell back on the same tactics used some 42 years ago by Parry, a programme that tricked people into thinking it was a paranoid schizophrenic, and the even earlier Eliza programme which had proved hard to distinguish from a real Rogerian therapist. So much for progress.

The research field of artificial intelligence – or more modestly, machine learning – has been active for 60 years and passing the Turing test is its original Holy Grail. And many of the brightest minds in computer science have worked in this area. Computing power has been increasing exponentially over that time and the web provides a massive amount of samples of human communication to learn from. The fact that we have made such slow progress despite all these developments shows just how hard it is to turn vast amounts of data into human intelligence.

Be wary of big claims

This should teach us to be wary of anyone who makes bold claims about the potential of big data. Google Flu Trends sought to derive information about the spread of illness by gathering data when people searched for terms like “flu”. But we’ve seen time and time again that machines don’t understand humans and can’t mimic real human qualities.

A prime example can be found outside healthcare. It’s now broadly accepted that in the course of its surveillance programmes, the NSA had obtained information that might have prevented 9/11, but failed to join the dots.

Edward Snowden’s revelations made it clear that the NSA and GCHQ are collecting large “haystacks” of communications data. The intelligence services have made various claims that the analysis of this prevented serious terrorist attacks, but these claims have not stood up to detailed scrutiny. Given the amount of computing power the NSA possessed, even before the internet age, it must have been applying machine learning techniques to its bulk data for at least 30 years. Still, no evidence has been presented of any significant needles being found as a result – at least not any that is available to the public.

This all goes to show that using machine learning to process vast amounts of data, such as the information held in healthcare databases, won’t save lives on its own. The kind of human insight needed to put the information to proper use still can’t be replicated by computers, even after decades of trying.

Doctors need to be able to ask the right questions and use their unique human qualities to make life changing decisions for their patients. Similarly, researchers still need to formulate their hypotheses and ask the medical databases targeted questions. They are not machines, and we should be grateful for that.

Eerke Boiten is a senior lecturer in the School of Computing at the University of Kent, and Director of the University's interdisciplinary Centre for Cyber Security Research. He receives funding from EPSRC for the CryptoForma Network of Excellence on Cryptography and Formal Methods. He is a member of BCS and board member of its specialist group on Formal Aspects of Computer Science. He is also a director (governor) of The John of Gaunt School, a Community Academy.

Google’s flu fail shows the problem with big data

2013-10-24T18:05:58Z

Is more data better data? Jorge Royan

When people talk about ‘big data’, there is an oft-quoted example: a proposed public health tool called Google Flu Trends. It has become something of a pin-up for the big data movement, but it might not be as effective as many claim.

The idea behind big data is that large amounts of information can help us do things which smaller volumes cannot. Google first outlined the Flu Trends approach in a 2008 paper in the journal Nature. Rather than relying on disease surveillance used by the US Centers for Disease Control and Prevention (CDC) – such as visits to doctors and lab tests – the authors suggested it would be possible to predict epidemics through Google searches. When suffering from flu, many Americans will search for information related to their condition.

The Google team collected more than 50 million potential search terms – all sorts of phrases, not just the word “flu” – and compared the frequency with which people searched for these words with the number of reported influenza-like cases between 2003 and 2006. This comparison revealed that, out of the millions of phrases, there were 45 that provided the best fit to the observed data. The team then tested their model against disease reports from the subsequent 2007 epidemic. The predictions appeared to be pretty close to real-life disease levels. Because Flu Trends would be able to predict an increase in cases before the CDC, it was trumpeted as the arrival of the big data age.
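The selection step can be sketched in a few lines of code. This is only an illustration: it ranks candidate terms by Pearson correlation with reported cases, which is an assumed scoring rule rather than a confirmed description of the Google team's method, and the function names, search terms and numbers are all invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no signal
    return cov / sqrt(var_x * var_y)


def select_terms(term_frequencies, reported_cases, k):
    """Return the k terms whose search frequency tracks reported cases best."""
    scores = {
        term: pearson(freqs, reported_cases)
        for term, freqs in term_frequencies.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Invented weekly data: reported influenza-like cases and search counts.
reported_cases = [1.0, 2.0, 4.0, 3.0, 1.5, 1.0]
term_frequencies = {
    "flu symptoms": [10, 21, 39, 31, 16, 9],
    "fever remedy": [5, 9, 14, 15, 8, 6],
    "football scores": [30, 28, 31, 29, 30, 32],
}

best = select_terms(term_frequencies, reported_cases, k=2)
print(best)  # ['flu symptoms', 'fever remedy']
```

With 50 million candidates rather than three, some terms will fit the historical data well purely by chance, which is one reason a test on later seasons matters.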

Between 2003 and 2008, flu epidemics in the US had been strongly seasonal, appearing each winter. However, in 2009, the first cases (as reported by the CDC) appeared around Easter. Flu Trends had already made its predictions when the CDC data was published, but it turned out that the Google model didn’t match reality. It had substantially underestimated the size of the initial outbreak.

The problem was that Flu Trends could only measure what people search for; it didn’t analyse why they were searching for those words. By removing human input, and letting the raw data do the work, the model had to make its predictions using only search queries from the previous handful of years. Although those 45 terms matched the regular seasonal outbreaks from 2003–8, they didn’t reflect the pandemic that appeared in 2009.

Six months after the pandemic started, Google, which now had the benefit of hindsight, updated its model so that it matched the 2009 CDC data. Despite these changes, the updated version of Flu Trends ran into difficulties again last winter, when it overestimated the size of the influenza epidemic in New York State. The incidents in 2009 and 2012 raised the question of how good Flu Trends is at predicting future epidemics, as opposed to merely finding patterns in past data.
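The difference between matching past data and predicting future epidemics can be shown with a toy model. The sketch below fits a straight line relating search volume to reported cases, then applies it to a later season in which searches rose faster than illness. It is a simplified stand-in, not the Flu Trends model, and every number and name in it is invented.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_abs_error(xs, ys, slope, intercept):
    """Average absolute gap between the line's predictions and the observations."""
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


# Invented past seasons: searches and reported cases move together.
past_searches = [10, 20, 30, 40]
past_cases = [1.0, 2.0, 3.0, 4.0]

# Invented later season: search volume rises much faster than cases.
later_searches = [40, 60, 80]
later_cases = [3.0, 3.5, 4.0]

slope, intercept = fit_line(past_searches, past_cases)
in_sample_error = mean_abs_error(past_searches, past_cases, slope, intercept)
out_of_sample_error = mean_abs_error(later_searches, later_cases, slope, intercept)

print(round(in_sample_error, 6), round(out_of_sample_error, 6))
```

The line reproduces the past seasons almost perfectly, yet it overestimates every week of the later season, because the relationship it learned no longer holds.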

In a new analysis, published in the journal PLOS Computational Biology, US researchers report that there are “substantial errors in Google Flu Trends estimates of influenza timing and intensity”. This is based on comparison of Google Flu Trends predictions and the actual epidemic data at the national, regional and local level between 2003 and 2013.

Even when search behaviour was correlated with influenza cases, the model sometimes misestimated important public health metrics such as peak outbreak size and cumulative cases. The predictions were particularly wide of the mark in 2009 and 2012:

Original and updated Google Flu Trends (GFT) model compared with CDC influenza-like illness (ILI) data. PLOS Computational Biology 9:10

Although they criticised certain aspects of the Flu Trends model, the researchers think that monitoring internet search queries might yet prove valuable, especially if it were linked with other surveillance and prediction methods.

Other researchers have suggested that further sources of digital data – from Twitter feeds to mobile phone GPS – have the potential to be useful tools for studying epidemics. As well as helping to analyse outbreaks, such methods could allow researchers to analyse human movement and the spread of public health information (or misinformation).

Although much attention has been given to web-based tools, there is another type of big data that is already having a huge impact on disease research. Genome sequencing is enabling researchers to piece together how diseases transmit and where they might come from. Sequence data can even reveal the existence of a new disease variant: earlier this week, researchers announced a new type of dengue fever virus.

There is little doubt that big data will have some important applications over the coming years, whether in medicine or in other fields. But advocates need to be careful about what they use to illustrate the ideas. While there are plenty of successful examples emerging, it is not yet clear that Google Flu Trends is one of them.