Funding bodies will have to force scientists to share data

Sharing raw data is not much harder today than rearranging Scrabble tiles. Justin Grimes

The open access movement is forcing publishers to take down paywalls, making publicly funded research available to the public for free. But beyond that a more important development is pacing in the wings - that of open data.

With open access the issue has been free access to the results of scientific work. However, by “results” researchers really mean published papers which, bluntly, are only what scientists write about after looking at their data. With the open data drive, advocates are saying that the actual raw data should be available too. Anyone could then pick over, explore and re-use the data. This shift represents a behavioural sea-change that will also fix some substantial threats to the integrity of science.

The benefits of open data are clear. First, just the knowledge that the raw data will be out there for other analysts to check may make researchers more responsible about their data. Second, there is vast potential in the re-use of data. Researchers sometimes invest large amounts of resources in collecting data only to publish one slice of it before having to move on to new projects.

Sometimes they do not even have time to publish anything, or feel that their results are not good enough to publish - whether that be rooted in their belief about how “negative results” will be received by journals and their peers, or whether writing up something unexciting is just not worth it. About half the results presented at conferences are not published in journals, about half the projects funded by public money never produce any journal articles and negative results from clinical trials often get pushed under the rug.

This means that the “results” out there in the scientific literature are a warped representation of the data that has been collected. Add to this the sheer waste of developing a database then throwing it away once the tusk of a nice finding has been poached from it. If the primary researchers do not have the time to fairly represent everything they have collected, why not just put the data out there? Sharing the data is a fix to our current ills. Yet the data sits on the hard drives of scientists around the world.

What’s the hold up?

Limited infrastructure was one excuse not to share such data. But even when some universities built data archives ready for a data deluge, scientists avoided using them. It is not that researchers disagree with the idea of sharing data, but they have apprehensions about putting raw data “out there”.

First, there will always be a better statistician than you somewhere in the world, who can simply take your analyses apart and do them better. That is uncomfortable. Worse, what if someone somewhere does a hatchet job and claims your data “shows” something it does not? What about legalities around patient privacy and consent, or discoveries from your data or patents? Finally, what is in it for an individual scientist or even a research group?

Scientists understand the need for sharing data openly, but they lack the incentive. Yet there may be a way forward by tapping into the concepts of database citations and “data papers”.

The reputation currency of a scientist is often measured by how many papers he or she has published and how many times those papers are cited by other scientists in their papers. While it is not a perfect metric, it is widely used by journals.

The idea then would be to apply such a metric to databases. Assign a unique identifier to a database so that it can be cited just like a paper. Credit is then given to the authors of that database. Some new “data journals” are going a step further by inviting scientists to write citable data papers to complement those deposited databases. These papers detail everything needed to use the data without pestering the original authors.
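To make the idea concrete, here is a minimal sketch in Python of how a deposited database might be turned into a citable reference once it has an identifier. It assumes the identifier is a DOI and loosely follows a "creators (year), title, repository, identifier" pattern. The record fields, the function name and the example values are all hypothetical placeholders, not a real dataset or any repository's actual interface.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DatasetRecord:
    """Minimal citation metadata for a deposited dataset (illustrative fields only)."""
    creators: Tuple[str, ...]
    year: int
    title: str
    repository: str
    doi: str  # persistent identifier assigned when the data is deposited


def format_citation(record: DatasetRecord) -> str:
    """Render a citation as: Creators (Year). Title. Repository. DOI link."""
    authors = "; ".join(record.creators)
    return (
        f"{authors} ({record.year}). {record.title}. "
        f"{record.repository}. https://doi.org/{record.doi}"
    )


# Hypothetical record: names, title and DOI are placeholders, not a real dataset.
example = DatasetRecord(
    creators=("Researcher, A.", "Colleague, B."),
    year=2013,
    title="Example survey dataset",
    repository="Example Repository",
    doi="10.1234/example.5678",
)

print(format_citation(example))
```

The point of the sketch is only that a database with a stable identifier and named creators can sit in a reference list like any paper, which is what allows citations to it to be counted.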

As a researcher, the ideal scenario for me would be that I hand in my database to the funding body at the end of the project. They check that the data is good, nominate a repository and I write a data paper for the repository. Once that is done, I am granted a grace period to finish writing research papers before the raw data gets released to the outside world.

Only a few funding bodies have mandated sharing, and even they are not enforcing it. Weak sticks and theoretical carrots will not be enough to drive scientists into this bold new territory. The culture of sharing raw data will only truly begin when researchers are forced to do so by funding bodies.

7 Comments

  1. Gavin Moodie

    Adjunct professor at RMIT University

    I agree that open access to data would be a considerable advance. Data citations seem a good encouragement to publishing data.

    1. Michael Galsworthy

      Senior Research Associate in Health Services Research at University College London

      In reply to Gavin Moodie

      Thanks, Gavin. Yes it's something that certainly needs to happen and will...

      It's now just a case of raising awareness, putting a useful system in place to do it (compulsory data sharing plus incentivising doing a good job of it), and then ironing out the difficulties. Initial fiddly difficulties will be many (another essay), but nothing that can't be overcome as we work it out.

    2. Chris Taylor


      In reply to Michael Galsworthy

      Hi,

      There is some infrastructure in place from DataCite: 'DOI names can be used for any form of management of any data, whether commercial or non-commercial.' [http://www.datacite.org/whatisdoi]

      As for motivation, certainly citing is one aspect that will help as it fits existing infrastructure, but referees need to ask authors where data came from if they don't make that explicit (some journals require this already). http://figshare.com/ and http://datadryad.org/ both support citeable data…

    3. Michael Galsworthy

      Senior Research Associate in Health Services Research at University College London

      In reply to Chris Taylor

      That's an absolutely top comment, Chris - thanks for that.

      Alongside Dryad and Figshare I would like to note that some leading universities (including my own, UCL) are developing institutional repositories that are looking to link with DataCite and fully integrate in the new way of doing things.

      Excitingly, CERN has also recently developed an online database sharing resource in collaboration with the excellent EU Open Access project OpenAIRE [http://horizon2020projects.com/excellent-science/europe-research-database-launched/]. It's called Zenodo and you can see it here: http://zenodo.org/ . If you scroll down, you may even spot the occasional Altmetric [http://www.altmetric.com/] doughnut/ring measuring social impact. They have integrated that already... How cool is that?

  2. Kris Rogers

    Biostatistician

    Hi Michael,
    I was wondering if you could comment on the practicalities of sharing data that is potentially identifiable or sensitive? Most of my work involves personal information that in Australia must have access restrictions. I am very pro sharing of code/tools etc., but I've never been able to personally reconcile this attitude with identifiable data (most person level information is potentially identifiable) where custodians have a duty to restrict access. What's the solution?

    1. Michael Galsworthy

      Senior Research Associate in Health Services Research at University College London

      In reply to Kris Rogers

      Hi Kris – absolutely I’m happy to comment on that.

      I work in health research too and oftentimes on large databases of patient information. By law, these data have to be “anonymised” to a reasonable degree even on our work computers or when shared between researchers.

      Generally speaking, for most databases that are gathered for a clear focus of research, just stripping out patient name and address will mean that no-one can be recognised in the database - because there is just not enough info…

    2. Michael Galsworthy

      Senior Research Associate in Health Services Research at University College London

      In reply to Michael Galsworthy

      Just to follow up on that – for large information-rich patient databases, there are two general solutions that I am aware of.
      Firstly, you can break it down into different databases that address different questions (none of the databases having enough information to identify individuals). However, then you can’t link those databases for reasons of preserving anonymity. That means that you also limit the research questions you can ask.
      Secondly, you can drop the idea of an open database altogether…
