When it comes to scientific data, sharing is caring

There is no doubt that funding scientific research is costly. The Hubble Space Telescope cost US taxpayers a staggering $1.5 billion to build, making it one of the most expensive pieces of scientific equipment ever made.

Of course, the fact that Hubble has added to our knowledge of the universe is payment enough; since its launch in 1990 Hubble has helped scientists calculate the age of our universe, advanced our understanding of black holes, and sent us breathtaking pictures of the birth of distant stars.

According to Dr Ross Wilkinson, executive director of the Australian National Data Service, Hubble is also one of the best examples of a healthy return on investment in scientific research.

This is because raw data from the telescope has been available for the past 20 years to whoever wants it, resulting in an archive that generates $1 million (€728,000) per annum in revenue.

Having an open data policy has, in other words, doubled Hubble’s return on investment.

This example from Wilkinson perhaps best illustrates the reason why 477 academics and policymakers from around the globe gathered in Dublin last week for the Research Data Alliance’s Third Plenary Meeting.

Rather than being a niche gathering, it was a place for people who "care about how the sharing of research data can progress to discoveries that have the potential to be of benefit to all," said Dr Ruth Adler, the Australian ambassador to Ireland.

Research data is costly enough and difficult enough to generate in the first place; having it sitting forgotten on a hard drive somewhere, never to be shared, is not only careless but also not in the spirit of scientific endeavour.

A 2013 study found that the amount of scientific data being generated is increasing by 30 per cent year on year and, worryingly, 80 per cent of all scientific data is lost within two decades.

Reproducibility
One of the reasons it is so important to prevent this data loss is reproducibility, says Prof Alan Smeaton, director of the Insight Centre for Data Analytics, Dublin City University.

“A researcher’s data should be publicly and easily available. It doesn’t matter whether you’re in astronomy or the arts – you use data.

“What you, as a researcher, want to happen is to have other people build upon your work. It’s an unnecessary burden if they have to go recreate all of that data.”

Paraphrasing Isaac Newton’s famous utterance, Smeaton explains how science works: “We build on the shoulders of others by using their data.”

Data sharing is also important to the public, explains Smeaton, because, for the most part, science is funded by the public purse, so the public should demand and get easier access to research outputs. In terms of publications, this is now becoming standard as mandated by funding agencies.

A scientific paper is only the tip of the iceberg, the small part of a researcher's work that is visible to the outside world, says Mercè Crosas, director of Data Science at the Institute for Quantitative Social Science (IQSS) at Harvard University.

“The researcher’s output in terms of contribution to scientific progress is much bigger than this. The data and analysis associated with those claims are fundamental to understanding what the paper in a sense advertises,” she explains.

In the absence of sharing data and methodology, it's hard to validate or understand scientific works.

Guidelines
To this end Crosas's team at the IQSS provide data publishing guidelines and create tools for scientists to access and re-use existing research data.

According to Crosas, data collection, preparation, cleaning and analysis make up, to some extent, 80 per cent of the work, and not being able to re-use and value that effort “would be a shame, right?”

Scientific data for the masses isn’t an altruistic gesture; it translates directly to innovation and the potential for industry collaboration.

Dr Sandra Collins, director of the Digital Repository of Ireland, gave a concrete example of why industry should care about publicly available scientific data.

She explained that existing datasets could be used right now by technology startups interested in creating useful smartphone or web apps.

Prof Mark Ferguson, director general of Science Foundation Ireland and chief scientific adviser to the Government, addressed the conference with some attractive statistics on why data sharing might appeal to the researcher who is protective of their data.

The most successful research papers, he says, in terms of citations by other scientists, are those resulting from industry collaboration as well as national and international collaboration.

Papers arising from data siloed within a researcher’s own department were far less likely to succeed, a compelling argument for all academics to embrace open data and collaboration.