Scientific Data: Losing the trail

Written by Andy Prosper

Lost or inaccessible scientific data is a daily problem with serious implications, because without the data we may find ourselves unable to trace the foundations of scientific knowledge.

Data loss is commonplace. Anyone who has used a computer has, in all likelihood, been unable to open a file because it was damaged, was no longer compatible with current programs, or was stored on a medium that ordinary computers cannot read. But it is ironic that at a time when scientists specialize in big scientific data, much of the data they have collected is, in fact, inaccessible.

In the study “The availability of research data declines rapidly with article age”, a team led by Timothy H. Vines analyzed a sample of 516 scientific studies between 2 and 22 years old and found that, in many cases, the data that accompanied them (what we might call raw data, that is, data without interpretation) were impossible to locate, either because the authors had changed their email addresses and could not be contacted, or because the data were stored on technology that is now obsolete. According to the article, the probability of the data remaining available falls by 17% each year.
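To get a feel for what a 17% yearly decline implies, here is a minimal sketch that treats the figure as a simple compound annual decline. This is a simplifying assumption for illustration (the study itself reports the decline in terms of odds), but it conveys the scale of the problem across the 2-to-22-year range of articles the team sampled.

```python
# Sketch: estimated fraction of datasets still available after t years,
# assuming the reported 17% decline compounds yearly (an illustrative
# simplification, not the study's exact statistical model).
def fraction_available(years, annual_decline=0.17):
    return (1 - annual_decline) ** years

for t in (2, 10, 22):
    print(f"after {t:2d} years: {fraction_available(t):.3f}")
```

Under this rough model, only about two-thirds of datasets survive two years, and well under 5% survive the full 22-year window, which matches the article's pessimistic tone.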

Since scientific knowledge rests mainly on the possibility of replicating experiments or reanalyzing the data behind them, the loss of data, combined with the impossibility of contacting the people who generated it, casts doubt on the results, which have probably been used to generate other studies, which in turn support others, and so on. The material basis of knowledge is the results of experiments: the data and measurements needed to carry out any analysis, or to criticize a previous one. Only then come the theories built on the data, or the counter-hypotheses that redirect a theory in light of it.

Findings like those of Vines highlight at least three major problems. The first is that when the data cannot be accessed, the knowledge that emerges from it remains, so to speak, “in the air”. We know that knowledge is cumulative, at least in most disciplines, but if the data on which that knowledge accumulates no longer exists, what is it actually based on? The data cannot be dispensed with.

The situation is reminiscent of the story of Hansel and Gretel, in which Hansel leaves a trail of bread crumbs to mark the way back home, but his effort is useless because birds eat the bread. Are we, like the characters in the story, unable to retrace the path of knowledge? Must we regenerate all the lost data that supported those conclusions in order to keep advancing? Or should we blindly trust the conclusions drawn from it?

It is assumed that trusting without question, or without being able to reproduce an experiment, is unscientific; questioning is the very act that distinguishes science from dogma. That is why science cannot allow its data to become untraceable, because in doing so it would lose the methodological means by which knowledge is shown to be scientific.

Strictly speaking, then, results based on untraceable data are invalid, since they cannot be corroborated. For example, if we did not have the actual data that Galileo collected in his mechanics experiments, we could no longer be certain of his results, nor could we analyze in depth his experimental work or the theoretical work that emerged from his data. Such is the importance the data carry.

Vines and his team suggest that the data be delivered along with the article to the body that publishes it, so that the publisher maintains a systematic backup of the information it decides to publish. This solution leads to the second problem: how to store the thousands of datasets produced daily, and how to keep them accessible in one or two hundred years. The task is to find a medium that makes this possible. Beyond creating or perfecting the digital media that hold all the data, it would be wise not to depend on just one, but on one or several that do not become obsolete with time. Perhaps it is not unreasonable, at least in special cases, to rely on a physical medium such as paper. After all, scientific books from the past, like those of Galileo, are still preserved and treasured.

A third problem is the divide between the public and the private. Many investigations are financed with private capital, so the data they produce has owners who may or may not decide to make it available to the public.

However, some data are obtained in research projects carried out with state support. Part of what makes a finding valuable is being the first to measure or experiment on a given subject. Once that exclusivity has been secured, that is, once the results have been published, should the data on which they were based be made public?

Several steps have already been taken in this direction, both by scientists on their own initiative and by governments. The anthropologist Lee Berger, author of an important study on the ancestors of Homo sapiens, chose, once his results were published, to make his data available to the public and let the community use it to check his analysis or build their own theories. This was very original at the time (2015) and is relevant to the case at hand, since it combines the possibility for anyone who wishes to trace the data behind a scientist's conclusions with open availability on the network, setting a standard of scientific transparency. What Berger intends is to restore the emphasis to the analysis and theories built on the data, not to exclusivity over it.

Following this example, for six years the journal Molecular Ecology has asked its authors to make available the data of the studies they submit for review. This shows that the practice already exists in some institutions, but it should become common practice, one that avoids the loss of contact with authors or the obsolescence of storage media discussed in this article.

It would be desirable, at least for state-funded research, that once the results and analysis have been published, the data be made public or stored in the repositories of whoever publishes or finances them. Several efforts to realize this idea already exist. The European Union has established that, as of 2014, all articles produced with funding from Horizon 2020 (the EU's program for financing research and innovation) must be accessible. Two routes are proposed: the gold and the green. In the first, a researcher makes the article open access at the moment of publication in a specialized journal, which is economically compensated for ceding the copyright so that the article can be consulted for free. In the second, the scientist deposits the already published article in a public repository some time after its appearance in specialized media. Spain and Mexico have opted for schemes similar to the green route. In Mexico, the Science and Technology Law, the General Education Law, and the Organic Law of Conacyt were reformed in 2014 to adopt the open-access strategy.

Finally, attempts have been made to build giant databases for these purposes, as noted here. Historically, however, scientists have objected and been reluctant to share their data: making it public is too much work, good and reliable databases are lacking, and research funders have no interest in sharing it. In addition, it is difficult to agree on standards for formatting the data.

Another problem is that data stripped of its contextual information can be used fraudulently or erroneously: results may be disclosed that appear supported by the data but in fact rest on a misreading of it, unreviewed by experts.

The loss and conservation of data is a day-to-day problem of scientific work, one for which no completely satisfactory solution has been found. It is a delicate and important issue, because data is the trail of knowledge on which everything investigated rests. Those who see further stand not, as Newton said, “on the shoulders of giants”, but on a trail of scientific data that grows day by day. Something must be done so that we can always follow it.

About the author

Andy Prosper