The Hiberlink project builds directly upon a pilot study from Los Alamos National Laboratory (LANL), powered by their Memento Time Travel for the Web technology presented at OR2011 and available at http://arxiv.org/abs/1105.3459.
That study found that as many as 30% of the http:// links in a selection of 400,000 arXiv.org papers no longer functioned: 8% had been archived, but 22% had not and were therefore lost for good. A startling 46% of the resources that were still available had not been archived and were hence in danger of disappearing without a trace.
Using text mining and information extraction tools developed by the Language Technology Group (LTG) at the University of Edinburgh's School of Informatics, the project will examine a vast corpus of online scholarly publications to assess which links still work as intended and which web content has been successfully archived, and is therefore preserved for use by future researchers and students.
The ‘reference rot’ problem has two aspects. First, the http:// link that references a resource may no longer work. Second, the content at the end of the link may have evolved and may even have become dramatically different from what was originally referenced. So when an online scholarly work is eventually revisited and a researcher double-checks its references, whether to confirm evidence, to establish context, to inform policy and decision making or for any other practical purpose, the original information on websites or in online databases may have changed or even ceased to exist.
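The two aspects described above can be told apart mechanically once a few facts about each link have been gathered. The following is only a minimal sketch under stated assumptions, not the project's actual tooling: the names `Status`, `Observation`, `classify` and `summarise` are hypothetical, the example.org URLs and digest strings are made-up placeholders, and comparing content digests is a deliberately crude stand-in for deciding whether content has "evolved".

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class Status(Enum):
    WORKS = "works as originally referenced"
    LINK_BROKEN = "link no longer works"          # first aspect
    CONTENT_CHANGED = "content has changed"       # second aspect
    UNKNOWN = "cannot tell without both snapshots"


@dataclass(frozen=True)
class Observation:
    """Facts about one cited link, assumed to be gathered elsewhere."""
    url: str
    resolves: bool                     # does the link still return a resource?
    digest_when_cited: Optional[str]   # content digest at citation time, if known
    digest_now: Optional[str]          # content digest today, if reachable
    archived: bool                     # does a web archive hold a copy?


def classify(obs: Observation) -> Status:
    """Assign one link to one of the two aspects, or to neither."""
    if not obs.resolves:
        return Status.LINK_BROKEN
    if obs.digest_when_cited is None or obs.digest_now is None:
        return Status.UNKNOWN
    if obs.digest_now != obs.digest_when_cited:
        return Status.CONTENT_CHANGED
    return Status.WORKS


def summarise(observations: Iterable[Observation]) -> Counter:
    """Count links per (status, archived) pair."""
    return Counter((classify(o), o.archived) for o in observations)


# Hypothetical sample data, for illustration only.
sample = [
    Observation("http://example.org/a", False, "d1", None, True),
    Observation("http://example.org/b", False, "d2", None, False),
    Observation("http://example.org/c", True, "d3", "d3-changed", False),
    Observation("http://example.org/d", True, "d4", "d4", True),
]
counts = summarise(sample)
```

Pairing each status with the `archived` flag separates links that are broken but archived from those that are broken and lost, and flags working links that have no archived copy yet.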
Our research will deliver a variety of quantitative characterizations of the extent to which the context that surrounded a scholarly paper at the time of its publication can be recreated later on, and of the extent to which the ability to do so depends on properties of the publication venue, the publication, and the cited resources.