Valuable information relating to recent events in Ireland could be lost to future generations because of a failure to properly collect and preserve online media, a leading archivist has warned.
Ger Wilson, head of digital collections at the National Library of Ireland, said the library's research shows that as much as 50 per cent of website content can disappear within a year, making it "highly likely" that some critical material has already been lost.
She was speaking after the library issued a tender notice for an extensive crawl of Irish-registered domains later this year. The crawl is part of an attempt to archive the Irish web so that historians of the future will be able to see what the local internet looked like in 2017.
Web content is by its nature ephemeral: websites disappear completely or are frequently updated, so valuable information can easily be lost.
‘Digital dark age’
Vint Cerf, regarded as one of the founding fathers of the internet for his role in developing the TCP/IP protocols, warned during a recent visit to Dublin of a “digital dark age” for humanity because of the failure to preserve digital content.
Rather than taking a simple snapshot, the proposed crawl is intended to archive entire websites so that future generations in Ireland will be able to access historic sites just as they can be used now, with working hyperlinks and so on.
The National Library, whose mission is to collect, preserve, promote and make accessible the documentary and intellectual record of the life of Ireland, undertook a similar crawl in 2007, but this has never been made publicly available.
Last week it issued a new tender notice seeking a technical partner to help it carry out a new crawl, which is expected to be completed by late November. The crawl will capture the more than 230,000 websites which belong to the .ie domain as well as other sites that can be identified as being hosted within Ireland and others that are considered to be of Irish interest.
Speaking to The Irish Times, Ms Wilson said the library was concerned that historical online material from the last 20 years had disappeared and was taking steps to ensure that a record of Ireland's internet landscape was available.
"We collect all paper material published in Ireland under legal deposit and have done since the library was first established but there is no legislation covering digital, although we are working with the Department of Jobs, Enterprise and Innovation, which has responsibility for copyright legislation, to address this," Ms Wilson said.
She said about 60 per cent of national libraries in Europe had legislation enabling them to carry out national domain crawls. “We’d like to be in this situation at the very least because websites are so vulnerable. Some institutions, such as the Royal Library in Copenhagen, carry out four crawls a year.”
Ms Wilson said concern over the loss of online material led the library to begin selective web archiving of key events in Irish life, such as elections and referendums, in 2011. However, it has to seek permission from individual websites to carry out such work.
Some of this collected material is available to the wider public, and there are plans to make both the 2007 crawl and the forthcoming one available to those using the library’s reading rooms.
Ms Wilson said that while crawls can be successful in capturing key information, some valuable data, such as embedded video, can be problematic. In addition to archiving website information, the library also selectively captures social media such as tweets.
A potential concern for archivists is the introduction of the General Data Protection Regulation (GDPR), which is due to come into force next May.
“The legislation is a concern for the archive community because while overall it is a positive development we feel there should be derogation for archival collection of material that is for the public good,” said Ms Wilson.