Behind the scenes of Dublin academia, a vast digitisation of historical documents is under way. It may well raise the bar for archiving, writes KARLIN LILLINGTON
AT TRINITY College Dublin, 1641 is coming face to face with 2009, to the benefit of scholars but also the public.
A set of priceless manuscripts – aged, coffee-coloured and covered in the looping, spidery scrawl of many different 17th-century hands – is to emerge as a cutting-edge semantic web and documentation project accessible to all online.
The manuscripts are the famous – or, depending on your view, infamous – 1641 Depositions, eight volumes of 1,559 personal accounts of one of the most violent moments in Irish history, the 1641 Rebellion.
The depositions – witness statements taken mostly from Protestants – describe incidents of murder, rape and pillage but also many aspects of everyday life in the 1600s.
This rich material is valued by social, religious, historical and political researchers but is also the source of hundreds of years of bitter dispute over the truthfulness of the events described.
For many Northern Protestants, the depositions are part of a sense of national identity and they are as symbolic for the Protestant community in Ireland as the battles of the Boyne and Somme, according to the project’s website, available at www.tcd.ie/history/1641.
“The 1641 Depositions are a major treasure, not just for Ireland but internationally,” says TCD history professor Jane Ohlmeyer, one of the principal investigators on the project.
“So we wanted to make them available to a wider audience.”
TCD has collaborated with Aberdeen University and Cambridge University for the project. IBM Ireland has been the main technology partner, with DJ McCloskey of the LanguageWare Group at IBM’s Dublin Software Lab the contact point.
Funded by €1 million in grants from the UK’s Arts and Humanities Research Council, the Irish Research Council for the Humanities and Social Sciences, and the TCD Library, the deposition project is scanning 19,000 pages of text from the depositions to produce high-resolution images. The text is then transcribed by hand, a stage that is close to completion thanks to the work of three postdoctoral researchers.
“These are very, very noisy texts, so they’re being transcribed by three PhDs. No technology is going to be able to do that, and the level of historical expertise it takes to do these is quite extraordinary,” says McCloskey.
Once transcribed according to the Text Encoding Initiative (TEI) guidelines, an agreed format for digitising documents so that they may be readily used and exchanged by scholars and libraries, the texts are marked up using IBM’s LanguageWare software. The markup is written in XML (Extensible Markup Language), a set of rules for creating “metadata” by tagging data so that tagged objects are readily identifiable.
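To give a rough sense of what tagged text makes possible, here is a minimal Python sketch that reads a small XML fragment and lists the tagged objects in it. The element names persName and placeName are borrowed from the TEI guidelines, but the fragment itself, the names in it and the helper tagged_entities are invented for illustration. Nothing here reflects the project's actual markup or the output of LanguageWare, and real TEI files also declare a namespace that this sketch leaves out.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style fragment. The wording and the names are invented
# for illustration and are not taken from the 1641 Depositions.
SAMPLE = """<p>
  <persName>Jane Doe</persName> of <placeName>Exampletown</placeName>
  deposed that <persName>John Roe</persName> took her cattle.
</p>"""


def tagged_entities(xml_text):
    """Return (tag, text) pairs for every element nested inside the root."""
    root = ET.fromstring(xml_text)
    return [
        (el.tag, (el.text or "").strip())
        for el in root.iter()
        if el is not root  # skip the enclosing <p> itself
    ]


entities = tagged_entities(SAMPLE)
for tag, text in entities:
    print(f"{tag}: {text}")
```

Because each name is wrapped in a tag, a program can pick out people and places without having to interpret the surrounding 17th-century prose.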
Such work would be impossibly time-consuming if tagging were done by hand, says McCloskey.
“There’s a social network within these depositions themselves. We’re building up a semantic network – a Web 3.0 view of that world,” says McCloskey.
Ohlmeyer says this is a first for the humanities – LanguageWare has been used in the health sector and law enforcement, but not in this way.
IBM was interested in the effort because its researchers felt that if LanguageWare could tackle texts this complex and archaic, it could manage just about anything, says McCloskey.
“From a purely technical challenge view – which of course drives us – this is a perfect project. If we can analyse this, well, you can pretty much analyse anything.”
Ohlmeyer agrees: “We have all this dirty data that makes a great sandpit for them to play in.”
The fact that IBM’s LanguageWare group is based at the Dublin Software Lab made a project with such significant Irish historical ramifications especially meaningful for researchers, McCloskey says.
The digitisation has made the content of the depositions available to researchers in some startling new ways, says Ohlmeyer, revealing fresh, previously unseen patterns and relationships.
“Once data is marked up in TEI, the possibilities are just endless.
“For example, you can link an atrocity to a person to a place. We hope to do a sophisticated linguistic analysis on the depositions, and also are looking for grants to enable the data to be mapped.”
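The kind of linking Ohlmeyer describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, identifiers, names and the helper build_index are all hypothetical, and the project's own data model is not described in this article.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only. Each one ties an
# event to a person and a place, as tagged text might allow.
RECORDS = [
    {"deposition": "dep-001", "event": "robbery", "person": "John Roe", "place": "Exampletown"},
    {"deposition": "dep-002", "event": "assault", "person": "John Roe", "place": "Sampleford"},
    {"deposition": "dep-003", "event": "robbery", "person": "Richard Poe", "place": "Exampletown"},
]


def build_index(records):
    """Index the records by person and by place so either can be looked up."""
    by_person = defaultdict(list)
    by_place = defaultdict(list)
    for rec in records:
        by_person[rec["person"]].append((rec["event"], rec["place"], rec["deposition"]))
        by_place[rec["place"]].append((rec["event"], rec["person"], rec["deposition"]))
    return by_person, by_place


by_person, by_place = build_index(RECORDS)

# Every event linked to one person, and every person named at one place.
print(by_person["John Roe"])
print([person for _, person, _ in by_place["Exampletown"]])
```

Once such links exist, the same data can be read from either direction: by the person accused or by the place where the events were reported.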
The first tranche of data, from Ulster depositions, will be available next month, with the remainder to follow early in the new year.
Ohlmeyer sees the project, the first of its kind, as revolutionising research in the humanities because such large-scale views of complex documents have never been available before.
To that end, the research partners have applied for three further grants “to start creating a generic tool for humanities research and commercial applications”, she says.
“By working with industry, we’re creating tools that we could roll out to wider communities to work with texts. They could be of huge interest for museums, for galleries, for libraries, social scientists, lawyers and many groups.”
She notes the tools are based on open-source code and would therefore be freely available, although there is also the possibility of creating IP (intellectual property) for Trinity, and perhaps commercial opportunities that could flow from that.
“But in terms of what’s driving it from our end, it’s pure research,” she says.
“What was clear to me is this kind of overlap between the humanities and technology has not really been happening.
“This project is truly interdisciplinary and multidisciplinary.”
The depositions project could become a marker for future directions for digitising resources and doing research, she believes.
“If we get this right, there’s a paradigm shift here.”