Archiving the Net: ‘Preserving the web isn’t impossible’

Our experience of modern history is informed as much by newsreels, magazines and personal journals as by the physical artefacts we see in museums.

Imagine our understanding of the Second World War without footage of Adolf Hitler delivering spittle-flecked speeches to large crowds in Nazi Germany, posters from the British Government urging the public to "make do and mend", or Anne Frank's diary.

How will historians of the distant future make sense of Ireland's vote to repeal the 8th amendment, Brexit, or Trump's America? They will wade into tweets, slog through YouTube and trudge through news websites. These digital expressions are a window into how we live.

It feels as though they will be there forever for us to sift through, but it is not unthinkable that Facebook or Twitter could suffer the same fate as MySpace and Geocities, two enormously successful social media platforms that, despite having millions of users at their peak, faded into irrelevance.

Data

Even with the best of intentions these companies can lose valuable data. Earlier this year MySpace admitted that all photos, videos and music uploaded by its users between 2003 and 2015 had been accidentally deleted in a botched server update. There goes 12 years of primary sources on noughties youth culture (well, six years really, because no-one has really used MySpace since 2009).

The MySpace server snafu is a reminder that we are entrusting our memories to private companies even though it feels like our photos and conversations are sitting in a shared public space. So how do we ensure that our collective internet history is saved for posterity? And why is it important that we should do so?

"Preserving the web isn't impossible. Yes, it feels like there is a metric ton of ephemera posted to the internet every single day. But who knows what tweet or blog post or Instagram comment will be the moment that captures a future president of the United States at the very moment that her political views were being formed," says Brian McCullough, host of the Internet History Podcast and author of How the Internet Happened: From Netscape to the iPhone.

McCullough is passionate about preserving internet history and has even given a TED talk on the cultural value of social media for future historians. During the talk he shows the audience a dated-looking webpage that once sat on web hosting company Angelfire’s servers.

It appears to have been made by a then 15-year-old Mark Zuckerberg. Amongst the cheesy content is a Java applet displaying a rudimentary social graph he was working on. A snapshot of history in the making.

“I guess all of my work is because I was the perfect age to come of age when the web and the internet went mainstream. There are historians who focus on preserving the punk era of the ’70s because that is when they came up. Or obsess over the automobiles that were the cutting-edge designs of the ’50s.”

“Maybe all historians are narcissists,” muses McCullough, adding that his goal is to capture and preserve the experience of a nascent internet while helping tell the story of how we got to the internet of now and who got us there.

"If you did a thought experiment and could get the first rudimentary online output of an Obama or an Einstein or a Bob Dylan, wouldn't you want to have that, even if it seemed mundane? The thing we fear about social media – that our unfiltered or undressed-up impulses and thoughts are out there – think of what a Godsend that could be for historians."

The Wayback Machine

There are several projects that recognise this opportunity. The Internet Archive is the biggest and best example: this San Francisco-based non-profit organisation has been saving the web since 1996. They are most famous for the Wayback Machine, a search engine that has saved over 351 billion webpages to date.

Now-defunct websites and older versions of current sites are available to peruse, and the Archive encourages all web users to take part. The Wayback Machine plug-in for the Chrome browser allows users both to save pages and to view older versions.

Their efforts are of particular significance this month as Google shuttered its Google+ platform. It may not have been in the same league as Twitter or Facebook, but it still represented a significant chunk of user content that is now deleted for good, including reader comments from countless websites and blogs that used it as a commenting plug-in.

Google offered users the opportunity to download their own personal data but it did not keep any of this content itself. However, Archive Team – a voluntary collective of historians, technologists and archivists from around the globe – began archiving efforts for public G+ content a while ago.

The Archive Team, according to a Reddit post by user dredmorbius, reported that as of the April 2nd shutdown, 98.5 percent of listed profiles and over 90 percent of G+ communities had been saved.

This content is expected to go live on the Wayback Machine within a few weeks, representing the largest single project completed by the team to date. These voluntary efforts, funded by charitable donations, are admirable, but shouldn’t individual companies consider the future value of archiving portions of their users’ content?

“Any tech platform should set aside a certain budget – it can be infinitesimally small, given their turnover, but they should have to allocate for it, to provide for the preservation of content on their platforms. Moore’s Law makes this exceedingly affordable over time,” says McCullough.

But we must also remember that from a legal perspective these (mostly free) social media services don't have a duty to preserve our digital memories. By using them we also agree to terms and conditions that include the loss of our data if the service shuts down. Case in point: Geocities, Bebo, MySpace, Storify, Google Video, Twitpic, Vine, Yahoo! Podcasts, del.icio.us; the list goes on and on.

Terms and conditions

The question is not about who really owns the user content on all these sites; the question is what the user is paying for, says Dr Rónán Kennedy, lecturer and researcher with the Faculty of Law at NUI Galway. Are we paying for long-term storage, and is the service even promising it?

“Some of these online social media sites like to give the impression that they are essentially public spaces, but they are not. They are private spaces and entry into them is governed by contract. I do completely understand that there is this imbalance in terms of bargaining power between the end user and the big multinational social media company but that’s the way contract law works,” explains Kennedy.

“You’re assumed to have read the terms and conditions. Not that I am not sympathetic: these users, myself included, have clicked on a box that says ‘I have read and understood the terms and conditions’ without actually having read and understood the terms and conditions. And many of these services are free and there is an element of ‘you get what you pay for’ at the end of the day. And for people who have lost data important to them this is difficult.”

The internet, however, is still very young as a medium, explains Kennedy: “We’re just learning what works and what doesn’t. I do wonder if in five or ten years’ time we will still be relying so much on free services because – indirectly related to what we’re talking about – there has been a lot of scandal lately around how social media data is used and abused.

“I think we’re beginning to look at all this now and think, well, it seems like it’s free, but there are other costs that are not financial. Whether or not that will lead to a reconsideration of ‘free’ as a model for the internet, I don’t know. I like free, everyone likes free. I think we’re starting to take a harder look at what it is we are getting for that ‘free’.”

But this is my data, these are my precious memories, you may well think. However, ownership of these photos, videos and social media posts does not mean you are entitled to have a social media service store them for you indefinitely. These companies do not have a legal obligation to return content or save it, clarifies Kennedy.

Legalities

Data protection law might offer some recourse against the deletion of personal data because deletion is a form of processing, and if a company is deleting data without your consent you might be able to prevent this, he explains. But the detail will vary from service to service, as different platforms set down different rules in their terms and conditions. There is also the time and cost involved for an individual in fighting such a case and lodging a complaint with the data protection authority.

“In theory there are remedies under data protection law and if you can prove it is personal data – photos of you and your family – you might be able to use this law but for an ordinary individual I’m not sure how realistic this is.”

Of course, if digital preservation were to be incorporated into data creation it would solve these problems by automating archiving of web content, explains McCullough.

“The internet was built on open, open-source protocols. Why not create some sort of rudimentary-level protocol of data (low cost and low bandwidth and data intensive) that could underlie all contemporary data production? A sort of carbon copy of everything we produce?

“It would be a base level of digital formatting that, in theory, even an alien civilization 1,000 years from now could figure out how to parse and create formats to play. Again, going forward, it could become trivial to produce and store. Let’s lay the groundwork now.”

This notion of making our internet data readable to alien civilisations in the distant future is admirable but right now, we’re still fighting digital obsolescence within a few decades. The best example of this is the Domesday Project initiated by the BBC in the 1980s.

To mark the 900th anniversary of the original Domesday Book, a census of 11th century England, the BBC gathered information on UK citizens between 1984 and 1986. This was placed on LaserDisc readable only on a specific BBC computer. By 2002, a preservation project had to be carried out because the material was on track to become completely inaccessible. Meanwhile, the original Domesday Book – captured on low-tech paper – is still perfectly legible.

McCullough observes: “The rise of the internet is either the greatest boon to history and preservation, or it is a potential new dark age where we will lose an entire era of context. It’s our time to choose. Right now.”