Recent outages prove need for transparency around internet infrastructure

In his 2001 book, Fooled by Randomness, the former risk analyst and financial options trader Nassim Nicholas Taleb observed that an entire body of thought can collapse if any one of its fundamental assumptions is disproved. As one example, he noted that for at least 13 centuries, Europeans had believed all swans to be white, since all known European historical archives recorded that swans always have white feathers. However, in 1697, the Dutch sea captain Willem de Vlamingh, on a rescue mission searching for survivors from a compatriot ship lost two years earlier, explored a river on the coast of "New Holland" and was astonished to observe black swans. He named the estuary the Swan River, and today the Australian city of Perth stands on its shores.

Taleb’s subsequent 2007 book, The Black Swan, documents numerous examples of what he termed “black swan events”, each of which was outside the realm of historical expectations and had considerable impact when it occurred, but in hindsight might have been entirely predictable (and even avoidable).

On Tuesday last week, substantial disruption and damage were caused worldwide by a black swan event in the infrastructure of the internet. An unnamed customer of Fastly, an American services provider based in San Francisco, made a routine (and entirely legitimate) change to the settings for its Fastly service. The change triggered a bug within the Fastly software, which then for several hours crippled a considerable number of news websites, public information websites (including those of both the White House and the UK government), and various ecommerce websites.

Then, last Friday, one of Fastly's chief competitors, Cloudflare, which is also based in San Francisco, failed for many of its customers in Los Angeles and Chicago, including users of the popular messaging service Discord. Although service was restored reasonably quickly, the incident, coming so soon after the Fastly event earlier the same week, was a further reminder of the fragile nature of internet infrastructure.

The internet has become a pervasive tool across much of the planet, at the heart of business systems and ecommerce, global news and social media commentary, health and public welfare. No lives are known to have been directly lost due to last week’s outages, but some ecommerce sites complained of damage to their revenues beyond just minor inconvenience. Fortunately, the failures were relatively short-lived. But for smaller enterprises and start-ups that have come to trust internet infrastructure for both their sales and supplies, any lengthy loss of service could be catastrophic given their relatively limited balance sheets.

There has been a leap of faith in the last two decades, actively promoted by major internet service providers, to put “everything” into the “cloud” and so dispense with local computing resources and software applications run on-site within an organisation’s premises. Vendors advocate that their software be used on a subscription basis in the cloud rather than purchased and owned, and that computing should be an operational expense rather than capital investment. The implication is that system-wide failure by the major software vendors is highly unlikely. However, last week has shown that while the facade may be elegant, some service architectures may be but a house of cards. Third-party independent verification of the reliability claims and assertions from cloud vendors is currently limited. Has the time come for the internet industry to finally wise up?

Other, more mature, industries have developed ways to cultivate and disseminate best practice, and to analyse, publish and so learn from incidents and accidents. The aviation industry established its first accident investigation committee just nine years after the first powered flight, in an initiative taken in 1912 by the Royal Aero Club. The maritime industry has a long history of accident investigation, in part driven by insurers keen to understand how and why incidents had occurred. Ethics in the medical sector emphasise that, in appraising a situation, the first duty is to ensure that no (further) harm is done.

The investigation of incidents need not necessarily result in assignment of blame and liability. Rather, the emphasis is more frequently on identifying the flaws in processes which led up to the event, and auditing the preventative procedures which were supposed to preclude the incident from happening. The goal is not only to ensure that the accident is not repeated, but also to ensure that related accidents cannot occur anywhere in the future. Incident reports are published to partners and competitors alike and so made widely available across an industry. Whistle-blowing legislation is in place in many countries, and professional and regulatory bodies can add their considerable weight and power when appropriate. Despite change from continuous innovation and advances in technology, a learning culture emerges throughout the industry, as peer pressure and professionalism disdain any who are seen to repeat the well-publicised mistakes of the past.

It is time for the specialists in the internet industry to become transparent, open and professional. The rest of us have placed tremendous faith, confidence and investment in their artefacts, and while mistakes may occur, recurrence is frankly no longer an acceptable option.