Amazon goof shows perils of backhoe in the clouds

Placing your data in the cloud means entrusting your data and computer power to a third party

THE ARCH-ENEMY of internet system administrators used to be the backhoe, the American name for the mechanical digger used to excavate pipes and, occasionally, sever fibre-optic cables. One miscalculated slice through an important internet connection and whole swathes of internet sites vanish.

These days, the underlying net infrastructure is a bit more resilient (with a few exceptions: a month or so ago, an elderly woman digging for scrap metal accidentally cut off most of Armenia's internet access). But mistakes can still be made.

On April 21st, the goof was Amazon’s. As well as selling books and DVDs, Amazon has a side business renting computing power and data storage to companies from its highly connected data centres. Just as Amazon pioneered simple, one-click purchasing for physical goods, it has also pioneered one-click computing. With just a few commands, developers and network administrators can create one, or 1,000, clones of a complete computer system, running their own code in the comforting and high-bandwidth surrounds of Amazon’s data hotels. If you’ve heard the phrase “in the cloud”, chances are it refers to this deputising of Amazon’s networked power to serve other companies’ needs.

Mostly, corporations that outsource their computing, database management and storage to Amazon do so invisibly. But for a few hours last week, you could identify at least a few major websites that depended on them. Popular sites such as Reddit, Quora and Foursquare all stopped working as Amazon lost, and struggled to restore, control of one of its internal networks.

The problem wasn’t physical: no cables were cut, no blackouts took place. Instead, it was a slip of the finger by one of Amazon’s own system administrators. While upgrading part of Amazon’s internal systems, the operator accidentally diverted an unexpectedly large amount of data traffic onto one of the slower networks that Amazon uses internally.

Ironically, that slower background network is meant to improve the reliability of Amazon’s storage service. Amazon’s computers constantly check that they are in contact with, and duplicating data to, a partner computer in its hive of redundant machines. When the connections between these computers became clogged with data, the machines started casting about for new partners to pair with. The resulting confusion filled the back-channel with even more chatter, bringing everything in that part of Amazon’s system grinding to a halt.
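The feedback loop at work here can be captured in a toy calculation (a hypothetical model for illustration, not Amazon’s actual code): each storage machine that loses its mirror partner polls every other machine in search of a new one, so the search traffic itself grows explosively as more machines are orphaned.

```python
# Toy model of a "re-mirroring storm" (illustrative; not Amazon's code).
# Each orphaned storage node polls every other node looking for a new
# mirror partner, so one round of searching generates
# orphaned * (total_nodes - 1) messages on the back-channel.

def storm_traffic(total_nodes, orphaned):
    """Messages sent in one search round by all orphaned nodes."""
    return orphaned * (total_nodes - 1)

# A handful of orphans produces modest chatter...
print(storm_traffic(total_nodes=1000, orphaned=10))   # prints 9990
# ...but orphan most of the fleet at once and the search traffic
# alone can saturate a slow background network.
print(storm_traffic(total_nodes=1000, orphaned=900))  # prints 899100
```

The numbers are made up, but the shape of the curve is the point: the more machines that lose their partners, the more chatter each one adds, which orphans still more machines.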

Amazon’s systems have more than one layer devoted to reliability and redundancy. The failure, Amazon says, affected just a few of its users. But those users included some major start-ups, many of which depended on Amazon for their entire business.

It’s not as if companies using Amazon weren’t warned. The computing provider makes it clear that if you want absolute reliability, you will have to work for it yourself. By and large, the services that failed were those that ran their systems in only one of Amazon’s regional centres. Amazon has a number of networked computer centres scattered across the world, mainly on the east and west coasts of America. It makes clear that if something terrible happens on the east coast, and all the machines you use are based there, your services will be interrupted. You can rent storage and computing power on both coasts, but replicating that data cross-country costs money. It’s a lot cheaper to keep all your systems under one roof, but there’s a cost in dependability.
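The trade-off is simply that every piece of data must be written twice, once on each coast. A minimal sketch of the idea (with in-memory dictionaries standing in for the regional stores; the region names are illustrative, not Amazon’s):

```python
# Sketch of cross-region replication: each logical write becomes two
# physical writes to independent regional stores, so losing one region
# still leaves the data readable from the other. Dictionaries stand in
# for real object stores; region names are illustrative.

class ReplicatedStore:
    def __init__(self, regions=("us-east", "us-west")):
        self.regions = {name: {} for name in regions}

    def put(self, key, value):
        # The cost of dependability: one write per region, with the
        # cross-country copy billed as extra storage and transfer.
        for store in self.regions.values():
            store[key] = value

    def get(self, key, failed_region=None):
        # Read from any surviving region.
        for name, store in self.regions.items():
            if name != failed_region and key in store:
                return store[key]
        raise KeyError(key)

store = ReplicatedStore()
store.put("orders/42", "widget")
# Even with the entire east-coast region down, the read succeeds:
print(store.get("orders/42", failed_region="us-east"))  # prints widget
```

Keeping everything in one region amounts to deleting one of those two dictionaries: cheaper, but one failure away from the outage the article describes.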

In fact, Amazon tries to keep computers in the same centre reasonably isolated from one another: it offers “availability zones”, which should be logically, if not physically, separate on Amazon’s networks. Amazon’s snafu affected not just one availability zone but several zones in the same east-coast region.

You could argue that while the “availability zone” is intended to improve Amazon’s service, it may have misled some customers into thinking they could put all their eggs into one slightly segmented basket.

Amazon has renewed its commitment to making its availability zones more independent of each other. But should its users be looking to be more independent of Amazon itself? Placing your data in the cloud means entrusting your data and your computer power to a third party. But that’s something almost all of us do, whether it’s to our local IT staff, or contracted out to a consultancy.

The difference with the cloud is your relationship with that third party. Amazon deals with hundreds of thousands of clients, and provides exactly the same service to them all. It’s commoditised computing power.

There’s no individual care, and no direct and major consequences if you move away or grow dissatisfied. That’s a world apart from a team working for you as a direct client, or employee.

Amazon’s dominance of the low-end, start-up sector of the market means it’s very hard to sack them. But there are plenty of emerging competitors to Amazon, from agile small companies such as Rackspace to established players like IBM and AT&T.

New companies need to start considering whether they need to spread themselves across geographical regions to protect themselves from the new backhoes – or spread themselves across competing cloud computing providers.