Real-time data crunching may be about to take the net by Storm

The source code of Storm analytics software could help drive a revolution in how we manage web data

MOST PEOPLE I know are shocked and upset when Twitter is down. The “fail whale” – the graphic of a whale being held up by small birds that Twitter offers by way of cutesy compensation – is no recompense for the sudden inability to be distracted by friends and celebrities on demand, it appears.

By contrast, I myself am disturbed at the thought that we live in a world where Twitter remains online hour after hour, without missing a beat or omitting a single 140-character missive.

It’s not the triviality of the messages that makes me marvel – it’s the sheer scale of the task Twitter has taken on. Millions of simultaneous users, each of whom may have thousands of followers, sending and searching and hitting reload on webpages constructed from dozens of independent chunks of data, updated in real time.

This isn’t a project on the scale of a few weeks’ coding jaunt (although it began as one). It’s more like the infrastructure of a financial-services market, running 24 hours a day, seven days a week, but without the budget, regulatory oversight or subsidies. And it’s offered, in effect, for free for anyone to use.

This isn’t a trick that only Twitter manages to pull off. Other startups live off the data that Twitter, Facebook and other fast-changing services emit, processing and analysing it as quickly as they can gather it.

Tricks for handling torrents of information like this used to be thin on the ground. You would have to be working somewhere like Nasa or Google even to begin to understand the challenges (and opportunities) of “big data”. These days, there’s no shortage of data and, thankfully, the modern programmer has been handed at least the basic tools to deal with it.

For instance, much of the underlying structure of the modern real-time web emerged from papers published by Google engineers over the past 10 years.

MapReduce, Google’s in-house system for managing large amounts of data over many distributed computers, was a revelation when it was first described. It put forward a simple system for batching together, calculating and then collecting data on a scale that, even a few years previously, had seemed a narrow, almost academic concern. Crunching terabytes of information over thousands of processors looked like an impossible challenge of organisation and planning.

Google’s engineers boiled it down to a simple system, which almost any coder could pick up and start playing with.
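By way of illustration, here is a toy word count in Java with the shape of a MapReduce job, squeezed on to a single machine. Everything in it – the documents, the class and method names – is invented for illustration; the point is only that the programmer writes two small functions, map and reduce, and the framework does everything else, including running those functions in parallel across thousands of machines and handling the grouping (the “shuffle”) in between.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A toy, single-machine word count with the shape of a MapReduce job.
public class MiniMapReduce {

    // Map: one input record in, zero or more (key, value) pairs out.
    static List<String[]> map(String document) {
        List<String[]> pairs = new ArrayList<String[]>();
        for (String word : document.split(" ")) {
            pairs.add(new String[] { word, "1" });
        }
        return pairs;
    }

    // Reduce: all the values collected for one key in, one result out.
    static int reduce(String word, List<String> counts) {
        int total = 0;
        for (String count : counts) {
            total += Integer.parseInt(count);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList(
                "the quick brown fox", "the lazy dog", "the fox");

        // Shuffle: group every emitted value under its key. A real
        // framework does this between the two phases, across machines.
        Map<String, List<String>> grouped =
                new HashMap<String, List<String>>();
        for (String document : documents) {
            for (String[] pair : map(document)) {
                List<String> values = grouped.get(pair[0]);
                if (values == null) {
                    values = new ArrayList<String>();
                    grouped.put(pair[0], values);
                }
                values.add(pair[1]);
            }
        }

        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            System.out.println(entry.getKey() + " = "
                    + reduce(entry.getKey(), entry.getValue()));
        }
    }
}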

Companies like Yahoo have built on those ideas and in turn given their contributions back to the wider community – primarily through Hadoop, an open-source project that took MapReduce and made it a commodity anyone could use.

These days, it is companies like Twitter – and the ecosystem that surrounds them – which are pushing these frontiers of common knowledge.

Last month, Nathan Marz, whose company BackType was bought by Twitter in July for an undisclosed sum, released the source code to Storm – the software that had driven his Twitter analytics website, BackTweets.

Storm makes its predecessors’ ambitions seem positively languid. Hadoop and MapReduce take a pile of data and slowly grind it into a stack of calculated results. They are the kind of software that can take Google’s stockpile of pages from across the web and turn it into an index you can search.

Storm is about taking the firehose of data that Twitter generates and producing results as quickly as the data arrives.
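To give a flavour of what that looks like to a programmer, here is a minimal word-counting sketch written against the Java API that shipped with Storm’s open-source release. The class names and the made-up stream of words are illustrative only; a real “spout” (Storm’s term for a stream source) would read from the Twitter firehose or a message queue rather than inventing words.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

import java.util.HashMap;
import java.util.Map;

public class WordCountTopology {

    // Spout: the source of the stream. A real one would read from the
    // Twitter firehose or a queue; this one invents words (assumption).
    public static class WordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] words = { "storm", "hadoop", "twitter" };
        private int i = 0;

        public void open(Map conf, TopologyContext context,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            // Storm calls this in a loop; each call emits one tuple.
            Utils.sleep(100);
            collector.emit(new Values(words[i++ % words.length]));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Bolt: a processing step. The count is updated the instant each
    // tuple arrives -- no batch, no waiting for the data to settle.
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Integer> counts =
                new HashMap<String, Integer>();

        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            Integer count = counts.get(word);
            counts.put(word, count == null ? 1 : count + 1);
            System.out.println(word + " -> " + counts.get(word));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt: nothing is emitted downstream.
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new WordSpout(), 1);
        // fieldsGrouping routes every tuple carrying the same word to
        // the same CountBolt instance, keeping per-word counts correct
        // even when the bolt runs on several machines.
        builder.setBolt("counter", new CountBolt(), 2)
               .fieldsGrouping("words", new Fields("word"));

        // LocalCluster simulates a Storm cluster inside one process.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", new Config(),
                builder.createTopology());
        Utils.sleep(10000);
        cluster.shutdown();
    }
}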

I’m not sure many will take Storm and immediately build their companies on it, as happened with Hadoop. It’s written in a slightly obscure language – Clojure – and still awaits the community of developers that would keep it safe from decay, now that its main developer, Marz, works for Twitter.

But what it does provide is a set of terms – streams, spouts, bolts – to describe the shape of the problem that Marz and many like him are trying to solve in companies around the world.

A decade ago, few of us would have had enough data to worry about how to distribute calculations across hundreds of computers. Three years ago, many of us would have had that much data, but few of us would have cared about doing anything with it in seconds rather than hours.

In the next few years, a large chunk of corporations and individuals will, like Twitter, want real-time data. And we’ll either be writing code to handle it ourselves or paying someone else a handsome figure to do it for us.

Storm, like Hadoop and MapReduce before it, is not exactly breaking new ground in the academic or scientific understanding of how to handle big data. But its appearance as a complete, usable open-source product signals a move from today’s world, where companies like Twitter and a few smart people like Marz understand how to deal with that data in real time, to a time when no one will have an excuse not to know how to do just that.

Right now, Twitter seems a singular data-crunching miracle to me. If the ideas behind Storm spread quickly enough, within a few years, it’ll be one of thousands of similar real-time miracles.