There’s gold in them there tweets

The private road up to IBM's Almaden Research Centre, out beyond the edges of the last suburbs of San Jose, offers unexpected vistas that let a visitor imagine a California of the past, long before it became the 31st state in 1850.

But a visit to the facility, one of 12 IBM labs over six continents, reveals research on the cutting edge of the future. Much of the work here is "blue sky" research, where staff – including eight fellows, eight distinguished engineers, 13 master inventors and 12 members of the IBM Academy of Technology – work at imagining a technological future at least five to 10 years, and sometimes decades, ahead of now.

“It’s one of the last real industrial labs, so we can take a longer view,” says research staff member Eben Haber. “That kind of timetable is a real luxury in understanding very hard and complicated problems.”

He has a particular interest in the point where people interact with technology, not just technology itself. “Many times, the problem is not, ‘can I do something’, but ‘does it do something for people?’” he says.

A personal fascination is how people use social media, and what can be learned about an individual from their social media posts, the language they use and the pattern of their activity.

A practical application is to help companies understand and market to their customers – or potential customers – better. But when such information has to be gleaned from terabytes of data, the challenge is difficult.

Haber has been doing research that couples large-scale data analysis of posts on the social media site Twitter with established research in psycholinguistics that can discern personality types from the kind of language an individual uses. This could be useful to marketers, as response rates to online marketing campaigns are far lower than those to direct mail or phone campaigns.

With $170 billion spent annually on direct marketing in the US, it is potentially quite a breakthrough – and not just for advertisers. It can be a wake-up call for individuals to understand better the value of their personal data, he says.

“It’s possible to learn a lot from tweets,” he says, especially when they are compared with past data, a Twitter user’s profile information, and other seemingly innocuous details spread around social media.

His experiment, which mined three months of tweets, identified 90 million distinct individuals, of whom 15 million were recently active on the site. The goal was to sift through all those individuals to try and find people who might be interested, say, in taking out a mortgage.

Of the tweets flagged as most likely to indicate such an individual, some 50 per cent turned out to identify people who, on close examination, clearly were interested in buying a house, he says.

“On the whole, this was remarkably precise in trying to find the right people,” says Haber, especially when direct marketing across all mediums tends to be scattergun, with a measly 2 per cent success rate.
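
To put those figures side by side, the arithmetic can be sketched in a few lines of Python. This is only an illustration built on the numbers quoted above: the variable names are invented here, and treating the 50 per cent hit rate and the 2 per cent direct-marketing success rate as directly comparable is an assumption, since the article does not say the two were measured in the same way.

```python
# Figures quoted in the article; the names are illustrative, not from the research.
total_individuals = 90_000_000    # distinct individuals found in three months of tweets
recently_active = 15_000_000      # of those, recently active on the site
flagged_hit_rate = 0.50           # share of flagged tweets that identified likely house buyers
direct_marketing_rate = 0.02      # typical direct-marketing success rate cited


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, rejecting a non-positive denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


# Share of identified individuals who were recently active.
active_share = ratio(recently_active, total_individuals)

# How many times higher the flagged hit rate is than the cited baseline
# (assumes the two rates are comparable, which the article does not confirm).
improvement = ratio(flagged_hit_rate, direct_marketing_rate)

print(f"Recently active share: {active_share:.1%}")
print(f"Hit rate relative to baseline: {improvement:.0f} times")
```

On these assumptions, roughly one in six of the identified individuals was recently active, and the flagged tweets hit their target about 25 times as often as the scattergun baseline.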

His research group used Twitter “because it is the most open of the social media sites” – most tweets are readable by anyone – and also because it is easy to get data for research. Twitter makes it possible to gather tweets directly, and there is a whole market of resellers who package very large Twitter data sets, he says.

But Haber does not just view these potential marketing capabilities from the perspective of an advertiser. He is just as interested in the larger questions – and some of the concerns – that this type of big data analysis exposes.

He avoids using social media himself and notes with bemusement that he is not sure why people are willing to be so open on social media sites. Even something as innocent as a publicly posted birthday wish reveals information useful to marketers – or identity thieves, for that matter, he notes.

He is not sure whether Twitter users will be alarmed by, or indifferent to, having their data mined in this way for direct marketing.

“We’re doing some experiments right now to understand how people feel about it. I think there’s a lot of issues, of people not realising how public some of this is. People give up a lot of information for a free service [such as Twitter or Facebook].”

But that isn’t necessarily the exchange people will want, or that services will offer, in future. “That’s the dominant [business] model right now, but I can foresee better models. For example, you might give your data to a trusted party and you’d receive some portion of the proceeds” that marketers would pay to gain access to your data.

“When it comes down to it, I think people don’t realise what their data is worth. It’s a discussion that needs to be had,” says Haber.

If the techniques he is exploring really work, there would probably be a value for social media users too, he argues. If marketing pitches could be more finely tuned, “you’d receive 10 or 20 instead of thousands” of emails or other online marketing appeals, and more appropriate advertising than, say, the ubiquitous diet tip ads.

While the short-term purpose of his group’s work is to support customers, the longer-term objective is also to do some deep thinking about how technology can support the use of data in places where privacy laws are much stronger than they are in the US (as in Europe).

Haber also wonders whether technology could be used to avoid exposing what type of person you are by the words you use – a kind of psycholinguistic scrambler, perhaps. He is also running an experiment to see how many and what type of responses he gets when sending out tweets to unknown people: a software program automatically sends tweets to thousands of people identified, through the same big data analysis, as being interested in travel to a particular destination. The goal is to attempt to crowd-source recommended sites for travel information.

“I’m surprised at how many people will respond to an unknown entity on Twitter,” he says.

But what surprises him even more “is just how much you can learn from data. From a little piece of information, you can infer a lot.”