Finding the hidden patterns in big data

From medical research to reducing retail food waste, analysis of huge data sets has invaluable applications

At IBM’s Almaden research centre in the hills just outside of San Jose, big data is a big part of current research.

While much of the work that happens at Almaden is "blue sky" research that looks far into the future, many of the big data projects, involving collaborative work with current IBM clients, are likely to have more immediate applications.

One such project hopes to find ways of better managing outbreaks of foodborne illness. Working with the German Federal Institute for Risk Assessment, researchers have combined data on the sale of food with public health case records.

In theory, the data needed to set up a monitoring system already exists: outbreaks are tracked in most countries, and food shipments are traceable thanks to sensors and tagging systems.

"The idea is that a lot of data already exists on public health problems," says James Kaufman, manager of the public health research project in the department of computer science at Almaden. Working with existing sources of data, and combining those large data records in new ways, can give invaluable insights into outbreaks of serious foodborne illness, he says.

The data could even reveal an impending outbreak before it occurs.

His group showed that by mapping reported outbreaks across a country, and also monitoring the sale and consumption of food, the outbreaks themselves can be tracked effectively. The likelihood of a given food causing an outbreak can also be estimated, helping scientists to pinpoint the source of an outbreak more quickly.
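To give a flavour of how such an estimate might work, here is a minimal sketch in Python. It is not the Almaden group's method: the foods, regions, figures and the function name `rank_foods` are all invented for illustration, and the scoring rule (a simple log-likelihood of the observed cases given each food's regional share of sales) is an assumption standing in for whatever model the researchers actually use.

```python
import math

def rank_foods(sales_by_food, cases_by_region, smoothing=1e-6):
    """Rank foods by how well their regional sales pattern explains the cases.

    Hypothetical scoring rule: treat each food's share of sales per region
    as the probability that a case linked to that food turns up there, then
    sum the log-probabilities of the observed cases.
    """
    scores = {}
    for food, sales in sales_by_food.items():
        total = sum(sales.values())
        score = 0.0
        for region, cases in cases_by_region.items():
            share = sales.get(region, 0.0) / total if total else 0.0
            # Smoothing avoids log(0) for regions where the food is not sold.
            score += cases * math.log(share + smoothing)
        scores[food] = score
    # Highest (least negative) score first: the most plausible culprit.
    return sorted(scores, key=scores.get, reverse=True)

# Invented example data: units sold per region, and reported cases per region.
sales_by_food = {
    "sprouts": {"north": 900, "south": 50, "east": 50},
    "lettuce": {"north": 300, "south": 400, "east": 300},
    "tomatoes": {"north": 100, "south": 100, "east": 800},
}
cases_by_region = {"north": 45, "south": 3, "east": 2}

ranking = rank_foods(sales_by_food, cases_by_region)
print(ranking)
```

In this made-up example the cases cluster in the north, where sprouts are mostly sold, so sprouts come out on top of the ranking. A real system would have to allow for reporting delays, population differences and foods that are sold everywhere.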

But just as important, knowing which food is at the centre of an outbreak, and knowing its source, means less uncontaminated food is removed from store shelves and discarded.

Kaufman cites 2011 data indicating that the medical cost of contaminated food was $9 billion – but the cost to retailers, in good food that was either dumped or left unsold because of consumer fears, was $75 billion.

Saving lives
There's also a clear public health benefit in illness avoided or treated swiftly, and in lives saved, as well as a financial incentive for industry, given those costs and the risk of unwarranted reputational damage (think of the initial panic during the recent horse meat scandal in Europe).

But setting up such a monitoring system would need co-operation that currently does not exist.

“We’re suggesting a public-private partnership of private food producers and public health departments, working together,” Kaufman says.

Meanwhile, a colleague is taking a different approach to another health issue – using big data for textual analysis on a massive scale to help accelerate pharmaceutical research.

"Developing one drug is a 10- to 15-year cycle," says Ying Chen, a researcher, master inventor and manager in IBM Almaden services research.

“They say developing drugs is like casino gambling. The blind alley is a big problem in this industry. The drug failure rate is around 90 per cent.”

But put the concentrated computing power of Watson – the IBM supercomputer that beat two human opponents in 2011 on the US gameshow Jeopardy! – to sifting through the text of 20.6 million Medline medical journal articles and 16.2 million patent applications, and the odds of finding the best candidate drugs and therapies could shorten considerably.

From this information, IBM has extracted searchable bits of information called “semantic entities”. These include 12 million chemicals, 7,000 diseases, 22,000 drugs and 130,000 genes.

Textual analysis
The same textual analysis abilities that enabled Watson to understand and swiftly answer natural language queries in a gameshow – and which IBM is working to commercialise for products and services – lie behind Chen's research.

Information is extracted from the articles and made searchable in ways more complex than, say, a Google search. The result is a kind of LinkedIn for chemical structures, genes and proteins, she says.

“Then we bring in other data clients might have,” she says, such as human case records, details of existing pharmaceuticals and other scientific work.

“There’s a whole process with different sets of users and scientists generating different types of information, that could give information to influence the development of a drug. How to bring all this together? Connecting the dots – it’s kind of like that.”

In seconds, computer analysis can spot hidden patterns, letting a researcher know that a chemical compound has not yet been patented but is known to interact with a gene responsible for a given cancer. It can also reveal other related genes, unveiling potential pathways along which a new drug might work.
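The kind of query described above can be sketched in a few lines of Python. This is an illustrative toy, not IBM's system: the compound, gene and disease names are invented, the three small data structures stand in for the millions of extracted "semantic entities", and the function names `candidate_compounds` and `related_genes` are hypothetical.

```python
# Invented stand-ins for relationships extracted from articles and patents.
interactions = [            # (compound, gene it is reported to interact with)
    ("compound_A", "GENE1"),
    ("compound_B", "GENE1"),
    ("compound_C", "GENE3"),
]
gene_disease = [            # (gene, disease it is linked to)
    ("GENE1", "cancer_X"),
    ("GENE2", "cancer_X"),
    ("GENE3", "disease_Y"),
]
patented = {"compound_B"}   # compounds that appear in a patent filing

def candidate_compounds(disease):
    """Unpatented compounds that interact with a gene linked to the disease."""
    genes = {g for g, d in gene_disease if d == disease}
    return sorted({c for c, g in interactions
                   if g in genes and c not in patented})

def related_genes(gene):
    """Other genes linked to any disease that the given gene is linked to."""
    diseases = {d for g, d in gene_disease if g == gene}
    return sorted({g for g, d in gene_disease
                   if d in diseases and g != gene})

print(candidate_compounds("cancer_X"))
print(related_genes("GENE1"))
```

Here the first query returns only the unpatented compound tied to the cancer gene, and the second surfaces a second gene linked to the same disease. The hard part in practice is the step this sketch takes for granted: reliably extracting those relationships from tens of millions of documents.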

“An individual just cannot do this,” says Chen.

The same big data techniques could have a broad range of applications across many different industries, she notes, such as semiconductor or materials research, consumer product innovation, or genome research.