Peter Norvig has been attempting to explain the unreasonable effectiveness of data, writes KARLIN LILLINGTON
IT IS strangely satisfying to find that senior Google people generally excel in the area of odd adjunct career feats.
The internet giant’s director of research, Peter Norvig, is no exception and can cite two.
First, there’s his Gettysburg Address PowerPoint presentation, a hilarious, hugely popular rendering of Abraham Lincoln’s eloquent speech into six PowerPoint slides. As a commentary on the deflating effect slides and bullet points can have on the art of modern rhetoric, nothing surpasses it.
Second, he lays claim to having produced the world’s longest palindrome sentence, a sentence that reads the same forwards and backwards. His record is a mind-boggling 17,826 words, produced by a computer program he likes to keep tweaking towards further palindromic perfection.
Curious pastimes aside, Norvig comes with serious research credentials, of course. Prior to his current role, he was director of search quality at Google – a pivotal role in a company that triumphed a decade ago by becoming the definition of a quality search. He has also worked at Nasa as head scientist and done some academic time too (no doubt where the familiarity with PowerPoint comes from).
Norvig was in Ireland this week to give the annual Boyle Lecture at University College Cork. Entitled The Unreasonable Effectiveness of Data – How Billions of Trivial Data Points can Lead to Understanding, it was a look at how and why some computing challenges can be solved by aggregating and analysing patterns in huge amounts of data.
His title is a play on a famous 1960 mathematics paper by Eugene Wigner, “The Unreasonable Effectiveness of Mathematics in the Natural Sciences”, which observed how basic mathematical formulas rather inexplicably explain everything from the way a snail’s shell spirals to the movement of ripples on a pond.
“In the physical world, mathematical formulas work well to explain things, but in other areas, not so well,” says Norvig. “So I’m asking, if you observe millions and millions of data points, does that help?”
And his answer – as might be expected from someone from a company synonymous with massive data crunching – is “yes”. As an example, he points to machine translation – getting computers to translate text from one language to another.
“There was this idea in the 50s that translating language was just like decoding codes,” he says. Well-known linguistic scholar and philosopher Noam Chomsky countered that belief by suggesting language was too complex for machines to approach translation in such a basic way.
And true enough, machine translation frustrated computer scientists for decades. But cue today’s cheap computing power and storage, and Google’s expertise at working with bottomless amounts of data.
“We said, instead, why not get samples of phrases in different languages and match them up.” A few examples – even a few thousand – are not enough to enable a computer to translate a document, Norvig says. “But get millions and millions, and it works fairly well.”
And that is how Google Translate works – by breaking a document into phrases, searching a vast database of millions of similar or identical phrases to find what one generally means, then going to the next phrase.
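The phrase-matching idea can be sketched in a few lines of Python. This is a toy illustration only, not Google's actual system: the phrase table, its counts and the `translate` function are all invented for the example, and a real service would draw on millions of phrase pairs and far more sophisticated scoring.

```python
# Toy phrase table: source-language phrases mapped to candidate translations,
# each with a count of how often that pairing was seen in (imaginary) sample
# texts. All entries here are hypothetical, invented for illustration.
PHRASE_TABLE = {
    ("good", "morning"): {"bonjour": 9, "bon matin": 1},
    ("good",): {"bon": 5, "bien": 3},
    ("morning",): {"matin": 6},
    ("thank", "you"): {"merci": 12},
    ("my", "friend"): {"mon ami": 7},
}

# Longest phrase (in words) that the table knows about.
MAX_PHRASE_LEN = max(len(phrase) for phrase in PHRASE_TABLE)


def translate(sentence):
    """Translate by greedy longest-phrase lookup, one phrase at a time."""
    words = sentence.lower().split()
    output = []
    i = 0
    while i < len(words):
        # Try the longest possible phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            candidates = PHRASE_TABLE.get(tuple(words[i:i + n]))
            if candidates:
                # Pick the translation seen most often for this phrase.
                output.append(max(candidates, key=candidates.get))
                i += n
                break
        else:
            # No match at any length: pass the word through unchanged.
            output.append(words[i])
            i += 1
    return " ".join(output)


print(translate("Good morning my friend"))
```

With only five entries the table fails on almost any real sentence, which is Norvig's point: the method becomes useful only once the number of examples runs into the millions.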
Image processing is another area that can use the same approach, says Norvig. Rather than teaching a computer what a single table looks like from every possible angle so that it grasps the concept of a table, you offer it millions of images of tables.
In short, he says, “it certainly seems like there’s lots of tasks for which this approach will work. You can approximate human performance if you can get enough examples.”
It’s a curious angle on artificial intelligence, where intelligence goes out the door and is replaced by the ability to shuffle through a universe of data very, very fast.
In some other ways, too, Google is shaping its search offerings by using its honed ability to draw conclusions from lots of data. Google recently announced it was changing its search algorithms to provide “personalised” search.
By basing results for any new search on all of a person’s previous searches on Google, the company hopes to give every user the most relevant set of search results possible.
Norvig notes that one person searching for the word “jaguar” may be interested in big cats, while another may be a car fanatic. Finding ways to guess the context of what its search engine users want is a major challenge and Google thinks personalised search will help.
“You will get things that you want, more of the time,” says Norvig.
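The “jaguar” example can be made concrete with a small Python sketch. Everything in it is an assumption made for illustration: the result lists, topic labels, scores and the `personalised_results` function are invented, and nothing here describes how Google actually ranks pages. It shows only the general idea of nudging results towards topics a person has searched for before.

```python
from collections import Counter

# Hypothetical search results, each with an invented topic label and a
# baseline relevance score that ignores who is searching.
RESULTS = {
    "jaguar": [
        {"title": "Jaguar cars: models and prices", "topic": "cars", "base_score": 1.0},
        {"title": "Jaguar, the big cat: habitat and diet", "topic": "wildlife", "base_score": 0.9},
    ],
}

# Hypothetical mapping from earlier queries to the topic they suggest.
QUERY_TOPICS = {
    "used car prices": "cars",
    "sports car reviews": "cars",
    "big cat conservation": "wildlife",
    "safari holidays": "wildlife",
}


def personalised_results(query, history, boost=0.5):
    """Re-rank results for `query` using topics from the user's past searches."""
    topic_counts = Counter(QUERY_TOPICS[q] for q in history if q in QUERY_TOPICS)
    total = sum(topic_counts.values())

    def score(result):
        # Share of past searches on this result's topic (0 with no history).
        share = topic_counts[result["topic"]] / total if total else 0.0
        return result["base_score"] + boost * share

    return sorted(RESULTS.get(query, []), key=score, reverse=True)


cat_lover = ["big cat conservation", "safari holidays"]
car_fan = ["used car prices", "sports car reviews"]
print(personalised_results("jaguar", cat_lover)[0]["title"])
print(personalised_results("jaguar", car_fan)[0]["title"])
```

With no search history the function falls back to the baseline order, so in this sketch the personal element only ever adjusts the ranking and never removes a result.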
Not everyone agrees – some critics feel such an approach limits results and can give a too-personal view of where sites stand in Google rankings. Using someone else’s computer might also skew returns. But overall Google feels this tweak will help individual searchers get a better search experience.
Google has to continually re-evaluate and rethink what search means and take account of new technology developments, he says. Among the latest is the possibility of incorporating increasingly detailed location-based searches, especially on web browsers on mobile handsets that can pinpoint a user’s exact locale. Location-based search, however, introduces more of the privacy questions that dog a company with a business model based around using and analysing user data.
But Norvig feels users will find some trade-off of information for useful services will be acceptable, and that users themselves will decide what they feel comfortable with.
Another challenge for the company is incorporating the increasing flow of information coming from the various elements of the “social web” – weblogs, Twitter, videos, profile sites and “content farms” such as Ask.com that aggregate reams of content, often of mediocre quality.
Some Google users have advocated separating out blogs and other social web results from traditional webpages. But Norvig thinks Google has the best chance of providing a good array of responses to a search by offering slices of social networks as well. “Our feeling is we’d rather have you go to one spot,” he says. “It’s too hard to know what you want.”
Many searchers also don’t know they want, say, video results until those results appear in the search response and a video turns out to provide the information they were after, he says.
And he’s reassuring on one point. Even if the company is trying to give the most relevant results possible, the sheer breadth of what’s out there on the web means serendipity will still be a key element of search.
“All the curious stuff will still be there,” he says.
Gettysburg presentation: norvig.com/Gettysburg/
Palindromes: norvig.com/palindrome.html