‘Secret sauce’ for web image searches

Anyone who has ever gone looking for an image on a search engine such as Bing or Google knows how hit-and-miss such an endeavour can be.

Type in a search term like "Ferrari Formula One" and up will come lots of images of the cars, but even images that come back among the top returns can vary greatly in quality. And along with the car images will be plenty of pictures that seemingly haven't anything to do with cars at all.

That’s because searching for images or video on the internet remains a difficult challenge, reliant on people tagging images to identify their content, or on textual elements of the page that might indicate what an image shows. So search engines don’t do much with images at all.

"One reason why the web is so compelling is that we get the gist of a web page mostly from the images," says Lorenzo Torresani, assistant professor of computer science and head of the Visual Learning Group at Dartmouth College in the United States. "We look at the title, but if we have a few pictures at the top, that's what we look at.

“Yet it’s remarkable that search engines do the opposite. They actually ignore, they strip away, that kind of information.”

Torresani thinks he has a better solution, which addresses one of the key problems preventing more productive searches: the fact that popular search engines base results on an analysis of the text on a page alone. Information about digital images is encoded in a different language of visual descriptors that mean little to a text-based search engine.

Thanks to huge advances in image recognition in the past few years, Torresani says, computer programs can now automatically deduce high-level information about an image, giving strong clues about its actual contents, such as whether a given pixel is likely to represent a steel, glass or grass surface. But search engines don’t read that information.

People working on text-based information retrieval tend to think the problem is simply too difficult to resolve, as searches would slow significantly if, say, Google were to try to evaluate billions of individual images along with text-based documents.

Cracking the problem
In Dublin this week to present his group's work at the 36th annual SIGIR (Special Interest Group on Information Retrieval) conference, Torresani set out to see whether there might be a way that the contents of an image could also be understood by search engines, improving search results for both documents and images.

In a collaboration between Microsoft’s UK research lab in Cambridge, where he worked for two years, and Dartmouth, where he has been based for half a decade, Torresani and his team think they have cracked the problem.

They have developed an approach that effectively creates a translator between the “languages” of text and image, bringing together the worlds of information retrieval and image recognition.

The first element – his “secret sauce”, he says – is that they have devised a way of swiftly analysing a picture at pixel level to obtain very detailed visual descriptors, and then have that data packaged up in very compact form so that it can be quickly understood by a computer.

They then use “a simple trick but it works amazingly well”. Their visual analyser looks at the images that come back with the page returns of a normal image search on a text-based query to Google or Bing. From the pictures associated with the top site returns, it learns which images are most likely to be relevant.

It can then recognise other relevant images with a high degree of accuracy. “Our system does this in the background so the user doesn’t even notice it – it takes say the top 100 images and it uses them as training examples to learn on the fly a visual model of how Ferrari Formula One cars look, and it does it super efficiently, in a 10th of a second.”
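For readers who want a feel for the on-the-fly learning step, here is a minimal sketch in Python. It is not the team's actual method, which the article does not detail. It assumes each image has already been reduced to a short numeric descriptor, treats the top-ranked images as positive examples, and contrasts them with a fixed "background" set of unrelated images. The background set, the function names and the difference-of-means linear model are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Model = Tuple[List[float], float]  # (weights, bias) of a linear scorer


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of equally sized descriptors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train_query_model(positives: List[Vector], background: List[Vector]) -> Model:
    """Fit a difference-of-means linear model (an assumed stand-in classifier).

    positives:  descriptors of the top images returned for the query
    background: descriptors of unrelated images (hypothetical fixed set)
    """
    pos_mean = mean_vector(positives)
    neg_mean = mean_vector(background)
    weights = [p - q for p, q in zip(pos_mean, neg_mean)]
    # Place the decision boundary halfway between the two class means.
    midpoint = [(p + q) / 2 for p, q in zip(pos_mean, neg_mean)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias


def image_score(model: Model, descriptor: Vector) -> float:
    """Positive scores suggest the image resembles the query's top results."""
    weights, bias = model
    return sum(w * x for w, x in zip(weights, descriptor)) + bias


# Toy 3-dimensional descriptors, invented purely for illustration.
top_images = [[0.9, 0.1, 0.8], [0.8, 0.2, 0.9], [1.0, 0.0, 0.7]]
unrelated = [[0.1, 0.9, 0.2], [0.2, 0.8, 0.1], [0.0, 1.0, 0.3]]
model = train_query_model(top_images, unrelated)
```

A model this simple trains in a single pass over the examples, which is one way a per-query model could be learned within a fraction of a second.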

Then, their search system reranks the pages returned in a regular search on Google or Bing. “Rather than searching for all the text, all the images on the web for every single query, which would be prohibitive, we say, look, traditional web search engines are already quite accurate so let’s leverage on that.”

The final results blend the accuracy of a Google or Bing text query with the ability of his team’s system to pinpoint sites with the very best images.
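The reranking idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the team's formula: the article says only that the two signals are blended. Here each page carries its position in the text engine's ranking plus a visual relevance score between 0 and 1 for each of its images, and the `alpha` weight, the `Page` structure and the example URLs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    url: str
    text_rank: int  # 1 = top result from the text-based engine
    image_scores: List[float] = field(default_factory=list)  # each in [0, 1]


def rerank(pages: List[Page], alpha: float = 0.5) -> List[Page]:
    """Reorder already-retrieved pages by a blend of text and image evidence.

    alpha = 1.0 reproduces the text engine's order; lower values give
    more weight to the best-scoring image on each page.
    """
    n = len(pages)

    def blended(page: Page) -> float:
        text_score = 1.0 - (page.text_rank - 1) / n  # top rank maps to 1.0
        visual_score = max(page.image_scores, default=0.0)
        return alpha * text_score + (1 - alpha) * visual_score

    return sorted(pages, key=blended, reverse=True)


# Hypothetical results for one query.
results = [
    Page("a.example", text_rank=1, image_scores=[0.2]),
    Page("b.example", text_rank=2, image_scores=[0.9, 0.4]),
    Page("c.example", text_rank=3),
]
reranked = rerank(results)
```

Only pages the text engine has already returned are ever scored, so the image analysis runs over a short list of candidates for each query.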

"I am only looking at documents that are already ranked high by good, accurate document search engines, so I already have removed all the [potential] silly mistakes and now am also working with this image recognition system to boost accuracy."

Business hopes
Torresani hopes the technique might become the basis for a company. He already has a track record as an entrepreneur – as a graduate student at Stanford University he and several other graduates set up Like.com, a website that also coupled image recognition with search to enable people to find high-end products such as jewellery, shoes and handbags online. The company was sold to Google in 2010.

He would like to stay involved for a couple of years but says he prefers having one foot in industry and one in research and academia.

“I have a love and hate relationship with industry. I like the fact that it gives you very nice concrete problems to work on, problems particularly where you can make a big impact. But at the same time I have also felt a bit constrained by the timelines and the priorities a company has. So I have found the best for me is to work in research, but to spend time in industry.”