Desperately seeking specialised searches

There are about 400 million pages on the Internet. Over the next three years that is expected to rise to eight billion. If you think finding what you are looking for is hard today, imagine what it is going to be like by 2002.

Searching for information remains the most common activity on the Internet despite the poor quality of results produced by most search engines. The basic technology of keyword searching - looking for documents in which a particular word appears - is simply not good enough, leaving the major search engines floundering.
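To see the limitation, consider a minimal sketch of keyword matching, written here in Python with invented documents: a search for "car" simply cannot find a page that only ever says "automobile".

```python
# A minimal sketch of keyword searching: a document matches a query only
# if it contains the query word itself, so synonyms are invisible to it.

documents = {
    1: "review of a new car from Toyota",
    2: "automobile prices fell again this year",
    3: "the economics of the car market",
}

# Build an inverted index: word -> set of documents containing that word.
index = {}
for doc_id, text in documents.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(doc_id)

def keyword_search(word):
    return sorted(index.get(word.lower(), set()))

print(keyword_search("car"))         # [1, 3] - misses document 2
print(keyword_search("automobile"))  # [2]    - misses documents 1 and 3
```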

There is hope, however. In particular, Inktomi, which provides the search engines for leading sites such as Yahoo, America Online and MSN.com, has just announced a step forward in search techniques called "concept induction". The aim is to give search engines something approximating to a human understanding of documents.

Inktomi has adapted the technique for a new product designed to create an instant directory. If you tell the machine the type of document you think should be included under a Web directory heading, it will go out and find documents on the same subject.

Ideas like concept induction are desperately needed. Not only are search engines bad at finding what you want, they are also slow at doing it. Users want quick results but this becomes more and more difficult as the Web gets bigger.

One response from search engines has been to sacrifice coverage for the sake of speed. By including a smaller and smaller percentage of the Web in search databases, they can continue to produce fast results but may miss useful information.

At Montana State University in the US, Greg Notess monitors the performance of search engines and has found that many are keeping their databases static. With some, such as Lycos, the size of the search database is actually falling.

The crisis facing search engines is compounded by a technical limitation. Search engines read and record standard pages on the Web, but more and more information is delivered by giving Web access to material held in databases. To take one example, commercial filings to the US Securities and Exchange Commission are all available through the Internet, but this information is not presented in a form recognised by search engines and is consequently missed.

Some companies are taking imaginative approaches to handling this astonishing quantity of data. "There is something of a renaissance in search technology because of the commercial opportunities," says Paul Gauthier, chief technology officer at Inktomi, which says it will unveil further new search techniques over coming months.

More specialised search services are also being developed. Northern Light Technologies, for example, targets business users by maintaining one of the most comprehensive search databases and supplementing it with other online business information sources.

The key to producing more intelligent search engines is to recognise what a document is about, rather than simply which words appear in it. One important step in that direction will be the increasing use of XML, or extensible mark-up language, which allows any piece of information on the Internet to carry a descriptive tag saying, for example, "this is a document about the price of a Toyota Corolla". Such tags should help search engines distinguish between a document that discusses the economics of the car market and a guide to buying a new car.
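As a hypothetical illustration of the idea, the tag names below are invented rather than drawn from any published standard, but they show how a search engine could read such descriptive mark-up:

```python
# A hypothetical example of descriptive XML tagging. The tag names are
# invented for illustration; they are not a published standard.
import xml.etree.ElementTree as ET

tagged_document = """<document>
  <subject>car prices</subject>
  <item make="Toyota" model="Corolla"/>
  <body>A guide to buying a new car at the right price.</body>
</document>"""

root = ET.fromstring(tagged_document)
# The subject tag tells the engine what the page is about, regardless
# of which words happen to appear in the body text.
print(root.findtext("subject"))             # car prices
item = root.find("item")
print(item.get("make"), item.get("model"))  # Toyota Corolla
```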

Usage patterns could also be of enormous value. Simple popularity rankings are already being built into search engines but there is much more that can be done, such as analysing whether people who searched under the word "car" spent more time at certain pages than people who searched under "automobile".
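A toy sketch of that kind of analysis, using an invented click log, might look like this:

```python
# A toy illustration of mining usage patterns: did people who searched
# for "car" stay longer on a given page than people who searched for
# "automobile"? The click log below is invented.
from collections import defaultdict

# Each record: (query term, page visited, seconds spent on the page).
log = [
    ("car", "/buying-guide", 210),
    ("car", "/market-economics", 15),
    ("automobile", "/market-economics", 340),
    ("automobile", "/buying-guide", 30),
    ("car", "/buying-guide", 180),
]

dwell = defaultdict(list)
for query, page, seconds in log:
    dwell[(query, page)].append(seconds)

for (query, page), times in sorted(dwell.items()):
    print(f"{query:10s} {page:20s} average {sum(times) / len(times):.0f}s")
```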

The next, more ambitious, step is Inktomi's concept induction idea: to create what Gauthier calls a "data-structure to take in the whole Web". He wants to break information down into its component parts, arguing that concepts can be reduced to a limited number of them. He believes that human knowledge, or at least the part of it represented on the Web, can be broken down and modelled with just a few thousand basic components.

If a search engine can identify these components, it will have a relatively sophisticated appreciation of the nature of different types of information, one at least reasonably close to the way people categorise and use it.
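Inktomi has not published how concept induction works, but the general idea of reducing documents to a small set of components can be sketched crudely, with hand-built component vocabularies standing in for whatever the real system learns:

```python
# A toy sketch of reducing documents to a small set of components. This
# is NOT Inktomi's unpublished method; the hand-built vocabularies below
# merely stand in for whatever the real system derives automatically.

COMPONENTS = {
    "vehicle": {"car", "automobile", "corolla", "toyota"},
    "commerce": {"price", "prices", "buying", "market"},
    "economics": {"economics", "market", "demand"},
}

def components_of(text):
    """Return the set of components whose vocabulary the text touches."""
    words = set(text.lower().split())
    return {name for name, vocab in COMPONENTS.items() if words & vocab}

doc_a = "the economics of the car market"
doc_b = "a guide to buying a new automobile at a good price"

print(sorted(components_of(doc_a)))  # ['commerce', 'economics', 'vehicle']
print(sorted(components_of(doc_b)))  # ['commerce', 'vehicle']
```

Two documents that share no words at all can nonetheless be recognised as covering the same ground, which is exactly what keyword matching cannot do.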

The final step is to create an interface between the computer and the user that allows the computer to understand what the user is looking for.

The long-term goal is a natural-language interface that allows users to ask a question in plain English and have a computer understand it. Microsoft is working hard in this area, with some success, but online attempts such as AskJeeves (www.askjeeves.com) remain rudimentary. The alternative is to program the search engine to interrogate the user about exactly what he or she is looking for.

If computers can be made to appreciate the different types of questions a user might be asking about a car, this becomes possible.
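A crude sketch of such interrogation, with invented categories and refinements, might run as follows:

```python
# A crude sketch of a search engine interrogating the user: rather than
# guessing, it asks which kind of "car" question is meant and refines
# the query accordingly. The categories and refinements are invented.

CAR_INTENTS = {
    "1": ("buying a car", "car buying guide prices"),
    "2": ("the car market", "car market economics"),
    "3": ("fixing a car", "car repair manual"),
}

def interrogate(query):
    print(f'Your search for "{query}" could mean several things:')
    for key, (label, _) in sorted(CAR_INTENTS.items()):
        print(f"  {key}. {label}")
    choice = input("Which did you mean? ")
    _, refined = CAR_INTENTS.get(choice, ("", query))
    return refined

# interrogate("car")  # prompts the user, then returns a refined query
```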

These new approaches to searching are still in development but they should start to feed through into improved products in the near future. And it cannot be too soon for users exasperated by endless futile searches.