The new seekers

EVER felt you were a victim of information overload? Such is the World Wide Web's popularity, and its exponential growth since 1994, that it is rapidly becoming a victim of its own success. With well over 34 million Web pages - and even more hypertext links between them - how do you find the information you need?

The many Internet magazines and printed guides are not much help either: they are limited in scope, and out of date as soon as they hit the shops.

Enter the online indexing services - the Internet's equivalent of the friendly librarian. Basically they use the power of computers to sort out and clean up this information jungle which the very same computers have created. Some 200 of these firms, from Alta Vista to Yahoo, have become essential starting points for finding the right information quickly.

Whatever the service, the basic principle is much the same. You type the word you are looking for in the appropriate slot, select various options and press the "submit" button. After a few seconds a list of "hits" (successful matches) is displayed, often with a score indicating how good the match is.
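How a score might be arrived at is something the services keep to themselves, but the idea can be sketched in a few lines of Python. The ranking rule and the page addresses below are purely illustrative, not any particular service's method.

```python
# A minimal sketch of "hits with scores": rank pages by how often the
# query word appears. Real services use far more elaborate ranking;
# the page list here is invented for illustration.
def search(query, pages):
    """Return (url, score) pairs for pages containing the query word."""
    query = query.lower()
    hits = []
    for url, text in pages.items():
        score = text.lower().split().count(query)
        if score > 0:
            hits.append((url, score))
    # Best matches first
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

pages = {
    "http://example.com/a": "Ireland travel notes and more about Ireland",
    "http://example.com/b": "A page about something else entirely",
}
for url, score in search("Ireland", pages):
    print(url, score)
```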

Wall Street is already predicting that the leading indexing services, with their buzzwords about "spiders" and "search engines", are going to be the Internet's Next Big Thing. In an increasingly competitive market, the services stress their differences in technical capability and try to differentiate themselves by the search methods they offer.

The classifiers

Firms like Yahoo use the first main approach: they maintain some editorial control over what they list, classifying Web pages into helpful subcategories and even rating them. As the San Jose Mercury newspaper recently put it, "Yahoo is closest in spirit to the work of Linnaeus, the 18th century botanist whose classification system organised the natural world."

But this method involves dilemmas similar to those of traditional publishers. Services which provide ratings face the agonising decision of how often to review Web sites, because someone who receives a bad rating is often eager to improve (just as a restaurant might clean up its act after a hostile newspaper review). Another dilemma in the ratings game is whether to place more value on the content of Web pages or on the sophistication of their visual design.

The McKinley Group's search engine, Magellan, takes a slightly different tack: it boasts of a "dividing and parsing algorithm" that makes abstracts of paragraphs within Web pages.
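McKinley has not published the details, so the following is only a rough illustration of the general idea - splitting a page into paragraphs and keeping the first sentence of each as a crude abstract - and not Magellan's actual algorithm.

```python
# Illustrative only: split a page into paragraphs and keep the first
# sentence of each as a crude "abstract". This is NOT McKinley's actual
# algorithm, which is not described in detail.
def abstract(page_text):
    summaries = []
    for paragraph in page_text.split("\n\n"):
        paragraph = paragraph.strip()
        if paragraph:
            first_sentence = paragraph.split(". ")[0]
            summaries.append(first_sentence)
    return summaries

sample = ("Ireland has a growing Web presence. Many sites appeared in 1995.\n\n"
          "Search engines index them daily. Coverage is still partial.")
print(abstract(sample))
```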

But the main alternative to the Yahoo-type approach is the "we list everybody" one, where quantity is almost everything and the number-crunching is almost breathless.

Lycos, for example, says it has over 34 million unique URLs (or Web page addresses), and the much talked-about Alta Vista claims "nearly 11 billion words found in nearly 22 million Web pages". Alta Vista also boasts a full-text index of over 13,000 newsgroups, "updated in real time". For the technically minded, it speeds up searches by keeping nearly 6 gigabytes of its 33-gigabyte word index in main memory, thanks to 64-bit addressing on Digital's Alpha family of computers.
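What a "word index" amounts to, in essence, is a table mapping every word to the pages it appears on - what computer scientists call an inverted index. A toy version, with invented page addresses and none of Alta Vista's engineering, might look like this:

```python
from collections import defaultdict

# Toy inverted index: map each word to the set of pages containing it.
# The real thing runs to tens of gigabytes; the page addresses here are
# made up purely to show the principle.
index = defaultdict(set)

def add_page(url, text):
    for word in text.lower().split():
        index[word].add(url)

add_page("http://example.com/eire", "news from ireland and beyond")
add_page("http://example.com/tech", "technology news updated in real time")

print(index["news"])     # both pages
print(index["ireland"])  # just the first
```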

Webcrawlers

Given that the Web is expanding so rapidly, any index put together manually is bound to be limited and out of date. So the "we list everybody" indexers rely on a crude form of artificial life called Web crawlers. These software robots (or "bots") are really just search programs which are sent off into the Net to carry out an automated census of its contents. They tirelessly scamper along every link, soaking up the info as they go and sending it back to their home bases.

The crawlers have varying degrees of sophistication. Some are instructed to wander down each link as they meet it (and told how deeply to follow it), while others simply store the link and move on. Since different crawlers have different search algorithms, each type comes back with a different - and still partial - map of the Web world.
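In outline, one of the simpler kinds of crawler might work along these lines. This is only a sketch: the link-extraction step is a stand-in, since a real bot would fetch each page over the network and parse out its hyperlinks.

```python
from collections import deque

def get_links(url):
    # Placeholder: a real crawler would download the page over HTTP and
    # extract the hyperlinks it contains.
    return []

def crawl(start_url, max_depth=2):
    """Breadth-first census of pages, following links to a limited depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    census = []
    while queue:
        url, depth = queue.popleft()
        census.append(url)          # "send back" the page to home base
        if depth < max_depth:
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return census

print(crawl("http://example.com/"))
```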

Even Web crawlers cannot give an instant snapshot of the Web, because the information about the existence of a new page can take hours or even days to filter back to the home database. An allied problem is that a page might be found in the search engine, but then when the user calls it up later it doesn't exist - or doesn't seem to have their search words in it - because the content has been changed since a crawler's last visit.

Obviously, too, most crawlers will respect any "firewall" (security barrier) they encounter and won't examine any pages behind it. Some Web pages are on corporate servers that are not publicly accessible, and any pages that require something more than following a hyperlink (filling out a form, say, or registering, or providing a password) aren't indexed. To complicate matters further, some documents might technically be on the Web (available from a Web server and retrievable through the right URL) but have no hyperlinks pointing to them from the main body of the Web.

And just like ordinary surfers, Web crawlers have to contend with the day-to-day headaches of the Internet. Sometimes they detect the existence of a page because they've found a hyperlink to it, but every time they try to retrieve the page to index it, the connection is broken (it "times out") due, for example, to gridlock in that particular stretch of the infobahn.

To compare the various search engines and their databases, Computimes last week tried a very general query looking for any Web page with the word "Ireland" in it. Alta Vista came out first with what it claimed were "about 100,000 hits" (see table).

If quantity rather than quality were all that counted, it would be a clear victory for the "bots" over their human counterparts.

But what do you do with all those thousands of hits? There's nothing worse than having to trawl through hundreds or even thousands of results, so the careful use of "constraining words" becomes essential.

How to use them

The first essential step in making a specific search engine give meaningful results is to download its help pages. It's well worth spending a few minutes looking at its examples of different types of search request, and the particular grammar and punctuation they involve.

An important trick is to pick the right keywords (e.g. "ireland" and "irish" can give very different results). Next, learn how to make best use of "wild cards" if they are allowed (e.g. typing something on the lines of "technolog*" would find occurrences of both "technology" and "technological").
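The exact wildcard syntax varies from service to service; the principle is simply pattern matching on the stem of a word, as this small sketch (with an invented word list) shows.

```python
import fnmatch

# The wildcard syntax differs between services; the principle is just
# pattern matching. Here "technolog*" matches both spellings of interest.
words = ["technology", "technological", "technique", "ireland"]
print(fnmatch.filter(words, "technolog*"))   # ['technology', 'technological']
```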

Most search engines also support Boolean operations. This form of logic (invented by UCC's first professor of mathematics, George Boole) uses keywords such as AND and NOT to reduce the number of hits to more manageable proportions.

The language can take some getting used to, but it can produce powerful results. For example, giving the instruction to Alta Vista's "advanced query" section to find

link:http://www.irish-times.com/ AND NOT url:http://www.irish-times.com/

will locate any Web pages on its database which have hypertext links to the Irish Times home page but are not themselves on the Irish Times site - last week it found over 220 (as well as a few Irish Times pages which somehow slipped through the sieve). So this is a neat way of finding who else is providing routes for users towards your own pages.
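In terms of plain set logic, an AND NOT query of this kind simply takes one set of pages and subtracts another. The page addresses below are invented; only the principle matters.

```python
# Boolean AND NOT as set arithmetic, with invented page addresses.
# "link:" hits are pages that link to the site; "url:" hits are pages on
# the site itself; AND NOT keeps the first set minus the second.
links_to_site = {"http://example.com/medialinks",
                 "http://www.irish-times.com/about"}
on_the_site = {"http://www.irish-times.com/about"}

print(links_to_site - on_the_site)   # pages elsewhere that point at the site
```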

A handy tip for journalists is to type in their name and some constraining phrases (such as "Irish Times" and "newspaper"), which results in a listing of most places on the Web where their stories have been quoted and attributed.

Lexicographers, too, have a powerful tool for finding out how language is evolving on the Net. Anyone compiling a cyberdictionary could use a search engine's hit rates to find out the current usage of, say, "email" versus "e-mail", or count the variations of "CDROM", "CD Rom", "CD-Rom", etc.

With search engines, the possibilities seem almost endless. Without them, the Net would face an almost impossible future.