Cache me if you can

Speed, or the lack of it, consistently ranks near the top when people are asked why they don't use the Internet more. This is no surprise when many users pay by time for phone links to the Net and it feels as though they spend longer downloading pages than reading them.

"Caching" is the main tool Web browsers (such as Netscape Communicator or Microsoft Internet Explorer) use to speed up surfing. In real life, a cache is a hiding place where squirrels store their nuts, or paramilitary groups their weapons. A browser's cache is similar. After downloading a Web page, the browser stores the page in a special location on the hard disk. This cache makes surfing faster, because returning to a cached page involves just loading a file from the hard disk.

Browser caches are only the beginning of the story. Many users connect to the Internet through "proxy servers", specialised computers that manage Internet access for large numbers of users. For example, an organisation concerned about security might set up a firewall - an electronic barrier surrounding its computers - to block unauthorised network traffic. To allow Web access, the organisation uses a proxy server, a designated computer that is allowed to cross the firewall. Browsers within the organisation are set up to connect to the Internet via this proxy server.

Just as a browser cache speeds surfing for a single user, a proxy server cache helps the entire organisation. If Jack downloads www.hill.com today and Jill fetches it tomorrow, Jill gets it faster, because the proxy cached the page after retrieving it for Jack - instead of having to retrieve it from the Internet, it just reads from disk. But Jack and Jill's organisation is only one of many served by their Internet Service Provider (ISP) - further out into the network, the ISP might set up its own caching proxy server. Extending this idea, some Internet researchers have proposed "co-operative caches" - pools of caches that periodically swap pages, or route requests to one another. The idea is to distribute the work across several proxy servers.

We have seen several possible caching points: an individual user's browser, the user's organisation, the organisation's ISP, or several ISPs sharing a pool of caches. At each point, more users are served by each cache. The more users, the greater the chance that two will request the same page, and so the more beneficial the cache becomes.

That's the theory anyway. And in practice? Many organisations with proxy servers find that nearly half the pages requested are cache "hits" - pages that do not require downloading because they are already stored in the cache.

Mathematically minded Internet researchers have estimated that the number of cache hits grows logarithmically with the number of users. This means that each additional user contributes more hits, but the contribution shrinks as the user group grows. If the number of users increases from 50 to 100, the number of hits rises by about 20 per cent; a further 50 users increases the hits by only about 10 per cent.
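The figures quoted above can be checked with a little arithmetic. The sketch below assumes, purely for illustration, that the hit count grows like the natural logarithm of the number of users; the results come out close to the rounded percentages in the text.

```python
import math

def relative_hit_increase(old_users, new_users):
    """Percentage growth in hits, assuming hits grow like log(users)."""
    return 100 * (math.log(new_users) - math.log(old_users)) / math.log(old_users)

round(relative_hit_increase(50, 100))    # roughly 18 per cent
round(relative_hit_increase(100, 150))   # roughly 9 per cent
```

Doubling the user base from 50 to 100 adds almost twice the benefit of the next 50 users - the essence of diminishing returns.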

Unfortunately, more sophisticated caches incur greater overhead, which eats into the total benefit. Even though up to half of all requests are cache hits, the saving in download time is only about one quarter - noticeable, but hardly earth-shaking. Cache overhead consumes some of the difference, but it also turns out that most hits are for small pages - the very pages that benefit least from caching, since they download quickly anyway.
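A hypothetical request mix shows how a cache can answer half of all requests yet save only a quarter of the download time. The page sizes below are invented for illustration, and download time is assumed to be proportional to page size.

```python
# Each entry is (page size in kilobytes, whether the request was a cache hit).
# Hits cluster on the small pages; the big pages mostly miss.
requests = [
    (5, True), (5, True), (5, True), (10, True),
    (20, False), (25, False), (15, False), (15, False),
]

hit_rate = sum(1 for _, hit in requests if hit) / len(requests)
time_saved = (sum(size for size, hit in requests if hit)
              / sum(size for size, _ in requests))

print(hit_rate)    # 0.5  - half the requests are hits...
print(time_saved)  # 0.25 - ...but only a quarter of the download time is saved
```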

The second technique used to improve browsing is "replication". As a simple example, some busy sites list alternative "mirror" sites around the globe. These mirrors are copies of the original; by choosing a nearby mirror, the user downloads pages much faster.

Replication promises greater performance improvements than caching, but several problems must be overcome. How many mirrors are sufficient, and where should they be located? When the user wants to fetch a page, how should the request be routed to the nearest mirror? Unfortunately, the easiest way is to have the user manually select a mirror.

Ideally, sites could select the appropriate mirror automatically, and various schemes have been developed to this end. For example, when you request a page from www.ibm.com, your browser first asks network name-servers to translate this name into the numeric address needed to process the request. The process is similar to looking up someone's phone number. If a user in Asia asks for a translation of www.ibm.com, IBM could arrange for the translation to be a server in Japan, while requests from Peru could be directed to IBM's Brazilian server.
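The name-translation trick can be sketched as a simple lookup. Everything here is invented for illustration - the region names, the addresses, and the resolver itself; a real name-server is far more elaborate and would also take the hostname into account.

```python
# Hypothetical table: which mirror address to hand out for each client region.
mirrors = {
    "asia": "203.0.113.10",            # e.g. a server in Japan
    "south-america": "198.51.100.20",  # e.g. a server in Brazil
    "default": "192.0.2.1",            # everyone else
}

def resolve(hostname, client_region):
    """Translate a name to an address, steering the client to a nearby mirror.
    (The hostname is ignored here; a real resolver would use it too.)"""
    return mirrors.get(client_region, mirrors["default"])

resolve("www.ibm.com", "asia")    # the Asian user is sent to the Japanese mirror
resolve("www.ibm.com", "europe")  # unknown regions fall back to the default
```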

However, both replication and caching suffer from a serious problem. Suppose the original page changes - how does the cache or mirror realise that its copy is out of date? An online newspaper, for example, might adjust its headlines every few hours, but if old copies are not discarded by caches and mirrors, the user will get yesterday's headlines.

The Internet infrastructure provides a simple mechanism to solve this problem. When a page is downloaded, the server can include an expiration date to help caches decide when to discard the page. Unfortunately, few servers bother to include this information, so caches just have to guess.
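The decision a cache faces can be sketched as follows. The one-hour guess and the function itself are invented for illustration; real proxies read the expiration date from the page's accompanying headers and use more sophisticated guessing rules.

```python
from datetime import datetime, timedelta

def is_fresh(cached_at, expires, now):
    """A cached copy is usable only before its expiration date.
    If the server supplied no date, the cache must guess - here,
    an arbitrary one hour after the page was stored."""
    if expires is None:
        expires = cached_at + timedelta(hours=1)
    return now < expires

noon = datetime(1998, 6, 1, 12, 0)
# Server said "valid for six hours": three hours later, still fresh.
is_fresh(cached_at=noon, expires=noon + timedelta(hours=6), now=noon + timedelta(hours=3))
# Server said nothing: the one-hour guess has expired, so re-download.
is_fresh(cached_at=noon, expires=None, now=noon + timedelta(hours=3))
```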

There is no easy answer. Caches and mirrors already improve browsing speeds substantially, but researchers continue to seek ways to combine these and other techniques to further improve Internet responsiveness.

Info: www.cs.ucd.ie/staff/nick/itr

Nicholas Kushmerick: nick@ucd.ie