Wiki Cache Problem


The Problem

Web browsers have caches, in which they keep pages after loading them, so that if a page is revisited, it can be loaded from the (local and fast) cache rather than from the (remote and slow) web server. This is a perfect system if pages never change. If pages do change, there is a problem: the page might change after a copy has been cached somewhere, so the user is presented with an old version. Such an old version is called stale, and an up-to-date version is called fresh. The heart of caching is the mechanism for determining if a cached copy is fresh or stale.

The normal browser behaviour is to check that the cached copy is fresh once per session (ie run of the browser). Thus, if pages change seldom, as they do on traditional static websites, this system still works very well - the normal browser behaviour will smoothly pick up occasional changes (eg if you surf the web in the evening, then quit your web browser, then surf again in the morning, any overnight changes will be picked up).

However, it works very poorly for fast-changing content like a wiki, where important changes can and do happen on a timescale of minutes, within a single browser session. If the pages aren't cached at all (as used to be the case with Twic I), then everything works correctly but slowly: the poor server is continually hammered with requests (especially bad in a wiki, where people revisit pages very frequently), leading to a heavy server load (and cross urchin Sys Admins chasing Tom Anderson with LARTs in hand) and sluggish response times for users. On the other hand, if caching is used, there is a serious problem with staleness: checking once per session simply isn't often enough for a wiki.

A more subtle aspect of the problem is that the HTML rendering of a page may change even when the Wiki Text behind it hasn't, because of changes in the state of linked pages or in the Twic I engine, so the determination of staleness may actually be incorrect.

Technical determinants of caching

The key to caching is the Last Modified Header. 'Headers' are bits of supplementary data sent by web servers along with the pages and things that they serve, which tell browsers things like what kind of data is being sent, how big it is, what kind of server is serving it, etc. The Last Modified Header simply says when the page was last modified. A browser can easily use this to figure out if a cached copy is fresh, by comparing its recorded last-modified date to that of the server's copy (either by asking for just the headers of the object, not the actual data, or by sending a special form of request which basically says "send me the data if it has been modified after such-and-such a point in time"). Pages without a Last Modified Header are never cached; pages with one and no other cache-related headers are always cached.
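
For illustration, here is that freshness check sketched in Python rather than in a real browser; the URL is just this page's, and the sketch assumes the server actually sends a Last Modified Header:

    import urllib.error
    import urllib.request

    url = "http://urchin.earth.li/cgi-bin/twic/wiki/view.pl?page=WikiCacheProblem"

    # First visit: fetch the page and remember when the server says it was last modified.
    with urllib.request.urlopen(url) as response:
        cached_body = response.read()
        last_modified = response.headers.get("Last-Modified")

    # Later revisit: "send me the data if it has been modified after such-and-such
    # a point in time", expressed as an If-Modified-Since header.
    if last_modified is not None:  # no Last Modified Header means no caching anyway
        revisit = urllib.request.Request(url, headers={"If-Modified-Since": last_modified})
        try:
            with urllib.request.urlopen(revisit) as response:
                cached_body = response.read()  # 200 OK: our copy was stale, take the new one
        except urllib.error.HTTPError as error:
            if error.code != 304:              # 304 Not Modified: our copy is still fresh
                raise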

Say more about extra headers, like expires, cache-control, etc.

Configuring and cajoling your browser

The simplest trick to know is simply to hit 'reload' when you go to a page which you suspect may have changed since you last saw it. This will usually do a freshness check and load a fresh page if the cache is stale. If you don't have a reload button, try poking about in any menus you do have ('view' is a prime suspect) or hitting F5.

To avoid having to do this by hand, it may be possible to configure your browser to do a freshness check every time it visits a page; this will be slower than checking once per session (it still involves sending a message to the server), but will still have the benefit of caching the actual page content. In Netscape Navigator, this can be done by going to Edit > Preferences > Advanced > Cache and setting 'Page in cache is compared to page on network' to 'every time'.

The more subtle rendering-related problem can probably be overcome by holding down shift while hitting reload, which ought to bypass the freshness check and just load a copy straight from the server.

A more radical solution to the rendering problem is to do a null edit on the page; go to 'edit', then just hit 'submit' - the page will be updated without any actual change being made. However, the modification date will be changed, so a normal reload should pick up the fresh version.

If, like Thea Logie, you find yourself trapped on a locked-down computer where you can't reconfigure the web browser or even hit reload, then you are stuffed. The management suggests you log off and read a book instead. No worries - they fixed that problem. Thank goodness --TL

Solutions

A possible, if loopy, technical solution would be to make the hyperlinks to pages look something like <http://urchin.earth.li/cgi-bin/twic/wiki/view.pl?page=WikiCacheProblem&version=X>, where X corresponds to the current version number (or timestamp or something) of the page. Thus, a link to a page would always be a link to the latest version (or at least, what was the latest version at the time the page was rendered), so browsers wouldn't use a stale cached version. Of course, if the page hadn't changed since it was last visited, then the browser would recognise the URL and correctly use a cached copy. This would require a bit of code in various places, and would mean examining every linked-to file to determine its version number or timestamp, but that probably wouldn't add much runtime cost, since the directory entries are being loaded to determine file existence anyway. This solution would be a natural consequence of Wiki History. Twic I could still recognise URLs without version parameters; it would simply emit the latest copy of the page, probably along with a Pragma No Cache directive.
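
For concreteness, here is a sketch of how such links might be built, with a file's modification time standing in for the version number; the page-store path, the function and the exact URL shape are made up for the example rather than being Twic I's actual code:

    import os
    import urllib.parse

    WIKI_DIR = "/path/to/wiki/pages"  # hypothetical page store
    BASE_URL = "http://urchin.earth.li/cgi-bin/twic/wiki/view.pl"

    def versioned_link(page_name):
        """Build a link whose 'version' parameter changes whenever the page does."""
        path = os.path.join(WIKI_DIR, page_name)
        try:
            version = int(os.stat(path).st_mtime)  # the timestamp doubles as a version number
        except FileNotFoundError:
            return None  # dangling link: would be rendered as a 'create this page' link instead
        query = urllib.parse.urlencode({"page": page_name, "version": version})
        return BASE_URL + "?" + query

    # versioned_link("WikiCacheProblem")
    # -> .../view.pl?page=WikiCacheProblem&version=1071747271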

There may also be much simpler solutions or pseudo-solutions based on more sophisticated use of the HTTP cache control headers, like Expires and stuff. E Tags and the Cache Control Header look particularly promising, although they may also be particularly frustrating. Also, it's not clear how widely supported they are.

A technical solution to the subtle rendering-related problem would be to emit a Last Modified Header for the rendering, based on the modification times of the page text, the states of the linked pages, and the wiki engine. Yuck.
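
A sketch of what that computation might look like (the paths, and the idea that the engine knows its own list of linked pages, are assumptions):

    import os
    from email.utils import formatdate

    def rendering_last_modified(page_path, linked_page_paths, engine_path):
        """Newest of: the page text, every page it links to, and the wiki engine itself."""
        mtimes = [os.stat(page_path).st_mtime, os.stat(engine_path).st_mtime]
        for linked in linked_page_paths:
            if os.path.exists(linked):  # pages that don't exist yet contribute nothing
                mtimes.append(os.stat(linked).st_mtime)
        # HTTP wants RFC 1123 dates, eg 'Thu, 18 Dec 2003 11:34:31 GMT'
        return formatdate(max(mtimes), usegmt=True)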

The best solution would be to mark pages with a Last Modified Header, so that they can be cached, but force the client to revalidate them every time they are visited. It looks like it might be possible to do this by marking pages with an expiry date that is in the past (using an Expires or Cache-control: max-age header); this would force a cache-conscious reload. Alternatively, the Cache-control: must-revalidate header may do this. The spec says "An expiration time cannot be used to force a user agent to refresh its display or reload a resource; its semantics apply only to caching mechanisms, and such mechanisms need only check a resource's expiration status when a new request for that resource is initiated."; in our situation, this is a good thing.
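
As a sketch of what that would mean in practice, here are the headers a CGI script might emit; the header names and values follow the HTTP/1.1 spec, but the script itself is invented for the example:

    import sys
    from email.utils import formatdate

    def emit_page(html, page_mtime):
        headers = [
            ("Content-Type", "text/html"),
            ("Last-Modified", formatdate(page_mtime, usegmt=True)),
            # Either of the next two should force a freshness check on every visit:
            ("Expires", formatdate(0, usegmt=True)),  # an expiry date firmly in the past
            ("Cache-Control", "max-age=0, must-revalidate"),
        ]
        for name, value in headers:
            sys.stdout.write(name + ": " + value + "\r\n")
        sys.stdout.write("\r\n")
        sys.stdout.write(html)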

We might use E Tags here; these are to some extent orthogonal to the general problem of staleness, but might be a useful adjunct. We could either generate them ourselves (eg from a timestamp) or use Cgi Buffer. An advantage of E Tags is that, according to the spec, they can be used "to avoid certain paradoxes that might arise from the use of modification dates", which can only be a good thing: it would be a pain if the wiki were to vanish into a temporal black hole.
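
A sketch of the generate-them-ourselves route, deriving the tag from a timestamp and answering If-None-Match by hand; the function names and CGI environment handling are assumptions, and Cgi Buffer would presumably do the equivalent automatically:

    import os

    def make_etag(page_path):
        # The quote marks are part of the syntax: ETag: "1071747271"
        return '"%d"' % int(os.stat(page_path).st_mtime)

    def is_fresh(page_path, environ=os.environ):
        # The browser sends back the tag it holds in an If-None-Match header, which
        # CGI exposes as HTTP_IF_NONE_MATCH; if it matches the current tag, the
        # script can answer 304 Not Modified instead of resending the page.
        client_tag = environ.get("HTTP_IF_NONE_MATCH")
        return client_tag is not None and client_tag == make_etag(page_path)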

The philosophical framework for cache control in HTTP puts everything in terms of 'validators'; a validator is some piece of information specific to a copy of a page which can be used to determine if two copies are the same, or rather if one is fresher than the other. The classic validator is the Last Modified Header, but E Tags are also validators. The spec distinguishes between two classes of validators: strong, which can be used to detect any change whatsoever, and weak, which can only be used to detect semantically significant changes. This distinction actually goes to the heart of the subtle secondary Wiki Cache Problem, about freshness of renderings: when the rendering of a page depends on the state of other pages, the last-modified date of the Wiki Text is only a (very) weak validator. E Tags are usually supposed to be strong, but weak E Tags can be made and used (not sure how).
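
For what it's worth, the spec's recipe for a weak E Tag is just a 'W/' prefix on the quoted value, so the two kinds might be produced something like this (the hashing choice is purely illustrative, not anything Twic I does):

    import hashlib
    import os

    def strong_etag(rendered_html):
        # Changes whenever the rendered bytes change at all.
        return '"%s"' % hashlib.md5(rendered_html.encode("utf-8")).hexdigest()

    def weak_etag(wiki_text_path):
        # Based only on the Wiki Text's timestamp: two renderings may differ
        # even though this tag claims they are (semantically) equivalent.
        return 'W/"%d"' % int(os.stat(wiki_text_path).st_mtime)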

A problem here is that some of this is HTTP 1.1 specific, so it might not work on all browsers.

Also, Twic I should generate a Date header on all pages (To Do).
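
Generating one is a one-liner; for instance, using the standard library rather than anything Twic I currently does:

    from email.utils import formatdate

    date_header = "Date: " + formatdate(usegmt=True)
    # eg 'Date: Thu, 18 Dec 2003 11:34:31 GMT'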

Further reading

For information and tests about web caching, see:

Category Wiki Category Geekery

