= The Problem

Web browsers have caches, in which they keep pages after loading them, so that if a page is revisited, it can be loaded from the (local and fast) cache rather than from the (remote and slow) web server. This is a perfect system if pages never change. If pages do change, there is a problem: the page might change after a copy has been cached somewhere, so the user is presented with an old version. Such an old version is called _stale_, and an up-to-date version is called _fresh_. The heart of caching is the mechanism for determining whether a cached copy is fresh or stale.

The normal browser behaviour is to check that the cached copy is fresh once per session (ie run of the browser). Thus, if pages change seldom, as they do on traditional static websites, this system still works very well - the normal browser behaviour will smoothly pick up occasional changes (eg if you surf the web in the evening, then quit your web browser, then surf again in the morning, any overnight changes will be picked up). However, it works very poorly for fast-changing content like a wiki, where important changes can and do happen on a timescale of minutes, within a single browser session.

If the pages aren't cached at all (as used to be the case with TwicI), then everything works _correctly_ but _slowly_: the poor server is continually hammered with requests (especially bad in a wiki, where people revisit pages very frequently), leading to a heavy server load (and cross urchin SysAdmin__s chasing TomAnderson with LART__s in hand) and sluggish response times for users. On the other hand, if caching is used, there is a serious problem with staleness: checking once per session simply isn't often enough for a wiki.

A more subtle aspect of the problem is that the HTML rendering of a page may change even when the WikiText behind it hasn't, because of changes in the state of linked pages or in the TwicI engine, so the determination of staleness may actually be incorrect.

= Technical determinants of caching

The key to caching is the LastModifiedHeader. 'Headers' are bits of supplementary data sent by webservers along with the pages and things that they serve, which tell browsers things like what kind of data is being sent, how big it is, what kind of server is serving it, etc. The LastModifiedHeader simply says when the page was last modified. A browser can easily use this to figure out if a cached copy is fresh, by comparing its recorded last-modified date to that of the server's copy (either by asking for just the headers of the object, not the actual data, or by sending a special form of request which basically says "send me the data if it has been modified after such-and-such a point in time"). Pages without LastModifiedHeader__s are never cached; those with them and nothing more are always cached.

_Say more about extra headers, like Expires, Cache-Control, etc._

= Configuring and cajoling your browser

The simplest trick to know is simply to hit 'reload' when you go to a page which you suspect may have changed since you last saw it. This will usually do a freshness check and load a fresh page if the cache is stale. If you don't have a reload button, try poking about in any menus you do have ('view' is a prime suspect) or hitting F5.
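Hitting reload triggers exactly the freshness check described under 'Technical determinants of caching' above. As a rough illustration - plain Python rather than anything TwicI actually contains, with a made-up URL and date - this is what that conditional request looks like from the client's side: send the last-modified date of the cached copy, and either get a fresh page back or a terse 'not modified'.

  # Sketch of the browser's freshness check: the client remembers the
  # Last-Modified date of its cached copy and asks the server to send
  # the page only if it has changed since then. The URL and date below
  # are placeholders, not a real TwicI page.

  import urllib.request
  import urllib.error

  URL = "http://example.org/wiki/SomePage"                 # hypothetical page
  CACHED_LAST_MODIFIED = "Tue, 15 Nov 1994 12:45:26 GMT"   # date stored with the cached copy

  request = urllib.request.Request(URL)
  # Conditional request: "send me the data if it has been modified
  # after such-and-such a point in time".
  request.add_header("If-Modified-Since", CACHED_LAST_MODIFIED)

  try:
      with urllib.request.urlopen(request) as response:
          # 200 OK: the cached copy was stale; the body is the fresh version.
          body = response.read()
          print("stale - got a fresh copy,", len(body), "bytes")
  except urllib.error.HTTPError as e:
      if e.code == 304:
          # 304 Not Modified: the cached copy is still fresh, reuse it.
          print("fresh - use the cached copy")
      else:
          raise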
To avoid this manual reloading, it may be possible to configure your browser to do a freshness check every time it visits a page; this will be slower than checking once per session (it still involves sending a message to the server), but will still have the benefit of caching the actual page content. In NetscapeNavigator, this can be done by going to Edit > Preferences > Advanced > Cache and setting 'Page in cache is compared to page on network' to 'every time'.

The more subtle rendering-related problem can probably be overcome by holding down shift while hitting reload, which ought to bypass the freshness check and just load a copy straight from the server. A more radical solution to the rendering problem is to do a null edit on the page: go to 'edit', then just hit 'submit' - the page will be updated without any actual change being made. However, the modification date will be changed, so a normal reload should pick up the fresh version.

If, like TheaLogie, you find yourself trapped on a locked-down computer where you can't reconfigure the web browser or even hit reload, then you are stuffed. The management suggests you log off and read a book instead.

_No worries - they fixed that problem. Thank goodness --TL_

= Solutions

A possible, if loopy, technical solution would be to make the hyperlinks to pages look something like , where X corresponds to the current version number (or timestamp or something) of the page. Thus, a link to a page would always be a link to the latest version (or at least, what was the latest version at the time the page was rendered), so browsers wouldn't use a stale cached version. Of course, if the page hadn't changed since it was last visited, then the browser would recognise the URL and correctly use a cached copy. This would involve a bit of code in various places, and would involve examining every linked-to file to determine its version number or timestamp, which _probably_ wouldn't involve too much runtime cost, since the directory entries are being loaded to determine file existence anyway. This solution would be a natural consequence of WikiHistory. TwicI could still recognise URLs without version properties; it would simply emit the latest copy of the page, probably along with a PragmaNoCache directive.

There may also be much simpler solutions or pseudo-solutions based on more sophisticated use of the HTTP cache control headers, like Expires and stuff. ETags and the CacheControlHeader look particularly promising, although they may also be particularly frustrating. Also, it's not clear how widely supported they are.

A technical solution to the subtle rendering-related problem would be to emit a LastModifiedHeader for the _rendering_, based on the modification times of the page text, the states of the linked pages, and the wiki engine. Yuck.

The best solution would be to mark pages with a LastModifiedHeader, so that they can be cached, but force the client to revalidate them every time they are visited. It looks like it might be possible to do this by marking pages with an expiry date that is in the past (using an Expires or Cache-Control: max-age header); this would force a cache-conscious reload. Alternatively, the Cache-Control: must-revalidate header may do this.
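As a rough sketch of what this 'best solution' would mean on the server side - Python rather than TwicI's own code, with invented function names and a fictitious timestamp - the wiki script would mark each page with a Last-Modified header so it can be cached, send Cache-Control and Expires headers that force revalidation on every visit, and answer conditional requests with a 304 where the client's copy is still good:

  # Headers for a cacheable-but-always-revalidated wiki page, plus the
  # decision between a full response and '304 Not Modified'.

  from email.utils import formatdate, parsedate_to_datetime

  def cache_headers(page_mtime):
      """Headers to emit with every rendered page."""
      return [
          ("Last-Modified", formatdate(page_mtime, usegmt=True)),
          # max-age=0 means 'stale immediately'; must-revalidate forbids
          # serving the stale copy without checking back with the server.
          ("Cache-Control", "max-age=0, must-revalidate"),
          # Belt and braces for HTTP/1.0 caches: an Expires date in the past.
          ("Expires", "Thu, 01 Jan 1970 00:00:00 GMT"),
      ]

  def respond(if_modified_since, page_mtime):
      """Decide between '304 Not Modified' and a full '200 OK' response."""
      if if_modified_since is not None:
          try:
              cached = parsedate_to_datetime(if_modified_since).timestamp()
              if page_mtime <= cached:
                  return 304  # client's copy is still fresh
          except (TypeError, ValueError):
              pass            # unparsable date: fall through and resend
      return 200              # send the full, freshly rendered page

  # Example: a page last saved at (fictitious) Unix time 1000000000.
  print(cache_headers(1000000000.0))
  print(respond("Sun, 09 Sep 2001 01:46:40 GMT", 1000000000.0))  # -> 304
  print(respond("Sat, 01 Sep 2001 00:00:00 GMT", 1000000000.0))  # -> 200

The point of the combination is that the page still carries a validator (the Last-Modified date), so the browser keeps a copy and only ever fetches the headers-plus-304 on an unchanged page, rather than the whole rendering.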
The spec says "An expiration time cannot be used to force a user agent to refresh its display or reload a resource; its semantics apply only to caching mechanisms, and such mechanisms need only check a resource's expiration status when a new request for that resource is initiated"; in our situation, this is a _good_ thing.

We might use ETags here; these are to some extent orthogonal to the general problem of staleness, but might be a useful adjunct. We could either generate them ourselves (eg from a timestamp) or use CgiBuffer. An advantage of ETags is that, according to the spec, they can be used "to avoid certain paradoxes that might arise from the use of modification dates", which can only be a good thing: it would be a pain if the wiki were to vanish into a temporal black hole.

The philosophical framework for cache control in HTTP puts everything in terms of 'validators'; a validator is some piece of information specific to a copy of a page which can be used to determine if two copies are the same, or rather if one is fresher than the other. The classic validator is the LastModifiedHeader, but ETags are also validators. The spec distinguishes between two classes of validator: strong, which can be used to detect any change whatsoever, and weak, which can only be used to detect semantically significant changes. This distinction actually goes to the heart of the subtle secondary WikiCacheProblem, about freshness of renderings: when the rendering of a page depends on the state of other pages, the last-modified date of the WikiText is only a (very) weak validator. ETags are usually supposed to be strong, but weak ETags can be made and used (not sure how); there is a rough sketch of both kinds at the foot of this page.

A problem here is that some of this is HTTP 1.1 specific, so it might not work on all browsers. Also, TwicI should generate a Date header on all pages (ToDo).

= Further reading

For information and tests about web caching, see:

- http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html
- http://www.mnot.net/cache_docs/
-- http://www.mnot.net/cgi_buffer/ the CgiBuffer library - looks useful
- http://www.web-caching.com/
- http://www.procata.com/cachetest/

CategoryWiki CategoryGeekery
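As promised above, a rough sketch of how strong and weak ETag validators might be generated and compared - plain Python with invented helper names, not a description of what TwicI actually does. The strong tag is derived from the finished rendering, so it changes whenever the rendering changes at all; the weak tag (prefixed W/) is derived only from the WikiText's modification time, which is exactly why it is 'weak' in the sense discussed under validators.

  import hashlib

  def strong_etag(rendered_html):
      """Strong validator: changes on any change to the rendered bytes."""
      return '"%s"' % hashlib.sha1(rendered_html).hexdigest()

  def weak_etag(wikitext_mtime):
      """Weak validator: based only on when the WikiText itself was saved."""
      return 'W/"%d"' % int(wikitext_mtime)

  def if_none_match_matches(header_value, current_etag):
      """Weak comparison of an If-None-Match header against the current tag:
      the W/ prefix is ignored, as HTTP allows for GET revalidation.
      (Naively splits on commas; real tags don't contain them in practice.)"""
      def bare(tag):
          tag = tag.strip()
          return tag[2:] if tag.startswith("W/") else tag
      if header_value.strip() == "*":
          return True
      return any(bare(t) == bare(current_etag)
                 for t in header_value.split(","))

  page = b"<html><body>Hello, wiki</body></html>"
  tag = strong_etag(page)
  print(tag)                                                  # quoted SHA-1 of the rendering
  print(if_none_match_matches(tag, tag))                      # True -> answer 304
  print(if_none_match_matches('W/"999"', weak_etag(999.0)))   # True -> answer 304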