At present, the BackLinks facility is implemented in the simplest (ie stupidest) possible way: it searches every file in the wiki for links to the target page. This is expensive.

For the purposes of analysing performance, say there are N pages in the wiki, of average size M. Say that page views happen with frequency v, edits with frequency e, and backlink searches with frequency b, and let k be the average number of links into or out of a page. The current approach imposes a cost on each backlink search of O(N*M), with a fairly large constant (open and read a page; strictly, the constant for opening applies to the O(N) part, but never mind).

Here are a couple of alternatives using what we'll call a 'backlink buffer' (BB): a set attached to each page which lists all the pages with links to that page. There are two strategies for managing and using the BB: exact and conservative.

In the exact strategy, the BB always lists exactly those pages which link to its page: every page listed links to the target page, and no page not listed does. Using the BB is then simple - the backlink script just dumps its contents to HTML, an operation whose time is dominated by the time to read the (small) BB file, and is thus essentially O(1). Managing the BB is more complicated: whenever a page is edited, many BBs may have to be updated. Links added to the page are dealt with easily - the corresponding BB can be opened, and a link to the edited page added if necessary. Links removed from the page are harder: for the edit script to know which BBs need entries removed, it needs a record of which BBs referred to the edited page before the edit. This could be determined from the old version of the page (which would need to be analysed before the new copy is saved, of course), or from a dedicated outbound link buffer (OB). Either way, every BB altered would need to be read, edited and saved. Edits would thus be O(k), with a large constant, corresponding to opening, reading and then writing a file, plus an O(1) cost for reading the old copy of the edited page.

In the conservative strategy, the BB is less accurate: it lists all the pages which may contain references to its page (ie no page not in the BB may have a link to its page), and it may contain duplicate entries. The BB is now easier to manage: an edited page simply appends its name to the BBs of all the pages it refers to, and does not need to remove any entries or enforce any one-entry constraint. This is still O(k), but with a smaller constant (and with a much smaller O(1) term). Using the BB is, however, more complex: rather than just dumping the whole thing, the backlink script needs to uniquify the BB and then check every listed page to see if it still contains a reference to the target page. This is O(k'), where k' is the average number of entries in a BB. Once the BB has been normalised in this fashion, it can also be written back to disk, to reduce the amount of work next time.

The exact strategy has better O-numbers, but given the ratio e/b, it might be that the conservative strategy's lower per-edit constant will win out.
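As a concrete illustration of the conservative strategy, here is a minimal sketch in Python - not the actual wiki code; the 'bb' and 'pages' directory layout, the file-per-page format and the function names are all invented for this example:

    import os
    import re

    BB_DIR = "bb"        # hypothetical: one backlink-buffer file per page
    PAGE_DIR = "pages"   # hypothetical: one source file per page

    WIKI_WORD = re.compile(r"\b(?:[A-Z][a-z]+){2,}\b")   # crude WikiName matcher

    def links_out(page):
        # Pages referred to by 'page'; assumes WikiName-style links.
        try:
            with open(os.path.join(PAGE_DIR, page)) as f:
                return set(WIKI_WORD.findall(f.read()))
        except FileNotFoundError:
            return set()

    def record_edit(page):
        # Conservative update, O(k): append our name to each target's BB.
        # No removals and no uniqueness check - the BB may over-approximate.
        for target in links_out(page):
            with open(os.path.join(BB_DIR, target), "a") as bb:
                bb.write(page + "\n")

    def backlinks(page):
        # Resolve a BB, O(k'): uniquify, keep only candidates which still
        # link to 'page', and write the normalised list back so the next
        # search has less to do.
        path = os.path.join(BB_DIR, page)
        try:
            with open(path) as bb:
                candidates = set(bb.read().split())
        except FileNotFoundError:
            return []
        live = sorted(c for c in candidates if page in links_out(c))
        with open(path, "w") as bb:
            bb.write("\n".join(live) + "\n")
        return live

The write-back at the end of backlinks() is the 'normalised and written back to disk' step described above; the exact strategy would differ by having record_edit() also consult the old copy of the page (or an OB) and remove stale entries.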
You could do a lazy daemonisation process, whereby you use the conservative mechanism, but when you resolve the links on access to the backlinks page, you write out the resolved list, normalising it. This means that for O(1) time (writing out the newly generated list) you get most of the benefits of the exact strategy, and no wasted effort normalising lists that are never viewed. - DS

I _think_ that's what i was intending to do anyway. This has all got a bit fuzzy since i decided to go with the TrikiWiki link files approach. -- TA

Another alternative is to augment the naive system with a cache of the last set of results; the only pages which would need to be examined would be those which had changed since the last backlink search (which can be dated from the cache file's timestamp). Not every page would necessarily have a cache file. This may be at once the simplest and most effective approach.

Some of these approaches may benefit from a daemon process which runs in the background and cleans things up. In the conservative BB system, the daemon could normalise BBs. In the cached system, the daemon could expire old caches.

One other vaguely-thought-out approach might be to capture the link topology of the entire wiki in one place (eg a DBM file). The algorithms would then be similar to the other implementations, but they might run faster because only one file is being accessed, and that file has an optimised structure.

Hopefully, this will be discussed near here: http://c2.com/cgi/wiki?WikiProgramming
It is already touched on here: http://c2.com/cgi/wiki?ReverseLinkDisabled

TrikiWiki handles backlinks using a conservative record to implement its 'These pages lead here' feature. Rather than buffer files, it has a directory (called 'links') in which it stores zero-length files whose names encode the links: for every link from A to B, a file called A-B exists. This makes link searches as simple as globbing the directory, both to construct BackLinks lists and to make positive and negative updates on edits. In fact, the BackLinks search is fast enough that TrikiWiki does it at the bottom of every page, which is pretty cool. This is probably the approach i'll take. The only non-trivial modification would be to use a DBM database rather than a directory, but that's probably overkill.

NB the old BackLinksImplementation (which reads all the files on every search) is now known as 'thick links'; the new BackLinksImplementation (which reads one file on every update) is known as 'thin links'. Not that this matters.

_Now also wondering about WikiVsRobots ..._

WilliamRamsden thought we should "avoid using the currently rather stressed backlinks system" and wondered "if 'Category...' is, as I presume, one of the most backlink-invoking aspects of the Wiki, would it be an idea to spread this practice, just to cut down on usage of the backlinks facility?"

To which TomAnderson replied: "On the contrary. The backlink facility is a wonderful thing, and we should use it more, it's just that I really, really need to get round to improving the BackLinksImplementation. I'll do this right after I finish moving lab, get some really pressing labwork done and finish applying for PhDs and squash all the irritating little TwicIBugs that are about at present. Probably some time in the new year, then. Also, the way it is implemented at present, adding a category tag has no particular cost to the system, and doing a backlink search has a cost proportional to the total number of pages or the total amount of text on the wiki, so the number of category tags doesn't affect the speed at all. In fact, it does have a slight effect, but that's to make the search *faster* if a link to the target page is present."
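For comparison, here is a rough sketch of the link-directory scheme - a guess at its shape from the description above, not TrikiWiki's actual code; only the 'links' directory and the A-B file naming come from the description, the function names are invented:

    import glob
    import os

    LINK_DIR = "links"   # directory of zero-length marker files, one per link

    def link_path(src, dst):
        # Assumes page names never contain '-', which is true of WikiNames.
        return os.path.join(LINK_DIR, src + "-" + dst)

    def add_link(src, dst):
        # Positive update on edit: touch links/src-dst.
        open(link_path(src, dst), "a").close()

    def remove_link(src, dst):
        # Negative update on edit: the link no longer exists.
        try:
            os.remove(link_path(src, dst))
        except FileNotFoundError:
            pass

    def backlinks(page):
        # 'These pages lead here': glob for *-page and strip the suffix.
        suffix = "-" + page
        return sorted(os.path.basename(p)[:-len(suffix)]
                      for p in glob.glob(os.path.join(LINK_DIR, "*" + suffix)))

On an edit, the script would diff the old and new sets of outbound links and call add_link/remove_link accordingly; since the files are zero-length, all the information lives in the names, so a backlink search never has to open anything.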
Quick note: the thick links system has been modified as part of the work towards thin links. It should now be even slower. However, it should also be more correct - the old version would do suspicious things in some cases (like seeing WikiNames in the middle of WikiRawText blocks and even WikiPictures).

CategoryWiki CategoryGeekery