Back Links Implementation


At present, the Back Links facility is implemented in the simplest (ie stupidest) possible way - it searches every file in the wiki for links to the target page. This is expensive.
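For concreteness, here is a minimal sketch of that thick-links search in Python; the page directory name and the idea that a link is just a bare occurrence of the target's name are assumptions for illustration, not the actual Twic I code.

    import os
    import re

    PAGE_DIR = 'pages'   # hypothetical directory holding one file per wiki page

    def thick_backlinks(target):
        # Scan every page for a link to the target: O(NM), with a large constant.
        # A 'link' here is crudely taken to be the target's name as a whole word.
        pattern = re.compile(r'\b' + re.escape(target) + r'\b')
        hits = []
        for name in sorted(os.listdir(PAGE_DIR)):
            with open(os.path.join(PAGE_DIR, name)) as f:
                if pattern.search(f.read()):
                    hits.append(name)
        return hits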

In analysing performance, let's say there are N pages in the wiki, of average size M. Let's say that page views happen with frequency v, edits with frequency e, and backlink searches with frequency b. Let k be the average number of links into or out of a page.

The current approach imposes a cost on each backlink search of O(NM), with a fairly large constant from opening and reading each page (strictly, the cost of opening belongs to an O(N) term, but never mind).

A couple of alternatives using what we'll call a 'backlink buffer' (BB): a set attached to each page which lists all the pages with links to that page. There are two strategies for managing and using the BB: exact and conservative.

In the exact strategy, the BB lists exactly those pages which link to its page: every page named in the BB links to the target page, and no page not named in it does. Using the BB is then simple - the backlink script simply dumps its contents to HTML, an operation whose time is dominated by the time to read the (small) BB file, and is thus essentially O(1). Managing the BB is more complicated; whenever a page is edited, it has to update many BBs. Any links added to the page are dealt with easily - the corresponding BB can be opened, and a link to the edited page added if necessary. Links removed from the page are more difficult to deal with: in order for the edit script to know which BBs need to have entries removed from them, it needs a record of which BBs had references to the edited page before the edit. This could be determined from the old version of the page (which would need to be analysed before the new copy is saved, of course), or from a dedicated outbound link buffer (OB). Either way, every BB altered would need to be read, edited and saved. Edits would thus be O(k), with a large constant, corresponding to opening, reading and then writing a file. There is also an O(1) cost associated with reading the old copy of the edited page.
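A sketch of that edit-time bookkeeping, assuming one BB file per page (one referring page name per line) and a crude CamelCase matcher standing in for the real link parser:

    import os
    import re

    BB_DIR = 'backlinks'   # hypothetical: one BB file per page, listing referrers

    def extract_links(text):
        # Stand-in for the real parser: treat CamelCase words as Wiki Names.
        return set(re.findall(r'\b[A-Z][a-z]+(?:[A-Z][a-z0-9]+)+\b', text))

    def read_bb(page):
        path = os.path.join(BB_DIR, page)
        if not os.path.exists(path):
            return set()
        with open(path) as f:
            return set(f.read().split())

    def write_bb(page, entries):
        with open(os.path.join(BB_DIR, page), 'w') as f:
            f.write('\n'.join(sorted(entries)) + '\n')

    def exact_update(page, old_text, new_text):
        # Compare outbound links before and after the edit, and touch only
        # the BBs of pages whose link status actually changed: O(k) updates.
        old_links = extract_links(old_text)
        new_links = extract_links(new_text)
        for target in new_links - old_links:
            write_bb(target, read_bb(target) | {page})
        for target in old_links - new_links:
            write_bb(target, read_bb(target) - {page})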

In the conservative strategy, the BB is less accurate: it lists all the pages which may contain references to its page (ie no page not in the BB may have a link to its page), and it may contain duplicate entries. The BB is now easier to manage; an edited page simply appends its name to the BBs of all the pages to which it refers, and does not need to remove any entries or enforce any one-entry constraints. This is still O(k), but with a smaller constant (and with a much smaller O(1) term). Using the BB is, however, more complex; rather than just dumping the whole thing, the backlink script needs to uniquify the BB and then check every listed page to see if it contains a reference to the target page. This is O(k'), where k' is the average number of entries in a BB. Once the BB has been normalised in this fashion, it can also be written back to disk, to reduce the amount of work next time.
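And a sketch of the conservative version, reusing the assumed helpers from the sketches above (extract_links, read_bb, write_bb, BB_DIR, PAGE_DIR); the backlink script does the checking and writes the normalised list back:

    def conservative_update(page, new_text):
        # On edit, blindly append this page's name to the BB of every page it
        # links to: no old-version analysis, no duplicate checks, cheap O(k).
        for target in extract_links(new_text):
            with open(os.path.join(BB_DIR, target), 'a') as f:
                f.write(page + '\n')

    def conservative_backlinks(target):
        # On search, uniquify the BB and verify that each candidate really
        # still links to the target; this is O(k').
        live = set()
        for candidate in read_bb(target):
            path = os.path.join(PAGE_DIR, candidate)
            if os.path.exists(path):
                with open(path) as f:
                    if target in extract_links(f.read()):
                        live.add(candidate)
        write_bb(target, live)   # write the normalised BB back to disk
        return sorted(live)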

The exact strategy has better O-numbers, but since edits are likely much more frequent than backlink searches (ie e/b is large), the conservative strategy's lower per-edit constant might win out.

You could do a lazy demonisation process, whereby you use the conservative mechanism, but when you resolve the links on access to the backlinks page, you write out the resolved list, normalising it. This means that for O(1) time (writing out the newly generated list) you get most of the benefits of the accurate strategy, and no wasted effort normalising lists that are never viewed. - DS

Another alternative is to augment the naive system with a cache of the last set of results; the only pages which would need to be examined would be pages which had changed since the last backlink search (which can be dated by the cache file). Not every page would necessarily have a cache file. This may be at once the simplest and most effective approach.
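A sketch of that cached variant, again reusing PAGE_DIR and extract_links from the sketches above, with an assumed per-page cache file whose modification time dates the last search:

    CACHE_DIR = 'backlink-cache'   # hypothetical: one result file per target page

    def cached_backlinks(target):
        cache_path = os.path.join(CACHE_DIR, target)
        cached, since = set(), 0.0
        if os.path.exists(cache_path):
            since = os.path.getmtime(cache_path)
            with open(cache_path) as f:
                cached = set(f.read().split())
        # Re-examine only the pages edited since the cache file was written.
        for name in os.listdir(PAGE_DIR):
            path = os.path.join(PAGE_DIR, name)
            if os.path.getmtime(path) <= since:
                continue
            with open(path) as f:
                if target in extract_links(f.read()):
                    cached.add(name)
                else:
                    cached.discard(name)
        with open(cache_path, 'w') as f:
            f.write('\n'.join(sorted(cached)) + '\n')
        return sorted(cached)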

Some of these approaches may benefit from a daemon process which runs in the background and cleans things up. In the conservative BB system, the daemon could normalise BBs. In the cached system, the daemon could expire old caches.

One other vaguely-thought-out approach might be to capture the link topology of the entire wiki in one place (eg a DBM file). The algorithms would then be similar to the other implementations, but they might run faster because only one file is being accessed, and that has an optimised structure.
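A sketch of that idea using Python's dbm module for the single topology file; the key scheme (one entry per target page, mapping to a space-separated list of referrers) is an assumption, not a worked-out design:

    import dbm

    def dbm_add_link(db_path, source, target):
        # Record 'source links to target' in the shared topology file.
        with dbm.open(db_path, 'c') as db:
            key = target.encode()
            referrers = set(db.get(key, b'').split())
            referrers.add(source.encode())
            db[key] = b' '.join(sorted(referrers))

    def dbm_remove_link(db_path, source, target):
        with dbm.open(db_path, 'c') as db:
            key = target.encode()
            referrers = set(db.get(key, b'').split())
            referrers.discard(source.encode())
            db[key] = b' '.join(sorted(referrers))

    def dbm_backlinks(db_path, target):
        # One keyed lookup instead of opening many small files.
        with dbm.open(db_path, 'c') as db:
            return sorted(r.decode() for r in db.get(target.encode(), b'').split())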

Hopefully, this will be discussed near here:

http://c2.com/cgi/wiki?WikiProgramming

It is already touched on here:

http://c2.com/cgi/wiki?ReverseLinkDisabled

Triki Wiki handles backlinks using a conservative record to implement its 'These pages lead here' feature (<http://triki.publication.org.uk/ThesePagesLeadHere>). Rather than buffer files, it has a directory (called 'links') in which it stores zero-length files whose names encode the links; for every link from A to B, a file called A-B exists. This makes link searches as simple as globbing the directory, both to construct Back Links lists and for positive and negative updates on edits. In fact, the Back Links search is fast enough that Triki Wiki does it at the bottom of every page, which is pretty cool. This is probably the approach I'll take. The only non-trivial modification would be to use a DBM database rather than a directory, but that's probably overkill.
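A sketch of that scheme, with the 'links' directory name taken from the description above; the A-B filename encoding is ambiguous if page names can themselves contain '-', which is glossed over here:

    import glob
    import os

    LINK_DIR = 'links'   # zero-length files named 'Source-Target', one per link

    def record_link(source, target):
        # Positive update on edit: touch an empty file for the link.
        open(os.path.join(LINK_DIR, source + '-' + target), 'a').close()

    def remove_link(source, target):
        # Negative update on edit: delete the file if it exists.
        path = os.path.join(LINK_DIR, source + '-' + target)
        if os.path.exists(path):
            os.remove(path)

    def back_links(target):
        # A Back Links search is just a glob over the link directory.
        matches = glob.glob(os.path.join(LINK_DIR, '*-' + target))
        return sorted(os.path.basename(m).rsplit('-', 1)[0] for m in matches)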

NB The old Back Links Implementation (which reads all the files every search) is now known as 'thick links'; the new Back Links Implementation (which reads one file every update) is known as 'thin links'. Not that this matters.

Now also wondering about Wiki Vs Robots ...

William Ramsden thought we should "avoid using the currently rather stressed backlinks system" and wondered "if 'Category...' is, as I presume, one of the most backlink-invoking aspects of the Wiki, would it be an idea to spread this practice, just to cut down on usage of the backlinks facility?". To which Tom Anderson replied: "On the contrary. The backlink facility is a wonderful thing, and we should use it more, it's just that I really, really need to get round to improving the Back Links Implementation. I'll do this right after I finish moving lab, get some really pressing labwork done and finish applying for Ph Ds and squash all the irritating little Twic I Bugs that are about at present. Probably some time in the new year, then. Also, the way it is implemented at present, adding a category tag has no particular cost to the system, and doing a backlink search has a cost proportional to the total number of pages or the total amount of text on the wiki, so the number of category tags doesn't affect the speed at all. In fact, it does have a slight effect, but that's to make the search faster if a link to the target page is present.".

Quick note: the thick links system has been modified as part of the work towards thin links. It should now be even slower. However, it should be more correct - the old version would do suspicious things in some cases (like see Wiki Names in the middle of Wiki Raw Text blocks and even Wiki Pictures).

Category Wiki Category Geekery

