Doing some reading about the implementation details of Ward's Wiki ...
I wonder if i should protect the Back Links facility somehow, to prevent spiders, web crawlers and other robots from jamming the server. This would be fairly simple - make the script work via POST rather than GET, and turn the link into a form whose submit control still looks like a link (although i don't actually know how to do that; i'm not even certain it's possible). The page title could go in a hidden input, or could still be passed as a query-string parameter (i prefer the latter).
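Something like this might do it (just a sketch - the script name, the parameter name and the javascript trick for making the submit look like a link are all guesses on my part, not tested code):

    <form name="backlinks" method="post" action="backlinks.cgi?title=FrontPage">
      <!-- the title could go in a hidden input instead of the query string: -->
      <!-- <input type="hidden" name="title" value="FrontPage"> -->
      <!-- a link which submits the form, so it still looks like an ordinary link -->
      <a href="javascript:document.backlinks.submit()">Back Links</a>
      <!-- fallback for browsers without javascript -->
      <noscript><input type="submit" value="Back Links"></noscript>
    </form>

The query string stays in the action URL, so as far as i can tell the script could still read the title from there even though the request itself is a POST.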
There does appear to be quite a bit of the whole thing in the Google cache, including the edit pages. Maybe there should be some sort of robots.txt or other anti-Google mechanism on all of the non-page pages. I think you can put meta information in the head of a page to say not to cache it, though the page would still get generated every time a robot asked for it.
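The tag I'm thinking of is something like this (from memory, so the exact spelling may be off):

    <meta name="ROBOTS" content="NOARCHIVE">

NOARCHIVE is supposed to stop Google keeping a cached copy, and there's also a NOINDEX value to keep the page out of the index altogether.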
robots.txt isn't useful here because it has to be located at the site root, which would involve pestering the urchin admins even more (you really don't want to know how much trouble they get from Tom Anderson as it is ...). Meta tags are a great idea, though:
<head>
  <!-- ... -->
  <meta name="ROBOTS" content="NOINDEX">
  <!-- ... -->
</head>
The only problem is that they're merely advisory, so a simple or impolite robot could just ignore them. The POST approach seems stronger, since robots generally follow plain links but don't go around submitting forms; Tim uses it on the OUSFG website to protect a stash of email addresses.
Fri, 13 Dec 2002 20:39:39 GMT