Fossil with Commonmark: Defense Against Spiders

The website presented by a Fossil server has many hyperlinks. Even a modest project can have millions of pages in its tree, and many of those pages (for example diffs and annotations and ZIP archive of older check-ins) can be expensive to compute. If a spider or bot tries to walk a website implemented by Fossil, it can present a crippling bandwidth and CPU load.

The website presented by a Fossil server is intended to be used interactively by humans, not walked by spiders. This article describes the techniques used by Fossil to try to welcome human users while keeping out spiders.

The "hyperlink" user capability

Every Fossil web session has a "user". For random passers-by on the internet (and for spiders) that user is "nobody". The "anonymous" user is also available for humans who do not wish to identify themselves. The difference is that "anonymous" requires a login (using a password supplied via a CAPTCHA) whereas "nobody" does not require a login. The site administrator can also create logins with passwords for specific individuals.

The "h" or "hyperlink" capability is a permission that can be granted to users that enables the display of hyperlinks. Most of the hyperlinks generated by Fossil are suppressed if this capability is missing. So one simple defense against spiders is to disable the "h" permission for the "nobody" user. This means that users must log in (perhaps as "anonymous") before they can see any of the hyperlinks. Spiders do not normally attempt to log into websites and will therefore not see most of the hyperlinks and will not try to walk the millions of historical check-ins and diffs available on a Fossil-generated website.

If the "h" capability is missing from user "nobody" but is present for user "anonymous", then a message automatically appears at the top of each page inviting the user to log in as anonymous in order to activate hyperlinks.

Removing the "h" capability from user "nobody" is an effective means of preventing spiders from walking a Fossil-generated website. But it can also be annoying to humans, since it requires them to log in. Hence, Fossil provides other techniques for blocking spiders which are less cumbersome to humans.

Automatic hyperlinks based on UserAgent

Fossil has the ability to selectively enable hyperlinks for users that lack the "h" capability based on their UserAgent string in the HTTP request header and on the browsers ability to run Javascript.

The UserAgent string is a text identifier that is included in the header of most HTTP requests that identifies the specific maker and version of the browser (or spider) that generated the request. Typical UserAgent strings look like this:

Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0
Mozilla/4.0 (compatible; MSIE 8.0; Windows_NT 5.1; Trident/4.0)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Wget/1.12 (openbsd4.9)

The first two UserAgent strings above identify Firefox 19 and Internet Explorer 8.0, both running on Windows NT. The third example is the spider used by Google to index the internet. The fourth example is the "wget" utility running on OpenBSD. Thus the first two UserAgent strings above identify the requestor as human whereas the second two identify the requestor as a spider. Note that the UserAgent string is completely under the control of the requestor and so a malicious spider can forge a UserAgent string that makes it look like a human. But most spiders truly seem to desire to "play nicely" on the internet and are quite open about the fact that they are a spider. And so the UserAgent string provides a good first-guess about whether or not a request originates from a human or a spider.

In Fossil, under the Admin/Access menu, there is a setting entitled "Enable hyperlinks for "nobody" based on User-Agent and Javascript". If this setting is enabled, and if the UserAgent string looks like a human and not a spider, then Fossil will enable hyperlinks even if the "h" capability is omitted from the user permissions. This setting gives humans easy access to the hyperlinks while preventing spiders from walking the millions of pages on a typical Fossil site.

But the hyperlinks are not enabled directly with the setting above. Instead, the HTML code that is generated contains anchor tags ("<a>") without "href=" attributes. Then, javascript code is added to the end of the page that goes back and fills in the "href=" attributes of the anchor tags with the hyperlink targets, thus enabling the hyperlinks. This extra step of using javascript to enable the hyperlink targets is a security measure against spiders that forge a human-looking UserAgent string. Most spiders do not bother to run javascript and so to the spider the empty anchor tag will be useless. But all modern web browsers implement javascript, so hyperlinks will show up normally for human users.

Further defenses

Recently (as of this writing, in the spring of 2013) the Fossil server on the SQLite website (http://www.sqlite.org/src/) has been hit repeatedly by Chinese spiders that use forged UserAgent strings to make them look like normal web browsers and which interpret javascript. We do not believe these attacks to be nefarious since SQLite is public domain and the attackers could obtain all information they ever wanted to know about SQLite simply by cloning the repository. Instead, we believe these "attacks" are coming from "script kiddies". But regardless of whether or not malice is involved, these attacks do present an unnecessary load on the server which reduces the responsiveness of the SQLite website for well-behaved and socially responsible users. For this reason, additional defenses against spiders have been put in place.

On the Admin/Access page of Fossil, just below the "Enable hyperlinks for "nobody" based on User-Agent and Javascript" setting, there are now two additional subsettings that can be optionally enabled to control hyperlinks.

The first subsetting waits to run the javascript that sets the "href=" attributes on anchor tags until after at least one "mouseover" event has been detected on the <body> element of the page. The thinking here is that spiders will not be simulating mouse motion and so no mouseover events will ever occur and hence the hyperlinks will never become enabled for spiders.

The second new subsetting is a delay (in milliseconds) before setting the "href=" attributes on anchor tags. The default value for this delay is 10 milliseconds. The idea here is that a spider will try to render the page immediately, and will not wait for delayed scripts to be run, thus will never enable the hyperlinks.

These two subsettings can be used separately or together. If used together, then the delay timer does not start until after the first mouse movement is detected.

See also Managing Server Load for a description of how expensive pages can be disabled when the server is under heavy load.

The ongoing struggle

Fossil currently does a very good job of providing easy access to humans while keeping out troublesome robots and spiders. However, spiders and bots continue to grow more sophisticated, requiring ever more advanced defenses. This "arms race" is unlikely to ever end. The developers of Fossil will continue to try improve the spider defenses of Fossil so check back from time to time for the latest releases and updates.

Readers of this page who have suggestions on how to improve the spider defenses in Fossil are invited to submit your ideas to the Fossil Users mailing list: fossil-users@lists.fossil-scm.org.