What is most needed right now is a taxonomy of link problems and
problem sources. It is difficult for people to understand the scope
of these issues without laying out a detailed taxonomy. Although I
have one inside my head, I just haven't found the time to write it out
and fill in the holes.
Martijn writes:
> I have just spent my weekly half hour looking at my server's error
> logs, trying to find failed retrievals that are caused by broken
> links. Thanks to the CERN HTTPD's handy logging of Referer, and an
> error-log summarising Perl script I ended up writing, this is
> reasonably effective. I fixed one local broken link, and mailed about
> 5 remote sites. Especially the latter is tedious; you have to make
> time to write a message (even if that is also facilitated by a Perl
> script :-), and you're fixing other people's mistakes (some of the
> broken URLs never even existed!)
>
> So I'm wondering if we're ever going to do something about this. It
> is obvious that many people (including myself) don't (want to) run
> local link-checking robots regularly enough. And the "offending"
> servers don't know they're serving broken links unless they're local.
> But that ought to change, IMHO.
Maybe. I spent a great deal of time thinking about this problem, and
the fact is that a broken link is not always a "real" problem. Furthermore,
a protocol-level, automated notification system is almost never desirable --
there is simply no way to be accurate enough in targeting the notification
to prevent the notification system itself from being worse than the broken
link.
There is also a significant difference between proactive maintenance
of links (via a tool like MOMspider) and reactive maintenance of links.
Proactive maintenance requires some form of traversal tool. Reactive
maintenance can be performed via accurate error-log reporting and analysis.
However, in both cases, some human inspection of the problem is required
before a (possibly computer-assisted) notification can be sent.
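To give a feel for the reactive side, the log-summarising script Martijn
mentions needn't be more than a screenful of Perl.  A sketch only -- it
assumes, purely for illustration, that each error_log line ends with the
failed URL followed by the Referer, which is almost certainly not what your
server actually writes, so adjust the split to match:
#!/usr/bin/perl
# Summarise failed retrievals by broken URL and Referer.
my %count;
while (<>) {
    my @field = split;
    next unless @field >= 2;
    my ($url, $referer) = @field[-2, -1];
    $count{"$url  (from $referer)"}++;
}
foreach my $pair (sort { $count{$b} <=> $count{$a} } keys %count) {
    printf "%5d  %s\n", $count{$pair}, $pair;
}
Run over the error_log, that prints a count for each broken-URL/Referer
pair, most frequent first -- but the output is still just raw material.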
Even after the problem has been identified, it is difficult to ascertain
who (or what) should be notified. Quite often the mail address of the
person to notify is not on the Referer: page itself, but on some page above it.
Sometimes the notification address is also in error (or obsolete).
Most importantly, however, there are issues of scale -- if notification
is sent on every bad request, a popular link may result in several thousand
notifications in just one day. If the "problem" is just a temporary NFS
error local to the server's site, the maintainer is faced with several
thousand false alarms.
For example, the following problem occurred on my server this month.
All ICS undergrads in our department have a user account on our network
of Sun servers/workstations. These accounts (if the proper permissions
are applied and a special subdirectory is created) are accessible via the
ICS webserver. Thus, each undergrad can build their own webspace, without
any interference from me (the ICS webmaster).
Unfortunately, these are not permanent accounts -- they expire six months
after an undergrad leaves the University (preferably by graduating) -- and
the beginning of this month just happens to be six months after the first
web-conscious graduating class up-and-left. Now, for most individuals,
this was no big deal, since their personal webspaces were taken down long
ago and they were not very interesting anyway. However, one individual
not only had stuff worth linking to, but he had the unfortunate habit
of advertising his stuff to every meta-index web in the known universe.
Naturally, he didn't tell me he graduated, and thus I had no opportunity
to move his worthwhile stuff to some better location and redirect the
requests before his account got zeroed-out. One day it was a perfectly
good link, the next it resulted in several thousand error_log messages.
Similarly, it would not have done any good for the owner's email address
to be advertised with the link -- his e-mail account was equally dead.
So, what we have here is a link that is broken *because* the resource is
permanently no longer available (as opposed to never having been available,
available at some other location, temporarily unavailable, temporarily
moved, permanently moved, etc.). Any notification that I send out should
include the reason --
after all, I am asking the owner of the link to do us a favor by permanently
removing the link.
Before sending notifications, however, I needed to find the Referers.
Now, if I had more free time, I'd add logic to my NCSA httpd server
to record more info in the error_log file. Instead, I just added a redirect
Redirect /~lsanchez http://www.ics.uci.edu/cgi-bin/gone
and wrote a quick perl cgi hack that sent the user a polite message saying
the resource was no longer available, while at the same time recording
the REFERER, REMOTE_HOST, REMOTE_ADDR, REMOTE_USER, PATH_INFO, PATH_TRANSLATED,
and HTTP_USER_AGENT (I wasn't sure exactly what I would need).
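The script was nothing special -- something along these lines (a sketch of
the general idea rather than the exact hack; the log location is just for
illustration):
#!/usr/bin/perl
# cgi-bin/gone -- polite "no longer available" notice, plus logging.
if (open(LOG, ">>/tmp/gone_log")) {
    print LOG join("\t",
        map { $ENV{$_} || '-' } qw(HTTP_REFERER REMOTE_HOST REMOTE_ADDR
            REMOTE_USER PATH_INFO PATH_TRANSLATED HTTP_USER_AGENT)), "\n";
    close(LOG);
}
print "Content-type: text/html\n\n";
print "<HEAD><TITLE>Resource permanently removed</TITLE></HEAD>\n";
print "<BODY><H1>Resource permanently removed</H1>\n";
print "The account holding this document has expired, so the document is\n";
print "gone for good.  Please remove any links that point here.</BODY>\n";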
The next day, I sorted through the output and found 6 unique Referers at
four different sites (some meta-indexes had multiple links). Note that
these six references are several orders of magnitude fewer than the number
of broken-link traversals that occurred (many of which had invalid Referers).
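(Pulling the distinct Referers back out of a tab-separated log like the one
above is a one-liner:
perl -F'\t' -ane '$n{$F[0]}++; END { print "$n{$_}\t$_\n" for sort keys %n }' /tmp/gone_log
which prints each Referer once, with a count in front of it.)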
I then turned off the logging portion of the script -- users still get a
polite message, but I'm a little low on disk space at the moment.
I used XMosaic to visit those sites, wandered around until I found the
correct e-mail address for fixes, and then sent off a short message to
that address (I could have automated this further, but it was easier to
just cut-n-paste). Over the course of the following week, I received
responses from the maintainers of those sites indicating that they had
removed the links, along with thanks for reporting it to them. I don't
think they would have been quite so thankful if I caused several thousand
automatic notifications to be sent their way.
So, that's my most-recent experience. Now, how does it stack up to Martijn's
list of random thoughts...
> Some random thoughts about this:
>
> Idea 1: This could be changed by having clients that find a broken
> URL send the offending server an HTTP/1.1 method BROKEN, with two
> fields: URL (the broken URL) and Referer (the URL of the page with the
> broken URL). A server can then log this, for later analysis by
> humans/Perl scripts/whatever. Obviously a client doesn't do this if
> the user cancelled or the connection timed out, or if there is no
> Referer.
Well, assuming that the link was valid up to the path (i.e., the scheme
and server name were correct in the otherwise bad link), this information
could have easily been recorded in the server's error log upon the first
request (and thus there is no need for a second one). Worse, if the scheme
or server name were incorrect, you've just doubled the number of invalid
requests.
>...
> Idea 2: rather than the client implementing this, the server can do so
> instead; when finding a failed URL it can initiate the BROKEN method
> to the server found in the Referer (pity so many Referers lie). This
> also reduces the repeats if a server remembers it has flagged a
> particular error situation.
And if the error is only a temporary problem? Sections of my server go
down several times a month (sometimes on purpose ;-) -- I don't want all
references to my site being removed because of a 1% downtime problem.
Even if this idea is desirable, it would make more sense for a separate
program to be sending the BROKEN method based upon the contents of a logfile.
In fact, it would be trivial to write such a program using libwww-perl.
In this way, a live human could be queried before the message is sent,
and the user could see whether or not the message was accepted. In general,
I think it's a bad idea to make origin servers also perform client actions.
Hmmmmmmmm....I bet you could write such a script faster than I wrote
this message.
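To make that concrete, here is roughly what I have in mind -- entirely
hypothetical, of course, since BROKEN is not a real method, and the
IO::Socket::INET module is used only to keep the example short:
#!/usr/bin/perl
# Hypothetical one-shot BROKEN notifier: takes the broken URL and the
# Referer (from a logfile, or wherever), asks a human first, then sends
# the report to the server named in the Referer and shows the response.
use IO::Socket::INET;
my ($broken_url, $referer) = @ARGV;
my ($host, $port, $path) = $referer =~ m!^http://([^/:]+):?(\d*)(/\S*)?!
    or die "Referer is not an http URL\n";
$port ||= 80;
$path ||= '/';
print "Tell $host that $broken_url (referenced by $referer) is broken? [y/n] ";
exit 0 unless <STDIN> =~ /^y/i;
my $sock = IO::Socket::INET->new(PeerAddr => $host, PeerPort => $port)
    or die "cannot connect to $host: $!\n";
print $sock "BROKEN $path HTTP/1.0\r\n",
            "URL: $broken_url\r\n",
            "Referer: $referer\r\n\r\n";
print "Server said:\n", <$sock>;   # so the human sees whether it was accepted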
>...
> Idea 3: send the responsible person's mail address in the http
> request. The client can then mail automatically when it finds a broken
> link, and the user can mail manually with other comments. I don't
> like that at all, as it relies on email, fills up your mailbox, and can
> easily be abused.
Yep, a total disaster due to the problems of scale, privacy, excess junk
in every request, and the minor problem of knowing who is the responsible
person in the first place.
> Idea 4: Instead of sending an email address, send a URL that is to be
> retrieved should the link be found broken. This can be a script that
> either logs the event, or is clever enough to take another action.
> This is a variant of Idea 1, but smells rather of a hack.
Again, if you got to the right server in the first place (and they care
enough to want this information), the error log already contains the
relevant information.
> In all of the above special consideration has to be given to caches;
> obviously a cached document can still be wrong if the original has
> been rectified.
Yes, though it is likely that the time between notifying the owner and
the owner fixing the original will be much greater than the time between
the original being changed and the cache being updated.
Now, if only I had the time, we could work up a solution and implement it.
However, it would still only be a partial solution, since reactive
maintenance can only go so far. But then, if I had the time, I'd implement
SuperMOM and solve all the problems at once. ;-)
.......Roy Fielding ICS Grad Student, University of California, Irvine USA
<[email protected]>
<URL:http://www.ics.uci.edu/dir/grad/Software/fielding>