Re: meta information

Roy T. Fielding ([email protected])
Fri, 03 Jun 1994 01:36:08 -0700


I'd like to respond to all the messages on this thread, but I'll limit
myself for the sake of brevity (and saving people's mailboxes ;-) ...

Nick's ideas about semantic tagging are good but well beyond the capabilities
of the META element -- if we try to do too many things with META, it will
lose its primary value (i.e., being easy to parse). I think Bert's reply about
embedded SGML definitions and Dan's regarding HyTime's typed links are
better long-term solutions for semantic tagging.

My needs are a little more pressing in that I need a method which will work
with HTML 2.0 compliant browsers -- if the spec is written such that rendering
content in the HEAD constitutes non-compliance, then I could live with just
treating all tags inside <HEAD></HEAD> as metainfo names (since I can assume
that non-compliant browsers will be fixed in a hurry).
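
To illustrate what I mean, here is a rough sketch (not MOMspider code, and the
routine name is made up; it assumes a well-formed HEAD with no nested elements):
every tag found inside the HEAD becomes a name/value pair, with no predefined
list of names.

sub head_to_meta                    # hypothetical helper, for illustration only
{
    local($head, *headers) = @_;    # $head = text between <HEAD> and </HEAD>
    local($name, $value);

    # Peel off each <NAME ...>value</NAME> (or empty <NAME ...>) in turn;
    # assumes no nested elements inside the HEAD.
    while ($head =~ s#<([A-Za-z][-A-Za-z0-9.]*)[^>]*>\s*([^<]*)(</\1\s*>)?##i)
    {
        ($name, $value) = ($1, $2);
        $name =~ tr/A-Z/a-z/;       # e.g. <EXPIRES>...</EXPIRES> -> 'expires'
        $headers{$name} = $value;
    }
}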

Dan wrote:

> Gee... this thing has really blown up. I see three issues:
>
> 1. How does the author express stuff that the server should
> use and stick in the HTTP headers? My answer:
> <EXPIRES http>...</expires>
> or, until implementations are fixed,
> <EXPIRES http content="...">

How does the author know that the only purpose for that information is the
HTTP headers? It is possible that multiple tools may be applied to it.
Should authors have to change all of their HTML files every time a new tool
is introduced?

>...
> I see three choices for extensions that don't break current and
> future clients:
> 1. special comment syntax, like the server-side includes
> in NCSA's httpd. These are really only a good idea if they
> get eliminated before the stuff goes over the wire, since
> a client doesn't know when it peeks into a comment and sees
> its special syntax whether the author put it there on
> purpose or as a coincidence.
>
> 2. processing instructions. Great for private applications.
>
> 3. New element names. It has worked so far.

Side note: They don't work at all when the elements contain content
which should not be rendered as normal text. Wouldn't it
be nice if we had a general mechanism for telling clients
to ignore a particular element's content if the tag is unknown?
I guess we'll have to wait for full SGML parsing within the client.

>...
> How many WWW implementations don't include the "skip tags you
> don't recognize" convention? I don't believe you have to write
> code each time you want to _ignore_ another tag. And I don't believe
> you can _act_ on a new tag _without_ writing more code.

On the contrary, you certainly can if the tag follows an identifiable pattern.
For instance, this is what MOMspider does while traversing a document:

1. GET it via the URL
2. Extract all the links and metainfo, e.g. (look out, it's Perl)

# ===========================================================================
# extract_links(): Extract the document metainformation and links from a
#                  page of HTML content and return them for use by a
#                  traversal program.
#
# The parameters are:
#    $base    = the URL of this document (for fixing relative links);
#    *headers = the %table of document metainformation (should already
#               contain some information from the HTTP response headers);
#    *content = the $page of HTML (this will be DESTROYED in processing);
#    *links   = the return @queue of absolute child URLs (w/o query or tag);
#    *labs    = the return @queue of absolute child URLs (with query or tag);
#    *lorig   = the return @queue of original child HREFs;
#    *ltype   = the return @queue of link types, where
#                  'L' = normal link,
#                  'I' = IMG source,
#                  'Q' = link or source containing query information,
#                  'R' = redirected link (used elsewhere).
#
# Uses ideas from Oscar's hrefs() dated 13/4/94 in
#    <ftp://cui.unige.ch/PUBLIC/oscar/scripts/html.pl>
#
sub extract_links
{
    local($base, *headers, *content, *links, *labs, *lorig, *ltype) = @_;
    local($link, $orig, $elem);

    $content =~ s/\s+/ /g;          # Remove all extra whitespace and newlines
    $content =~ s/<!--.*-->//g;     # Remove all SGML comments (I hope)

    if ($content =~ s#<TITLE>\s*([^<]+)</TITLE>##i)    # Extract the title
    {
        $headers{'title'} = $1;
    }

    $content =~ s/^[^<]+</</;       # Remove everything before first element
    $content =~ s/>[^<]*</></g;     # Remove everything between elements (text)
    $content =~ s/>[^<]+$/>/;       # Remove everything after last element

    # Isolate all META elements as text
    $content =~ s/<meta\s[^>]*name\s*=\s*"?([^">]+)[^>]*value\s*=\s*"?([^">]+)[^>]*>/M $1 $2\n/gi;
    $content =~ s/<meta\s[^>]*value\s*=\s*"?([^">]+)[^>]*name\s*=\s*"?([^">]+)[^>]*>/M $2 $1\n/gi;
    # Isolate all A element HREFs as text
    $content =~ s/<a\s[^>]*href\s*=\s*"?([^">]+)[^>]*>/A $1\n/gi;
    # Isolate all IMG element SRCs as text
    $content =~ s/<img\s[^>]*src\s*=\s*"?([^">]+)[^>]*>/I $1\n/gi;

    $content =~ s/<[^>]*>//g;       # Remove all remaining elements
    $content =~ s/\n+/\n/g;         # Remove all blank lines

    #
    # Finally, construct the link queues from the remaining list
    #
    foreach $elem (split(/\n/, $content))
    {
        if ($elem =~ /^A (.*)$/)            # Anchor HREF
        {
            $orig = $1;
            push(@lorig, $orig);
            $link = &absolutely($base, $orig);
            push(@labs, $link);
            $link =~ s/#.*$//;
            if ($link =~ s/\?.*$//)
            {
                push(@ltype, 'Q');
            }
            else
            {
                push(@ltype, 'L');
            }
            push(@links, $link);
        }
        elsif ($elem =~ /^I (.*)$/)         # IMG SRC
        {
            $orig = $1;
            push(@lorig, $orig);
            $link = &absolutely($base, $orig);
            push(@labs, $link);
            $link =~ s/#.*$//;
            if ($link =~ s/\?.*$//)
            {
                push(@ltype, 'Q');
            }
            else
            {
                push(@ltype, 'I');
            }
            push(@links, $link);
        }
        elsif ($elem =~ /^M\s+(\S+)\s+(.*)$/)   # META name/value pair
        {
            $link = $1;                     # Actually the metainformation name
            $orig = $2;                     # Actually the metainformation value
            $link =~ tr/A-Z/a-z/;
            $headers{$link} = $orig;
        }
        else { warn "A mistake was made in link extraction"; }
    }
}
# ===========================================================================

(sorry about that, but a page of Perl is worth a thousand words ;-)

3. MOMspider then passes the %headers on to the index output routine, which
   can (given the proper option) just output all of the headers via

   foreach $hd (keys(%headers))
   {
       print "$hd: $headers{$hd}\n";
   }

Note that at no time does the program need to know exactly what tag
names are used in the META elements -- it just acts as a filter.

Coincidentally, this is also the type of behavior we would want
from a server, with the exception that it should reject meta names
that are equal to those generated normally by the server response
(i.e. "date" and "last-modified"). Since the server knows what those
names are, it can do the check without knowing any other META names.
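
For example (just a sketch, not code from any actual server; the %reserved
list and the routine name are hypothetical), the check reduces to a lookup
against the names the server itself generates:

%reserved = ('date', 1, 'last-modified', 1);    # names the server generates itself

sub emit_meta_headers           # hypothetical routine, for illustration only
{
    local(*headers) = @_;
    local($name);

    foreach $name (keys(%headers))
    {
        next if $reserved{$name};               # reject names the server owns
        print "$name: $headers{$name}\r\n";     # pass everything else through
    }
}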

> ...
>
> What's the difference between ignoring
> <expires content="...">
> and ignoring
> <meta name="expires" value="...">
> ???

Nothing for a normal client, but quite a bit for a parser that is just
looking for identifiable metainfo to be passed on to some other process.
Yes, it's certainly doable, but I see no semantic difference between the
two forms above, and the second is vastly easier to parse.
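
To make that concrete (hypothetical patterns only, and the element names in
the second one are invented for the example): the META form needs a single
pattern for every present and future name, while the per-element form has to
enumerate each name it has been taught.

# One pattern covers any metainfo name, known or not:
while ($content =~ s/<meta\s[^>]*name\s*=\s*"?([^">]+)[^>]*value\s*=\s*"?([^">]+)[^>]*>//i)
{
    $headers{$1} = $2;
}

# ...versus listing every element the parser has been taught about
# (element names here are made up):
while ($content =~ s/<(expires|owner|reply-to)\s[^>]*content\s*=\s*"?([^">]+)[^>]*>//i)
{
    $headers{$1} = $2;
}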

> And if you're going to flag one as an error, why wouldn't you flag
> the other as an error?

If META were in the DTD, it would not be flagged as an error no matter how
many new names were invented for metainfo.

>...
> Hmmm... isn't this about like declaring an element X with
> attributes A1, A2, ... up to, oh, let's say A9. Use them
> for whatever you like. But beware of certain conventions
> about how they're used that aren't expressed in SGML, even
> though they could be...
>
> Come on! It's not that tough to maintain the DTD as a community,
> is it? Do we _have_ to escape out of SGML all over the place?

If all clients parsed SGML and all clients kept track of DTD versions
and all HTML files contained a pointer to their particular DTD version,
then this type of argument would make sense. However, that is certainly
not the case. Currently, the "official" DTD has not been updated for
more than a year. This is not surprising given the constraints of our
community and I don't expect the updates after 2.0 to be any more rapid.

Why is it that document structure is only identified via element tags?
Why is <META name="Expires" value="..."> any less SGML kosher than
<HEAD><EXPIRES>...</EXPIRES></HEAD> ? The latter would allow other HTML
elements to be embedded within the content (which may or may not be good),
but I see no difference other than that. In fact, given that most document
authors can't even manage to get the <HEAD></HEAD> right, I am reluctant
to depend on it for parsing.

Maybe I should write a spider that traverses the web and abuses server
owners whenever it encounters an invalid HTML structure ;-)
Ahh, if only I had the time ......

....Roy Fielding ICS Grad Student, University of California, Irvine USA
([email protected])
<A HREF="http://www.ics.uci.edu/dir/grad/Software/fielding">About Roy</A>