Re: SGML and HTML

Daniel W. Connolly ([email protected])
Fri, 09 Dec 1994 11:28:20 -0600


In message <[email protected]>, Brian Farrar writes:
>
>> I'm certain there must be an FAQ answer someplace that succinctly describes
>> the differences and similarities of SGML and HTML. Any pointers from anyone?

OK... I'll bite... this should go in an HTML FAQ somewhere...
maybe it already is...

It's not a matter of differences and similarities, the way I see it:

HTML is an application of SGML, the way LaTeX is an application of
TeX, or the way the MS macro set is an application of troff, or the
way differential equations are an application of set theory.

Some folks have said HTML is a subset of SGML. You could look at it
that way: the set of HTML documents is a subset of the set of SGML
documents.

Each SGML document has three parts: an SGML declaration, a prologue,
and an instance. The prologue is often called the DTD, and for the
sake of this discussion, we'll let that slide.

The DTD specifies a document type. Part of the specification of a
document type is a sort of grammar that gives the order and occurrence
of the elements; e.g. "A Book shall consist of a preface and one or
more chapters."

The instance must _conform_ to the DTD. This business of conformance
can be checked by machine. (This is probably the handiest feature of
SGML over something like troff or TeX).
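
To make the "checked by machine" part concrete, here is a minimal sketch
in Python. It is not how a real SGML parser works: the BOOK_MODEL name
and the trick of matching a regular expression against the
space-separated child element names are my own assumptions, used only to
show that a content model like the Book example is mechanically
checkable.

```python
import re

# Hypothetical content model for a toy "book" document type:
# a preface followed by one or more chapters.
BOOK_MODEL = re.compile(r"preface( chapter)+")

def conforms(children):
    """Return True if the sequence of child element names matches the model."""
    return BOOK_MODEL.fullmatch(" ".join(children)) is not None

print(conforms(["preface", "chapter", "chapter"]))  # True
print(conforms(["chapter", "preface"]))             # False
print(conforms(["preface"]))                        # False
```

A real DTD content model also has to handle nesting, omitted tags, and
inclusions/exclusions, so treat this as an illustration of the idea only.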

So the set of SGML documents looks like

{ (decl, dtd, instance) : decl is an SGML declaration and
dtd is an SGML DTD and
instance is an SGML instance and
instance conforms to dtd }

For HTML, the decl and DTD are fixed; so the set of HTML documents
looks like:

{ (html-decl, html-dtd, instance) : html-decl is the HTML SGML decl and
html-dtd is the HTML DTD and
instance is an SGML instance and
instance conforms to html-dtd }

OK... so much for theory.

In practice, popular software that deals with HTML (e.g. NCSA Mosaic)
doesn't support all the features of SGML. There are a few obscure bugs
here and there, and there are a few major omissions.

*** Entity management: Most of the omissions relate to the fact that
SGML in general allows a prologue to have more than just a DTD, and it
allows a document to consist of more than one entity (think of an
entity as a file for now). You can sort of "customize" the DTD on a
per-document basis. So while popular HTML software will only deal with
this prologue:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">

(and this is a happy coincidence: they deal with it by ignoring it.)

a conforming SGML parser will let you write:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" [
<!entity buyer "Widget Co.">
<!entity seller "Gadget Co.">
<!entity agreement SYSTEM "agreement.html">
]>
&agreement;

where agreement.html looks something like:

<title>Agreement between &buyer; and &seller;</title>

<h1>Terms and Conditions</h1>

<ol>
<li>&buyer; agrees not to shoot &seller;.
<li>&seller; agrees not to shoot &buyer;.
<li>&buyer; agrees to give all their money to &seller;.
<li>&seller; agrees to give &buyer; some stuff.
</ol>
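
What the parser does with those entity references can be sketched in a
few lines of Python. This is only an approximation under some
assumptions of mine: the ENTITIES table and the expand function are
hypothetical names, the SYSTEM entity is inlined as a string instead of
being read from agreement.html, and the depth limit is a crude stand-in
for proper detection of recursive references.

```python
import re

# General entities from the example prologue. The SYSTEM entity is inlined
# as a string here; a real parser would read agreement.html from disk.
ENTITIES = {
    "buyer": "Widget Co.",
    "seller": "Gadget Co.",
    "agreement": "<title>Agreement between &buyer; and &seller;</title>",
}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")

def expand(text, entities, depth=0):
    """Replace &name; references, recursing into the replacement text."""
    if depth > 10:
        raise ValueError("entity references nested too deeply")

    def replace(match):
        name = match.group(1)
        if name not in entities:
            return match.group(0)  # leave unknown references untouched
        return expand(entities[name], entities, depth + 1)

    return ENTITY_REF.sub(replace, text)

print(expand("&agreement;", ENTITIES))
# <title>Agreement between Widget Co. and Gadget Co.</title>
```

Popular HTML software does none of this for entities you declare
yourself, which is why the example above only works with a real SGML
parser.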

*** Marked sections: a conforming SGML parser will deal with markup
like:

<![ IGNORE [ lksjdflkjs<tags> data whatever ]]>

and ignore it. You can also write:

<![ CDATA [ <tags>, <!-- junk, &foo;, blah ]]>

and everything between the []'s will be treated as regular data
characters: the string '<tags>' won't be treated as a tag at all.

Another use of marked sections is in combination with parameter
entities, kinda like #defines and #ifdefs in C:

The prologue for some SGML document might look like:

<!doctype foo PUBLIC "-//foo corp//DTD foo//EN" [
<!entity % in-house "IGNORE">
]>

Then, in the instance, you might see:

blah blah blah <![ %in-house; [ See Henry for details on how
this works here at foo corp. ]]>

All the in-house marked sections can be turned on and off by changing the
in-house entity declaration in the prologue. Some SGML parsers, namely
SGMLS, support a command-line switch for this, just like -D on a cc command.
So you could get all the in-house stuff with:

% sgmls -iin-house foo.sgm
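
For those who think better in code, here is a rough Python sketch of the
INCLUDE/IGNORE mechanism. The process function and the params dictionary
are names I made up for illustration; the sketch handles only
non-nested sections, passes CDATA and friends through untouched, and
raises KeyError for an undeclared parameter entity.

```python
import re

# Matches <![ KEYWORD [ content ]]> where KEYWORD may be a parameter
# entity reference such as %in-house; (non-nested sections only).
MARKED = re.compile(
    r"<!\[\s*(%?[A-Za-z][A-Za-z0-9.-]*;?)\s*\[(.*?)\]\]>", re.DOTALL
)

def process(text, params):
    """Resolve INCLUDE/IGNORE marked sections using parameter entities."""
    def replace(match):
        keyword, content = match.group(1), match.group(2)
        if keyword.startswith("%"):
            # Look up the parameter entity, e.g. %in-house; -> "IGNORE"
            keyword = params[keyword.strip("%;")]
        keyword = keyword.upper()
        if keyword == "IGNORE":
            return ""
        if keyword == "INCLUDE":
            return content
        return match.group(0)  # CDATA, RCDATA, TEMP: out of scope here
    return MARKED.sub(replace, text)

doc = "blah blah blah <![ %in-house; [ See Henry for details. ]]>"
print(repr(process(doc, {"in-house": "IGNORE"})))
print(repr(process(doc, {"in-house": "INCLUDE"})))
```

Flipping the one dictionary entry plays the same role as editing the
entity declaration in the prologue, or passing -i to sgmls.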

So popular HTML implementations are like C compilers that don't let
you use the C preprocessor, or like a LaTeX conversion program that
barfs if you define your own TeX macros.

That doesn't mean that you can't feed real live conforming SGML
documents to popular HTML implementations. The programs you gave to a
C compiler that didn't support cpp would still be valid C programs:
they'd just be painful to write.

Unfortunately, unlike this hypothetical C compiler, popular HTML
implementations also eat documents that are not valid SGML documents
at all.

First, they allow some kinds of syntax errors, like:

<a href=foo/bar/baz.html>

which should be:

<a href="foo/bar/baz.html">
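
The rule behind this, as I read the standard, is that the quotes may be
omitted only when the value consists entirely of name characters. A
small Python sketch, assuming the name characters of the reference
concrete syntax (letters, digits, period, hyphen); the function name is
mine:

```python
import re

# Name characters in SGML's reference concrete syntax: letters, digits,
# period and hyphen. Anything else forces the value to be quoted.
NAME_CHARS_ONLY = re.compile(r"[A-Za-z0-9.-]+")

def may_omit_quotes(value):
    """Return True if an attribute value may appear without quotes."""
    return NAME_CHARS_ONLY.fullmatch(value) is not None

print(may_omit_quotes("foo/bar/baz.html"))  # False: "/" is not a name character
print(may_omit_quotes("baz.html"))          # True
```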

Also, popular HTML implementations don't check the order and occurrence
of elements with respect to any particular DTD. There is a DTD for HTML
under discussion by the HTML Working Group of the IETF. See

http://www.hal.com/~connolly/html-spec/

for details.

There are certain markup idioms, like:

<dl>
<dt><h3> used H3 to get the font I like</h3>
<dd> some text
</dl>

that the current HTML DTD doesn't allow. The DTD for HTML _could_ be
constructed to allow such idioms, but I don't think that would be a
good idea, and most of the folks in the working group agree with me.

You might say "but that markup works fine on all the browsers I've
seen." My answer is that this is a happy coincidence, but no browser
should be _required_ to support that sort of thing -- and you
shouldn't _expect_ it to work with tools that may be developed in the
future.

In the future, we'd like folks to be able to build browsers that, for
example, display a table of contents of your document alongside the
main text window. If folks use H3 just for font changes, then a TOC
display would look silly.
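
Here is roughly what such a TOC builder might do, sketched in Python
with the standard html.parser module. The Outline class and outline
function are hypothetical, not taken from any existing browser; the
sketch just collects H1 through H6 and indents them by level.

```python
from html.parser import HTMLParser

class Outline(HTMLParser):
    """Collect (level, text) pairs for H1..H6 elements."""
    def __init__(self):
        super().__init__()
        self.entries = []
        self._level = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.entries.append((self._level, "".join(self._text).strip()))
            self._level = None

def outline(html):
    """Return the list of (level, heading text) found in the document."""
    parser = Outline()
    parser.feed(html)
    parser.close()
    return parser.entries

doc = """<h1>Terms and Conditions</h1>
<dl>
<dt><h3> used H3 to get the font I like</h3>
<dd> some text
</dl>"""

for level, text in outline(doc):
    print("  " * (level - 1) + text)
```

Run on the idiom above, the "font change" shows up as a third-level
section with no second level above it, which is exactly the silly TOC I
mean.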

So that's my take on the difference between SGML in general, HTML in
theory, and HTML in practice.

Dan