HTML DTD issues

Dan Connolly ([email protected])
Thu, 19 Nov 92 04:37:23 CST


The thrust to register HTML with the authorities has
spurred me to look over the DTD again. I've found
some problems.

1. Currently the NAME attribute of an anchor is declared
as CDATA, i.e. just about anything. There's an SGML thingy
called an ID. SGML parsers enforce uniqueness among the IDs
of a document. Seems like that's what we want for anchor names.

But an SGML ID has to start with a letter. So all the
HTML files that use numbers as anchor names will break.
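
To get a feel for how many files would break, something like the
following could be run over existing documents. It's only a rough
sketch: the function name is made up, the regular expression is a
crude stand-in for a real SGML parser, and it assumes the reference
concrete syntax (a name starts with a letter, continues with letters,
digits, '.' or '-', and is case-folded). It ignores any name length
limit.

```python
import re

# Crude match for the NAME attribute value in an <A ...> start tag
# (double-quoted, single-quoted, or bare). Not a real SGML parser.
ANCHOR_NAME = re.compile(
    r"<a\b[^>]*?\bname\s*=\s*(?:\"([^\"]*)\"|'([^']*)'|([^\s>]+))",
    re.IGNORECASE,
)

# Assumption: reference concrete syntax, i.e. a letter followed by
# letters, digits, '.' or '-'.
SGML_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.\-]*\Z")


def check_anchor_names(html):
    """Return (invalid, duplicates): anchor names that could not be
    SGML IDs, and names that repeat an earlier one."""
    invalid, duplicates, seen = [], [], set()
    for m in ANCHOR_NAME.finditer(html):
        # Exactly one of the three groups matched; pick it.
        name = next(g for g in m.groups() if g is not None)
        if not SGML_NAME.match(name):
            invalid.append(name)
        key = name.upper()  # assumption: names are case-folded
        if key in seen:
            duplicates.append(name)
        seen.add(key)
    return invalid, duplicates
```

A numeric name such as <a name=2> lands in the first list; a second
anchor that differs from an earlier one only in case lands in the
second.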

2. I introduced two tag names when I drafted the DTD:
HTML contains the whole document. I defined it
so you can omit both the start and the end tags, so it's
inferred by SGML parsers. I don't think I can avoid some
top-level tag.
DOCUMENT contains most of the "body" -- all the
headings and paragraphs. I did this to avoid something
called mixed content, which causes complications.
I could rename this element as BODY, and introduce an
omissible HEADING tag to surround the TITLE, NEXTID,
and ISINDEX tags.

3. I stuck anchors in as an inclusion, meaning they
could be used just about anywhere. I thought stuff
like
<a name=foo><h1>Foo</h1></a>
was legal, but neither linemode nor the midas browser
groks it.

I'm editing the DTD to restrict anchors so that they
can contain only text strings.
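
Existing files that wrap other elements in an anchor would stop
validating under that restriction. A quick way to spot them might
look like this. Again just a sketch: the function name is invented,
and the regular expression only approximates what a parser would
see (it doesn't handle nested anchors or omitted end tags).

```python
import re

# Crude match for an anchor element and its content.
ANCHOR = re.compile(r"<a\b[^>]*>(.*?)</a\s*>", re.IGNORECASE | re.DOTALL)


def anchors_with_markup(html):
    """Return the content of every anchor that contains a tag,
    i.e. anything other than a plain text string."""
    return [m.group(1) for m in ANCHOR.finditer(html) if "<" in m.group(1)]
```

Given the heading example above, it reports the <h1>Foo</h1> content;
an anchor around plain text is left alone.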

4. The OL tag is disappearing. It's no longer documented
in the web, and it's not supported by MidasWWW. Should
I delete it from the DTD?

5. What about <HP1> thru <HP5>... should we include them?
I'd prefer <em>, <tt>, <cite>, a la TeX. Or we could
go with the O'Reilly/Hal DocBook tags:
<Emphasis>, <OopsChar>, <wordasword>, <CiteBook>, <Subscript>,
<Superscript>.

6. Any more thoughts on the BaseAddress tag?

7. The HTML tags documentation says Listing sections can contain
any ISO Latin 1 characters. The SGML standard mentions ISO 646,
i.e. ASCII, as the default, but the sgmls parser, the linemode
browser, and MidasWWW all seem to grok Latin 1 just fine.
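
For what it's worth, telling the two apart in a file is easy, since
ISO 646 is a 7-bit code and Latin 1 uses the eighth bit. A minimal
sketch (the function name is hypothetical):

```python
def non_ascii_offsets(data):
    """Return the offsets of bytes outside the 7-bit ISO 646 (ASCII)
    range. An empty list means the data is plain ASCII."""
    return [i for i, b in enumerate(data) if b > 0x7F]
```

Any file for which this returns a non-empty list relies on the
parser or browser accepting 8-bit characters.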

Dan