A few notes on text

Gavin Nicol ([email protected])
Mon, 5 Jun 1995 08:47:13 -0400


Text in VRML

There has been considerable discussion recently regarding an appropriate representation of text in VRML. While it is quite obvious that text can be represented as the polygons making up the glyph images, it is equally obvious that some way of representing text directly as strings will result in significant performance benefits.

This document discusses the representation of text within VRML, and presents some tentative designs for font specification and formatting specification nodes.


Approaches to Multilingual Text in VRML

There are two basic approaches to including text in VRML: one is to allow each text node to have a different character set and encoding, and another is to require each node to be of the same character set and encoding. Each of these is discussed below.


Single Character Set and Encoding

In some ways, this is the simplest case: all text in a "document" (for want of a better word) uses the same character set and encoding. This simplifies text node design, but it has a number of problems; in fact, they are almost exactly those facing HTML. The three major problems are:

Character set and encoding specification
In order to parse a "document", the character set and encoding must be known. The logical place to specify such things is in the document itself, which is unparseable without such information. The MIME charset parameter can be used to specify the encoding, but not the character set. For protocols that do not use MIME, other means are needed.
Character set and encoding explosion
Given that arbitrary character sets and encodings need to be allowed, either all parsers must be able to understand all character sets and encodings, or there will be data available that few parsers can handle.
Character set size
It is often desirable to include multiple languages in a single document. Most character sets do not include a large number of languages.

These three problems combined lead one to look for a simplification, and the perfect solution seems to be to use ISO 10646/Unicode, particularly in one of its ASCII-compatible encodings (UTF-8 being the best).
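
As a rough illustration of what "ASCII-compatible" means for UTF-8, the following Python sketch encodes a few strings and compares character counts with octet counts. The sample strings and the helper name `utf8_octets` are assumptions made for this example only; they do not come from the proposal.

```python
# Illustrative samples only (not from the original post).
samples = {
    "en": "VRML",                  # plain ASCII
    "fr": "caf\u00e9",             # "cafe" with an acute accent on the e
    "ja": "\u65e5\u672c\u8a9e",    # three kanji ("Japanese language")
}

def utf8_octets(text: str) -> bytes:
    """Encode a string of ISO 10646/Unicode characters as UTF-8 octets."""
    return text.encode("utf-8")

for lang, text in samples.items():
    octets = utf8_octets(text)
    print(lang, len(text), "characters ->", len(octets), "octets")

# ASCII text comes out octet-for-octet identical under UTF-8.
assert utf8_octets("VRML") == "VRML".encode("ascii")
```

Note that the number of characters and the number of octets differ as soon as non-ASCII characters appear, so a format built on UTF-8 has to be clear about which of the two a length refers to.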

In fact, this is a very elegant solution, and is basically what I proposed for HTML in December of last year. However, this solution has some major problems:

  1. Unicode has some very strong opponents, particularly in Japan. Within that group, the networking folk tend to be the most vocal, because the networking in Japan is based on EUC, SJIS, and ISO-2022-JP, and changing the infrastructure involves considerable cost (in terms of time and money).
  2. Many of the aforementioned arguments focus on the unification of characters within Unicode. (Unicode unifies certain characters that occur in various Asian languages, including Chinese, Japanese, and Korean; hence this is referred to as "the CJK Unification Problem".) The basic argument is that raw Unicode cannot differentiate the glyph images for the unified characters. This is quite true: some supplementary information is required, though the unified characters are (generally) associated with glyph images that are close enough to each other to be legible in whatever the language happens to be.

    For VRML to use Unicode, designing some format for encoding such information would be necessary.

  3. In addition, many of the opponents argue from emotion: no matter how many times one tries to explain the concepts and methodologies of Unicode, they simply say "I don't agree, because I do not think these characters to be the same." In other words, reasoning is impossible.

Taken together, these problems effectively require that any multilingual text representation, and especially one for Internet data types, be able to support multiple character sets and encodings. A sad fact of life.

It is the author's opinion that Unicode will eventually become widely adopted. VRML seems ideally suited to solving one major issue: that of fonts. The number of characters in a working set for multilingual text tends to be small. Given a shared repository of glyph images represented as polygons, it seems perfectly feasible for a viewer to use the local fonts for most of the text nodes encountered, and to fetch (and cache) glyph images as necessary. The same concept has been proposed for Unicode use in HTML by Larry Masinter.
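
The fetch-and-cache idea can be sketched as follows. This is only a minimal illustration under stated assumptions: the class `GlyphCache`, the polygon representation, and the `fake_repository` function (a stand-in for a network fetch from the shared repository) are all hypothetical and are not part of any VRML specification.

```python
from typing import Callable, Dict, List, Tuple

# A glyph is modelled here as a list of polygons, each a list of (x, y) points.
Polygon = List[Tuple[float, float]]
Glyph = List[Polygon]

class GlyphCache:
    """Resolve glyphs from local fonts first, then from a shared repository."""

    def __init__(self, local_glyphs: Dict[int, Glyph],
                 fetch: Callable[[int], Glyph]) -> None:
        self.local_glyphs = local_glyphs    # glyphs available from local fonts
        self.fetch = fetch                  # hypothetical repository lookup
        self.cache: Dict[int, Glyph] = {}   # glyphs fetched so far
        self.fetch_count = 0                # number of repository fetches made

    def glyph_for(self, code: int) -> Glyph:
        if code in self.local_glyphs:
            return self.local_glyphs[code]
        if code not in self.cache:
            self.cache[code] = self.fetch(code)
            self.fetch_count += 1
        return self.cache[code]

def fake_repository(code: int) -> Glyph:
    # Stand-in for a network fetch: returns a unit square for every code.
    return [[(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]]

# Pretend the local fonts cover only the letters V, R, M and L.
local = {ord(c): [[(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]] for c in "VRML"}
cache = GlyphCache(local, fake_repository)

# Two distinct kanji, each used twice: only two fetches should be needed.
for ch in "VRML\u65e5\u672c\u65e5\u672c":
    cache.glyph_for(ord(ch))
print(cache.fetch_count)
```

Because the working set is small, the cache stays small as well; a real viewer would also need an eviction policy and a way to address glyphs in the repository, neither of which is addressed here.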


Per-node Character Set and Encoding

Below is a proposal for a basic representation of text within VRML. The key features of the proposal are that it is extensible, and that it should allow all text data to be processed on equal terms, regardless of language, character set, or encoding.

Note: The following proposal will require that the MIME type for VRML allow 8 bit data. This requirement can be removed if some encoding for 8 bit data is defined.


The Text Node type

The most fundamental node type is the Text node type. Its basic role is to serve as a container for a string of characters. It is defined as:

Text {
    SFString     coded_character_set
    SFString     encoding
    SFString     language
    SFInteger    string_length
    MFOctet      data
}

Field Descriptions

coded_character_set
This field specifies the coded character set for the textual data. A coded character set defines a mapping from integers to characters (i.e., how to identify a character using an integer code).
encoding
This field defines how the application should map from a bit stream to a stream of codes (integers).
language
This field indicates the language. The values should be those found in ISO 639 (and also used in HTML 3.0 LANG tags).
string_length
This field indicates the number of characters in the string. The primary intended use is to aid in calculating display space requirements, and for verifying data content.
data
The series of octets making up the data.
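
To show how the fields fit together, here is a minimal Python sketch that decodes the `data` octets according to `encoding` and uses `string_length` to verify the result. The proposal does not define the permitted field values, so the `CODECS` table, the value strings such as "UTF-8" and "EUC-JP", and the function name `decode_text` are assumptions for illustration. The sketch also assumes that one "character" corresponds to one Python code point.

```python
from dataclasses import dataclass

@dataclass
class Text:
    coded_character_set: str
    encoding: str
    language: str
    string_length: int   # number of characters, not octets
    data: bytes          # the raw octets

# Hypothetical mapping from `encoding` field values to Python codec names.
CODECS = {"UTF-8": "utf-8", "EUC-JP": "euc_jp", "ISO-8859-1": "latin-1"}

def decode_text(node: Text) -> str:
    """Turn the octets into characters and check them against string_length."""
    codec = CODECS.get(node.encoding)
    if codec is None:
        raise LookupError("unsupported encoding: " + node.encoding)
    text = node.data.decode(codec)
    if len(text) != node.string_length:
        raise ValueError(
            f"string_length is {node.string_length} "
            f"but decoded {len(text)} characters"
        )
    return text

kanji = "\u65e5\u672c\u8a9e"   # three characters
utf8_node = Text("ISO-10646", "UTF-8", "ja", 3, kanji.encode("utf-8"))
euc_node = Text("JIS-X-0208", "EUC-JP", "ja", 3, kanji.encode("euc_jp"))

# Same three characters, different octet counts depending on the encoding.
print(len(utf8_node.data), len(euc_node.data))
assert decode_text(utf8_node) == decode_text(euc_node)
```

The example also shows why `string_length` is useful for layout: the octet count alone says little about how many glyphs need to be placed.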

The definition contains a new data type MFOctet, which should probably have a corresponding SFOctet type. These would be defined as:

SFOctet
A single octet, enclosed in quotes. The parser should skip whitespace to the leading quote, read the next octet, and then check for a trailing quote. The quotes are not part of the data. For example:
                foo "A"
          
MFOctet
A data type holding a sequence of octets. This is represented by an opening bracket, followed by ignored whitespace, followed by a number representing the number of octets. The number is followed by ignored whitespace, a quote-enclosed string of octets, followed by ignored whitespace, and a closing bracket. The quotes do not form part of the data stream. For example:
          foo [ 4 "abAB" ]
          

Such a node has the advantage of allowing any representation of text at all, but still remaining parseable by systems unable to handle the character set and encoding used.
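
The MFOctet syntax described above can be read without any knowledge of the character set or encoding, because the leading count says exactly how many octets to consume. A minimal parser sketch in Python follows; the function name `parse_mfoctet` and its return convention are assumptions made for this example.

```python
def parse_mfoctet(buf: bytes, pos: int = 0):
    """Parse an MFOctet value starting at buf[pos].

    Returns (octets, next_pos), where next_pos is the offset just past
    the closing bracket.
    """

    def skip_ws(i: int) -> int:
        while i < len(buf) and buf[i:i + 1].isspace():
            i += 1
        return i

    def expect(i: int, ch: bytes) -> int:
        if buf[i:i + 1] != ch:
            raise ValueError(f"expected {ch!r} at offset {i}")
        return i + 1

    i = expect(skip_ws(pos), b"[")
    i = skip_ws(i)

    # Read the decimal octet count.
    start = i
    while i < len(buf) and buf[i:i + 1].isdigit():
        i += 1
    if i == start:
        raise ValueError(f"expected octet count at offset {i}")
    count = int(buf[start:i])

    # Read exactly `count` octets between the quotes, whatever they are.
    i = expect(skip_ws(i), b'"')
    octets = buf[i:i + count]
    if len(octets) != count:
        raise ValueError("data is shorter than the declared count")
    i = expect(i + count, b'"')

    i = expect(skip_ws(i), b"]")
    return octets, i

print(parse_mfoctet(b'[ 4 "abAB" ]'))
```

Since the parser counts octets rather than scanning for the closing quote, a quote character or any other octet value can appear inside the data without escaping.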

It is important to note that this node does nothing more than hold a string of text. All formatting information should be stored elsewhere. Using Transform and whatnot should allow interesting effects, even with text.


Tentative Formatting Specification

The following are a few notes and tentative proposals for a few text-related node types. At some point in the future, all of the following will need to be specified for VRML, as will a Font node type (for fonts represented as polygons).

Font Node

The following node is for specifying font type information. It should be representable using various font technologies. The fields should be self-explanatory.

FontSpecification {
    SFString     family 
    SFEnum       weight       # VERYLIGHT, LIGHT, MEDIUM, DEMIBOLD, BOLD
    SFEnum       slant        # ROMAN, ITALICS, OBLIQUE
    SFInteger    point_size
}

Format Specification Node

The following node represents a basic formatting specification node. Basically, it allows the width, height, justification, and colors of the text node to be set.

FormatSpec {
    SFInteger    width
    SFInteger    height
    SFColor      background
    SFColor      foreground
    SFEnum       justification  # FILL_LEADING, FILL_BOTH, FILL_TRAILING
}

The justification field should be independent of language.

Usage

It is envisaged that the nodes be used in the following manner:

TextNode {
    FormatSpec {
         ....
    }
    Text {
         ....
    }
    FontSpecification {
         ....
    }
}
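
For concreteness, the following Python sketch generates one filled-in instance of this structure. Every field value in it (sizes, colors, the font family, the character set and encoding names) is a hypothetical placeholder chosen for the example, and the "field value" layout inside the braces is assumed from VRML 1.0 conventions; the proposal itself leaves these details open.

```python
def mfoctet(data: bytes) -> str:
    """Format octets using the MFOctet syntax: [ count "octets" ]."""
    # latin-1 maps each octet to exactly one character, so nothing is lost.
    return "[ " + str(len(data)) + ' "' + data.decode("latin-1") + '" ]'

def format_node(name, fields=(), children=(), indent=0):
    """Render a node as text: its fields first, then its child nodes."""
    pad = " " * indent
    lines = [pad + name + " {"]
    for key, value in fields:
        lines.append(pad + "    " + key + " " + value)
    for child in children:
        lines.append(format_node(*child, indent=indent + 4))
    lines.append(pad + "}")
    return "\n".join(lines)

payload = b"Hello"   # hypothetical text content

scene = format_node("TextNode", children=[
    ("FormatSpec", [
        ("width", "200"),
        ("height", "40"),
        ("background", "0 0 0"),
        ("foreground", "1 1 1"),
        ("justification", "FILL_LEADING"),
    ]),
    ("Text", [
        ("coded_character_set", '"ISO-8859-1"'),
        ("encoding", '"ISO-8859-1"'),
        ("language", '"en"'),
        ("string_length", str(len(payload))),
        ("data", mfoctet(payload)),
    ]),
    ("FontSpecification", [
        ("family", '"Helvetica"'),
        ("weight", "MEDIUM"),
        ("slant", "ROMAN"),
        ("point_size", "12"),
    ]),
])
print(scene)
```

Here `string_length` equals the octet count only because the sample payload is plain ASCII; for multi-octet encodings the two values would differ.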

Appendix A. ISO 639 codes

Technical contents of ISO 639:1988 (E/F)
"Code for the representation of names of languages".
Typed by [email protected] 1990-11-30
Two-letter lower-case symbols are used.
The Registration Authority for ISO 639 is Infoterm, Österreichisches
Normungsinstitut (ON), Postfach 130, A-1021 Vienna, Austria.
 
aa Afar
ab Abkhazian
af Afrikaans
am Amharic
ar Arabic
as Assamese
ay Aymara
az Azerbaijani
 
ba Bashkir
be Byelorussian
bg Bulgarian
bh Bihari
bi Bislama
bn Bengali; Bangla
bo Tibetan
br Breton
 
ca Catalan
co Corsican
cs Czech
cy Welsh
 
da Danish
de German
dz Bhutani
 
el Greek
en English
eo Esperanto
es Spanish
et Estonian
eu Basque
 
fa Persian
fi Finnish
fj Fiji
fo Faeroese
fr French
fy Frisian
 
ga Irish
gd Scots Gaelic
gl Galician
gn Guarani
gu Gujarati
 
ha Hausa
hi Hindi
hr Croatian
hu Hungarian
hy Armenian
 
ia Interlingua
ie Interlingue
ik Inupiak
in Indonesian
is Icelandic
it Italian
iw Hebrew
 
ja Japanese
ji Yiddish
jw Javanese
 
ka Georgian
kk Kazakh
kl Greenlandic
km Cambodian
kn Kannada
ko Korean
ks Kashmiri
ku Kurdish
ky Kirghiz
 
la Latin
ln Lingala
lo Laothian
lt Lithuanian
lv Latvian, Lettish
 
mg Malagasy
mi Maori
mk Macedonian
ml Malayalam
mn Mongolian
mo Moldavian
mr Marathi
ms Malay
mt Maltese
my Burmese
 
na Nauru
ne Nepali
nl Dutch
no Norwegian
 
oc Occitan
om (Afan) Oromo
or Oriya
 
pa Punjabi
pl Polish
ps Pashto, Pushto
pt Portuguese
 
qu Quechua
 
rm Rhaeto-Romance
rn Kirundi
ro Romanian
ru Russian
rw Kinyarwanda
 
sa Sanskrit
sd Sindhi
sg Sangho
sh Serbo-Croatian
si Singhalese
sk Slovak
sl Slovenian
sm Samoan
sn Shona
so Somali
sq Albanian
sr Serbian
ss Siswati
st Sesotho
su Sundanese
sv Swedish
sw Swahili
 
ta Tamil
te Telugu
tg Tajik
th Thai
ti Tigrinya
tk Turkmen
tl Tagalog
tn Setswana
to Tonga
tr Turkish
ts Tsonga
tt Tatar
tw Twi
 
uk Ukrainian
ur Urdu
uz Uzbek
 
vi Vietnamese
vo Volapuk
 
wo Wolof
 
xh Xhosa
 
yo Yoruba
 
zh Chinese
zu Zulu