Last time, i18n options

Jan Hardenbergh ([email protected])
Mon, 24 Apr 95 16:25:00 E


Option #1 - say we use the ASCII character set. (c_set is a loose term!)
Option #2 - say we use the ISO 8859/1 character set.
Option #3 - say we use ISO 10646 (looks like the way HTML is going)

I've never seen ANY subject glaze eyes like i18n. However, it is REALLY
important. And, I'm not the best person to be proposing stuff since my
knowledge predates Unicode & ISO 10646. However, I've done a little
bit of homework and found these messages in the html-wg archives:

HTML Character Representation/Transmission - Model Glenn Adams
http://www.acl.lanl.gov/HTML_WG/html-wg-95q1.messages/0907.html

Some other interesting comments:
http://www.acl.lanl.gov/HTML_WG/html-wg-95q1.messages/0917.html

Dan Connolly, shepherding the (IETF) HTML working group, more or less agrees.
http://www.acl.lanl.gov/HTML_WG/html-wg-95q1.messages/0942.html
> It is also consistent with the proposal that everybody use Unicode.
> Using Unicode is a sufficient, but not necessary mechanism. In
> the MIME-SGML world, I don't believe "Everyone must use Unicode"
> is an acceptable solution. For HTML, it appears to be.

Unicode and ISO 10646 are the same thing in at least some dimensions;
however, I think Unicode is encoded in two bytes and ISO 10646 thinks
of everything as integers (no bytes needed) [my working assumption].

What would it mean to say we use ISO 10646? Well, ISO Latin 1 is a subset
(and ASCII is a subset of that). So, any characters in these character
sets are OK, but they are considered shorthand for the integers (the
code points) they represent. So, "A" is shorthand for &#65; - an integer
that fits in a byte. Any characters with code points greater than 255 must
be represented as integers with "&#" in front of them. (&#946; is the
Greek small beta.)
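
To make the "&#" idea concrete, here is a minimal sketch in Python. The
function name to_char_refs and the max_literal parameter are my own
inventions for illustration, not anything from the HTML drafts; the sketch
just assumes that characters up to 255 (Latin 1) pass through as-is and
everything above is written as a decimal reference.

```python
def to_char_refs(text, max_literal=0xFF):
    """Replace each character whose code point exceeds max_literal
    with a decimal numeric character reference of the form &#N;."""
    out = []
    for ch in text:
        cp = ord(ch)  # the ISO 10646 / Unicode code point as an integer
        if cp <= max_literal:
            out.append(ch)              # fits in one Latin 1 byte, keep it
        else:
            out.append("&#%d;" % cp)    # spell the integer out
    return "".join(out)


# "A" and e-acute are in Latin 1; the Greek small beta is not.
sample = "A\u00e9\u03b2"
print([ord(ch) for ch in sample])
print(to_char_refs(sample))
```

Setting max_literal to 127 instead would give the option #1 flavor, where
only ASCII goes through literally.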

Of course, that will get boring someday. People will want to have a 2-byte
encoding to save on space, but that raises byte swapping and various other
issues that we are blissfully ignorant of in our "ASCII" files.
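
A small sketch of the byte swapping issue, using Python's UTF-16 codecs as
a stand-in for whatever 2-byte encoding might be chosen (that choice is my
assumption, not part of any proposal here). The same two characters come
out as different byte sequences depending on byte order, and reading them
back with the wrong order silently gives different characters.

```python
text = "A\u03b2"  # "A" followed by the Greek small beta

big = text.encode("utf-16-be")     # most significant byte first
little = text.encode("utf-16-le")  # least significant byte first

print(big.hex())
print(little.hex())

# Reading big-endian bytes as little-endian does not fail here;
# it just yields the wrong characters.
misread = big.decode("utf-16-le")
print(misread == text)
```

With one byte per character there is nothing to swap, which is why plain
"ASCII" files never run into this.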

More complicated options exist: ISO-2022/EUC, being able to switch
character sets in midstring, having 8859/7 (Greek) as the default.
If you think any of these options will be "right", sticking to option #1
now is the safest, since ASCII is a subset of most of these schemes.
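
A quick way to check the subset claim is to encode the same ASCII-only
string under several schemes and compare the bytes. This is only a sketch:
the list of codecs is my own pick from what Python ships (the helper name
encodings_of is hypothetical), and it says nothing about schemes that are
not ASCII-compatible.

```python
# Codec names as spelled in Python's standard codecs registry.
CODECS = ["ascii", "latin-1", "iso8859_7", "euc_jp", "iso2022_jp", "utf-8"]


def encodings_of(text, codecs=CODECS):
    """Encode text under each codec and return {codec name: bytes}."""
    return {name: text.encode(name) for name in codecs}


sample = "Option #1"
results = encodings_of(sample)
for name, raw in results.items():
    print(name, raw)

# True when every codec produced exactly the same byte sequence.
all_same = len(set(results.values())) == 1
print(all_same)
```

A file written under option #1 would therefore stay valid if any of these
schemes were adopted later.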

For us Yanks, all of the above options are the same. For Western Europeans,
option #2 is better. For people wanting to put in kanji, dingbats, Greek,
or whatever, option #3 means they can do it now - it may not be pretty,
but it is possible.

This makes very little difference in what we will do, but what we say we
are doing will have ramifications for a long time.

I get the feeling no one cares about this and I am boring the list to
tears. I'm sorry; I'm done now.

YON, [email protected], Jan C. Hardenbergh, Oki Advanced Products 508-460-8655
http://www.oki.com/people/jch/ =|= 100 Nickerson Rd. Marlborough, MA 01776
Imagination is more important than knowledge - Albert Einstein (1879-1955)