I think compression schemes like this have the benefit of small size with
fast parsing. But if parsing speed is not an issue, using gzip gives
excellent results and works without having to define a new format.
If parsing speed is an issue, then any scheme we develop should store the
size of each node along with the node token. We learned from the Inventor
binary format that knowing the size helps tremendously. The CGM (Computer
Graphics Metafile) spec could help us out here. It has a nice token
definition that goes something like this (all from memory, so probably
inaccurate):
- A token is 16 bits.
- The first 3 bits are a class type (drawing primitive, control, etc.)
- The next 8 bits are a token id within the class.
- The next 5 bits are a length.
- If the length field is 0-30, that's the number of words that follow.
- If the length field is 31, read the next word to get the length.
For the "long" format (where the length is in the next word) there's a
provision for concatenating length words to get really long lengths.
The idea is to compact information based on frequency (in CGM, lots of
primitives were less than 30 words long). It's similar in spirit to a
Huffman code, but easier to parse. CGM also has a similar concept for
representing numeric quantities as short integers, fixed-point numbers,
or floats.
There may be other formats dealing with similar problems that we could
leverage.
-- chris marrin
Silicon Graphics, Inc.
http://www.sgi.com/Products/WebFORCE/WebSpace
http://reality.sgi.com/employees/cmarrin/
[email protected]
(415) 390-5367

"As a general rule, don't solve puzzles that open portals to Hell."
  - excerpt from "A Horror Movie Character's Survival Guide"