I don't think we should be making assumptions about the compression
methods our users have in place. By using a gzip-based encoding, we get
about the best compromise between time overhead and compression efficiency
we're going to manage without resorting to data-knowledgeable methods
(typically, encodings which elide information known to be null). Bear in
mind, DKMs often don't give that much better a result anyway, and are
frequently worse than a good general adaptive algorithm like
Lempel-Ziv-Welch.
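To make that concrete, here's a rough Python sketch (the sample markup is
invented, purely for illustration) of how far a general-purpose compressor
gets on exactly the kind of repetitive, null-heavy structure a DKM would be
hand-tuned to strip out:

import gzip

# Invented, repetitive markup standing in for the sort of null-heavy
# structure a data-knowledgeable encoding would be designed to elide.
payload = b"<node type='cube' colour='null' size='1 1 1' texture='null'/>\n" * 200

zipped = gzip.compress(payload, compresslevel=9)
print(len(payload), "bytes raw")
print(len(zipped), "bytes gzipped")
print("ratio: %.3f" % (len(zipped) / len(payload)))

The adaptive dictionary soaks up the repeated null attributes on its own,
without anyone having to specify which fields may be omitted.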
Yes, there will be a slight performance hit if a well-compressed file
is transmitted incrementally compressed, say by CSLIP or modem
encoding. This is because the file's first encoding is close to
entropy (the point where the data takes on a random character and
becomes impossible to compress further). The slight hit comes from the
second compression algorithm detecting that entropy and adding a little
extra data to indicate that the remainder is being passed uncompressed.
Most sensible compression techniques do this; otherwise data volume
can grow massively. It's not a big enough hit to worry about, and it
certainly shouldn't be a factor in deciding our method. If it were,
downloading zipped files (or any "noisy" files) of any significant size
would be infeasible for modem owners.
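If you want to see the size of the hit for yourself, here's a quick Python
sketch, using pseudo-random bytes to stand in for a file that has already
been well compressed once:

import gzip
import os

# Random bytes are close to entropy, like an already well-compressed file.
noisy = os.urandom(64 * 1024)

once = gzip.compress(noisy)     # compressor falls back to stored blocks
twice = gzip.compress(once)     # second pass adds a few bytes of framing, no blow-up

print(len(noisy), "bytes of near-entropy input")
print(len(once), "bytes after one gzip pass")
print(len(twice), "bytes after a second pass")

The growth per pass is a handful of bytes of headers and block framing, not
a runaway expansion.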
Let's go with zipping. It's a low-impact addition to the spec, it uses known
and trusted technology, and it doesn't need weeks or months spent deciding
on the exact tokenisations.
Another demerit for token use is that it creates a rolling overhead on
specification upgrades - every time a new token is introduced, its
"magic number" must be published, along with the details on how the new
node is to be encoded. Furthermore, unless a tokenising specification
is well-written, it can cause problems with parsers which do not know
how to handle additions they have not been coded for. Extra
information must be embedded in the data to inform unknowing parsers
how to ignore the token. For example, if you know a keyword takes four
two-byte parameters, you don't need to state the fact. If you don't
recognise the token, however, you'll need a hint to skip the next eight
bytes before resuming scanning. This extra information reduces the
efficiency of the compression you're trying to achieve. On the other
hand, an adaptive compression algorithm can all but eliminate the
redundancy's effect on data sizes without losing the information, and
without needing tortuous specification extensions.
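For illustration only, here's roughly what that skip-hint machinery ends up
looking like; the token ids and record layout below are made up, not taken
from any spec:

import struct

# Hypothetical token stream: a 1-byte token id, a 2-byte big-endian payload
# length, then the payload.  The length field is the "hint" an unknowing
# parser needs in order to skip tokens it wasn't coded for.
KNOWN = {0x01: "cube", 0x02: "sphere"}

def scan(stream):
    offset = 0
    while offset + 3 <= len(stream):
        token, length = struct.unpack_from(">BH", stream, offset)
        offset += 3 + length
        if token in KNOWN:
            print("handled %s (%d byte payload)" % (KNOWN[token], length))
        else:
            print("skipped unknown token 0x%02x (%d byte payload)" % (token, length))

# A known cube (eight bytes of parameters) followed by a token we've never heard of.
scan(struct.pack(">BH", 0x01, 8) + b"\x00" * 8 +
     struct.pack(">BH", 0x7F, 4) + b"\xff" * 4)

Every one of those length fields is data carried solely for the benefit of
parsers that don't recognise the token - which is the redundancy a general
adaptive compressor would have squeezed out for free.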
Damn - this asbestos suit's too stiff, I can't reach the zipper to
fasten it :-)
--
[email protected]            Hyphen home page: http://www.hyphen.com/
[email protected]            And mine: http://www.hyphen.com/html/jonsg/
PGP key available on request   Opinions here are mine, not Hyphen's
The Greeks Had A Word For It, no. 12: "Canape", the love of small snacks