Re: The future of meta-indices/libraries

Stan Letovsky ([email protected])
Thu, 17 Mar 1994 19:24:40 --100


Webmeisters:

I have been thinking about the issue of indexing distributed
information services for some time, and I must say I find this past
week's discussion of the issue lacking in ambition, although no doubt
pragmatic in a low-level, head-in-the-bits sort of way. As an antidote
to this pragmatism I would like to offer a grand, impractical vision,
and some perhaps less unrealistic steps that could be taken in the
direction of that vision.

VISION:

"A place for everything and everything in its place."

Knowing my question, I should know where to look for the answer,
or in practice, I should be able to follow an algorithm that
will lead me to the answer in time proportional to the log of the
total size of the Web, or to a certain conclusion that the answer
is not in the Web. This type of capability could be called
fact-hashing.
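
As a toy sketch of what fact-hashing might feel like in practice,
consider the following (in Python; the topic tree and lookup routine
are invented for illustration, not any existing Web mechanism). A
query descends a topic hierarchy one level at a time, so the cost
grows with the depth of the tree rather than with the number of
documents in the Web:

    # Toy sketch of "fact-hashing": descend a topic hierarchy one level
    # at a time, so lookup cost grows roughly with the log of the total
    # index size, not with the number of documents. The tree is invented.
    topic_tree = {
        "science": {
            "biology": {
                "cell division": "Fact: eukaryotic cell division is regulated by cyclins.",
            },
            "physics": {},
        },
        "arts": {},
    }

    def lookup(tree, path):
        """Follow a known path of topic names; return the fact or None."""
        node = tree
        for step in path:
            if not isinstance(node, dict) or step not in node:
                return None          # certain conclusion: not in this index
            node = node[step]
        return node if isinstance(node, str) else None

    print(lookup(topic_tree, ["science", "biology", "cell division"]))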

This proposal emphasizes access to answers (facts), not documents.
Documents are what are currently served, and will probably persist for
the forseeable future, but because they package facts in arbitrary
ways they will always be part of the problem, an obstacle to random
fact-access, not the solution. In particular, documents need not have
any internal indexing structure, and there is no limit to redundancy
between documents, which means there are no canonical addresses for
facts (UFLs? :-). A large web may have a lot of information, or it may
have a small amount of information redundantly expressed. Redundancy
leads to index inflation -- each query gets many hits -- which makes
information harder to access, not easier. Better one totally relevant
hit than a thousand somewhat relevant hits which then have to be
searched again by hand.

Where can we look for models for a fact-hashing technology, let alone
a distributed one? The discussion of this past week seemed concerned
primarily with reproducing the (failed :-) library technologies of the
past in the Web: subject indexes, title-indexing, keyword-indexing of
documents, indexing of hand-assigned document keywords, etc. This
would make of the Web a distributed on-line library: as good as,
perhaps, but no better than, conventional electronically indexed
libraries. My challenges to the Webmasters of the world are these:
THAT IS NOT GOOD ENOUGH! and WE CAN DO BETTER!

The model I view as the starting point for a discussion of
super-indexing is the semantic network, popular in AI research for
representing conceptual information. In these representations the
distinction between indexing and content largely vanishes; the
fraction of the total knowledge devoted to indexing increases to the
point where most of the content of a fact is represented in its
address. The fact that Clyde is an elephant, to use an ancient and
trivial example, is represented simply by locating the Clyde node
under the elephant category in the network. Extensions of this idea
allow complex assertions to be made simply by adding small amounts of
new graph-structure to an existing (Web-like?) maze of pointers.
Crucial to this idea is that information is not added to the system
like water into a bucket; it is placed carefully in
finely-discriminated pigeonholes.
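
A minimal sketch of the idea, in Python with invented names (real
semantic-network systems are of course far richer), shows how storing
a fact and indexing it become the same operation:

    # Minimal semantic-network sketch: the fact "Clyde is an elephant"
    # is stored by linking the Clyde node under the elephant category,
    # so the fact's address in the graph carries most of its content.
    from collections import defaultdict

    isa_links = defaultdict(set)     # category -> set of members

    def assert_isa(instance, category):
        isa_links[category].add(instance)

    def members_of(category):
        return isa_links[category]

    assert_isa("Clyde", "elephant")
    assert_isa("elephant", "mammal")

    print("Clyde" in members_of("elephant"))   # True: the fact is its location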

Imagine that the card catalog of the Library of Congress extended
without any discontinuity into the indexes and tables of contents of
all the books; imagine further that the books were fragmented into
their component individual ideas, and that a great hand came along and
squeezed all the redundancy out of these ideas so that each existed in
a single canonical copy; then you have a picture of the kind of Web
index I would like to see. Another model is the CYC project at MCC,
which is attempting to encode large bodies of commonsense knowledge
in machine-usable semantic networks. Could we make a Web-CYC conceptual
index to all Web-accessible knowledge?

REALITY:

OK, so after decades of AI research we still really don't know how to
build semantic networks very well, and we know less about how to use
them as interactive indexes to encyclopedic knowledge bases (although
there is a fair amount of literature on this), and even less
about how to construct distributed networked versions of them
on the Web. Can we take a small step in this direction? Perhaps.
A small step, for me, would be a distributed topic-index that
allowed, at least in principle, for index nodes at arbitrarily
fine granularity.

A subject index is a coarse-grained partitioning of knowledge, along
the lines of <A
HREF="http://info.cern.ch/hypertext/DataSources/bySubject/Overview.html">WWW
Virtual Library</A>. One problem with this is that it bails out too
early into a list of servers: e.g. I go down to the biology level, and
then I get a list of biology servers, perhaps with some
subcategorization. Some of these servers have internal WAIS-indexing,
which helps, but I don't know of any subject-domains which have set up
domain-wide WAIS-indexing, so there is a gap between index-browsing
and index-searching which I must fill by hand, by iteratively
searching each potential source. If the sources are not WAIS-indexed
the situation is worse.

A topic-index, as I use the term (I am not sure whether the term has
an official meaning, or whether my usage matches it), is a recursive
subject index that goes down to very fine topic granularities. An
upper node might
be "biology", but 10 levels down we might have "regulation of cell
division in eukaryotes", and under that "known mutants" (by species),
"regulatory proteins", etc. At this level a topic index starts to look
like a micro-review article. Often the best subindex for a topic
takes the form of a diagram: a map of the US with states provides a
geographic subindex; an anatomical diagram, a flow-chart, a metabolic
diagram, etc. all provide topic breakdowns. Such diagrammatic subtopic
indexes could easily be embedded in today's Web, as could simpler
textual ones.
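
To make this concrete, here is a minimal sketch (in Python; the topic
names, URLs, and rendering are invented for illustration) of a
recursive topic index held as a nested structure and rendered into an
ordinary html topic document that a curator could serve:

    # Sketch: a fine-grained topic subtree rendered as an html list of
    # subtopics and document links. Names and URLs are invented.
    topics = {
        "regulation of cell division in eukaryotes": {
            "known mutants (by species)": ["http://example.org/mutants.html"],
            "regulatory proteins": ["http://example.org/cyclins.html"],
        },
    }

    def render(node, depth=0):
        """Render a topic subtree as an indented html list."""
        out = []
        indent = "  " * depth
        if isinstance(node, list):                  # leaf: document links
            out += [f'{indent}<LI><A HREF="{u}">{u}</A>' for u in node]
        else:                                       # interior: subtopics
            for name, sub in node.items():
                out.append(f"{indent}<LI>{name}")
                out.append(f"{indent}<UL>")
                out += render(sub, depth + 1)
                out.append(f"{indent}</UL>")
        return out

    print("\n".join(["<UL>"] + render(topics) + ["</UL>"]))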

How could topic indexes be maintained? There are sociological and
technological components to the answer. The sociological answer is
that every topic has a curator (group), and an associated network
host. The amount of responsibility associated with curation could vary
at the curator's discretion, depending on how much effort they wanted
to put into annotating the topic description with diagrams, review
text, etc. The resource commitment involves serving up a single,
probably not large, html document, and the associated CPU load, which
might be balanced by having duplicate servers for popular topics if
needed. Other curator responsibilities would include finding a
successor when you quit, if needed, and identifying opportunities to
split the topic into subtopics.

The other main task associated with a topic server is finding links to
documents and subtopics that should be included in the topic document.
This could be done in automatic or semiautomatic ways, such as by
hitchhiking on the indexing mechanisms discussed by earlier
respondents. For example, a document could include, or have associated
with it, a topic-list; a Web-robot encountering this list would notify
the appropriate topic server to add a pointer to the document. The
curators may want
to intervene to impose some discretionary judgement at this point, or,
if they don't want to expend the effort, they could let the process
run automatically. Certain rules might govern the behavior of the
indexing robots, e.g. documents are only indexed with the most
specific relevant subtopics; documents cannot be indexed in too many
topics (some idiot wants his picture accessible from all
subtopics...), some topics allow subtopics but not documents, etc. The
total effect would be something like internet bboards, but without the
temporal/sequential bias, and with dynamic reorganization, splitting
of subthreads, erasure of obsolete or irrelevant documents, etc.,
as well as ongoing local maintenance of documents by the owners. A
persistent nested bboard, if you like.
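
To make the flavor of such rules concrete, here is a small sketch (in
Python; the limits and the topic metadata are invented, not a proposal
for a real protocol) of how an indexing robot might filter a
document's declared topic list before notifying any topic servers:

    # Sketch of rules an indexing robot might enforce before asking a
    # topic server to add a pointer. Limits and metadata are invented.
    MAX_TOPICS_PER_DOCUMENT = 5

    def acceptable_topics(declared_topics, topic_metadata):
        """Filter a document's declared topics down to those it may be
        indexed under."""
        # Rule: refuse documents that try to appear under too many topics.
        if len(declared_topics) > MAX_TOPICS_PER_DOCUMENT:
            return []
        accepted = []
        for t in declared_topics:
            meta = topic_metadata.get(t, {})
            # Rule: some topics hold only subtopics, not documents.
            if not meta.get("accepts_documents", True):
                continue
            # Rule: index only under the most specific declared topics;
            # drop a topic if one of its descendants is also declared.
            if any(t in topic_metadata.get(o, {}).get("ancestors", [])
                   for o in declared_topics if o != t):
                continue
            accepted.append(t)
        return accepted

    meta = {
        "biology": {"ancestors": [], "accepts_documents": False},
        "regulation of cell division": {"ancestors": ["biology"]},
    }
    print(acceptable_topics(["biology", "regulation of cell division"], meta))
    # -> ['regulation of cell division']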

The astute reader should now ask, How is this different from simply
having keywords associated with documents, and having a single
Veronica/ALIWEB-style server gather all the keywords and document
addresses into a single database, which is then searched? Isn't this
actually better, since the user can define particular combinations of
keywords on the fly that may not have been anticipated as a single
topic? The answer is that the crucial flaw in that scheme is the lack
of any mechanism for coordinating keyword assignments. You say
tomato and I say tomatoe and our documents are not properly
correlated. Keyword systems, even those based on hand-assigned
keywords, are excessively optimistic about the correlation between
keywords and facts, as an earlier post on reliability statistics
of keyword search pointed out. To leave out of our indexing scheme
any system for coordination of keywords is to leave a hole big enough
for a bandersnatch to pass through. A topic-curator system would allow
this problem to be partitioned among a responsible community in a
nonburdensome manner.

The above objection does suggest an alternative approach to
topic indexing, however: that the role of a topic curator is to
maintain an annotated list of keywords to be used to index documents
in that topic area. That would impose some discipline on the indexing
that would improve the usefulness of the index database. If there is a
script that turns the keywords in the curators' (html) keyword lists
into live links that search the index database, you get back to
something very much like the first proposal, with a slightly different
implementation -- central storage of index info rather than
distributed.
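
A minimal sketch of such a script (in Python; the search-URL format
and the keywords are invented for illustration) would render each
approved keyword as a live link into the index database:

    # Sketch: turn a curator's keyword list into live links that query
    # a central index database. The URL format is invented.
    from urllib.parse import quote_plus

    SEARCH_URL = "http://example.org/index-db?keywords="

    def keyword_list_to_html(keywords):
        """Render each approved keyword as a link that searches the index."""
        items = [f'<LI><A HREF="{SEARCH_URL}{quote_plus(k)}">{k}</A>'
                 for k in keywords]
        return "<UL>\n" + "\n".join(items) + "\n</UL>"

    print(keyword_list_to_html(["cell division", "cyclin", "mitosis"]))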

I could go on (inverse citation index databases, topic-wide WAIS-indexes...)
but I suspect no one will make it this far...

Let's aim high, gang.

-Stan

http://cgsc.biology.yale.edu/stan.html