Below is a list of the problems I have identified and the solutions
I am proposing.
Problems identified:
A Often a document's content is outdated, and this could have been known at the
very moment the document was produced.
B Often a document is moved to another location, and the only way to learn of
the move is to read a warning message that the mover was kind enough to
leave, with luck including a link to the new location; this does
happen, but not very often.
Consequences:
A 1 bandwidth is wasted transferring the outdated document
2 human time is lost retrieving the document and discovering, often after
reading through part of it, that it is no longer valid/useful
3 the databases of documents will only grow, adding new information to the
old instead of replacing it
4 the database replies to searches will become (they already are)
increasingly unmanageable, whatever restrictive conditions one can
imagine applying
5 (second level consequence) the validity of the net for conveniently
retrieving reliable information can be questioned.
B 1 valuable human time is lost manually following broken links, jumping around
the net
2 database replies will keep giving incorrect document locations for a long
time
With some thinking, more bad effects could be added to this list, but I have
probably made the idea clear already.
Maybe (I'm not a Web techie; I write this based only on common sense)
the document aging problem has already been addressed by the various Web
crawlers, which repeatedly check for what has been thrown away and for what is
new, but I still think that such an approach is not the right solution.
Proposed solutions:
Preface
All the aging and location information I am writing about is information
about the document itself. Therefore the correct place for it is within the META
elements.
As a provider of Web pages, I am personally interested in fixing these problems,
or in finding out whether and how they have already been fixed.
A 1 The document has a single definite expiry date; its contents are useless
on any later date. I propose the following syntax:
<META NAME="EXPIRY" CONTENT="DD MMM YYYY">
With this information, document databases can perform the following
actions:
- when replying to a database search before the expiry date, display the
expiry date together with the relevant document info
- trash any document info after the expiry date
- avoid adding to the database any newly discovered document already
expired
Web browsers can highlight this information for the user (together
with the title?)
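
The three database actions above could be sketched in Python roughly as
follows. This is only an illustration under my own assumptions: the month is
an English three-letter abbreviation (e.g. "31 Dec 1996"), the document is
still valid on the expiry date itself, and the function names are made up for
the example rather than being part of the proposal.

```python
from datetime import date
from html.parser import HTMLParser

MONTHS = {m: i for i, m in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}


def parse_dd_mmm_yyyy(text):
    """Parse 'DD MMM YYYY'; return None if the text is missing or malformed."""
    parts = text.split()
    if len(parts) != 3:
        return None
    day, mon, year = parts
    try:
        return date(int(year), MONTHS[mon.upper()[:3]], int(day))
    except (KeyError, ValueError):
        return None


class MetaCollector(HTMLParser):
    """Collect <META NAME=... CONTENT=...> pairs, keyed by upper-cased NAME."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag == "meta":
            attrs = dict(attrs)
            name, content = attrs.get("name"), attrs.get("content")
            if name and content is not None:
                self.meta[name.upper()] = content


def read_meta(html):
    collector = MetaCollector()
    collector.feed(html)
    return collector.meta


def index_decision(html, today):
    """Decide what a document database might do with a fetched document."""
    expiry = parse_dd_mmm_yyyy(read_meta(html).get("EXPIRY", ""))
    if expiry is None:
        return "index"                      # no usable EXPIRY: index as usual
    if today > expiry:
        return "drop"                       # expired: trash it or never add it
    return "index, show expiry " + expiry.isoformat()


page = '<HTML><HEAD><META NAME="EXPIRY" CONTENT="31 Dec 1996"></HEAD></HTML>'
print(index_decision(page, date(1996, 12, 1)))   # index, show expiry 1996-12-31
print(index_decision(page, date(1997, 1, 1)))    # drop
```

A malformed or missing date is treated here as "no expiry information", so a
bad tag never causes a document to be thrown away.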
A 2 The document holds some information with a finite lifetime, but will likely
be updated at some later time (e.g. the program of a theatre).
I propose the following syntax:
<META NAME="NEXT_UPDATE" CONTENT="DD MMM YYYY">
With this information, document databases can perform the following
actions:
- when replying to a database search before the next update date, display
the next update date together with the relevant document info
- retrieve the document info again after the update date (e.g. the index
words might have changed, because Puccini has also been added to the
theatre program)
Web browsers can highlight this information for the user (together
with the title?)
Of course the writer is not forced to provide this info, but by using these META
tags he can be sure that the document info will be updated regularly
and right on time, with no indefinite delays.
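
The re-retrieval rule for NEXT_UPDATE can be written down in a few lines of
Python. This is a sketch under my own assumptions: the CONTENT string has
already been turned into a date, "after the update date" is read as "on or
after", and needs_refetch is a name invented for the example.

```python
from datetime import date


def needs_refetch(next_update, last_fetched, today):
    """True when the announced update date has arrived and our copy predates it."""
    if next_update is None:
        return False  # nothing announced: the crawler keeps its usual schedule
    return today >= next_update and last_fetched < next_update


# Theatre program announced for update on 1 Jun 1996, last indexed on 1 May.
print(needs_refetch(date(1996, 6, 1), date(1996, 5, 1), date(1996, 5, 20)))  # False
print(needs_refetch(date(1996, 6, 1), date(1996, 5, 1), date(1996, 6, 2)))   # True
```

Once the document has been fetched again, the comparison with last_fetched
stops the database from retrieving it over and over if the author forgot to
advance the date.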
B 1 A new small 'placeholder' document shall replace the old one, naturally under
the same name. I propose the following syntax:
<META NAME="MOVED" CONTENT="http:etc etc">
Instead of "http:...", "ftp:..." or whatever is applicable can be
used.
With this information document databases have the possibility to
perform the following action:
- update the document links
Web browsers can automatically follow the new link (highlighting the event
to the user?) and, if the old link was also a bookmark, update it.
If a document will presumably be moved at some time in the future, combining
a NEXT_UPDATE META tag in the original document with a MOVED META tag in
the 'placeholder' document will provide for an immediate update of database
pointers after the date indicated.
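
How a database or browser might follow MOVED placeholders can be sketched in
Python as below. The assumptions are mine: fetch is a hypothetical function
that returns the HTML text for a URL (a dictionary stands in for the net
here), the example hosts are invented, and the limit on hops and the loop
check are safeguards I added, not part of the proposal.

```python
from html.parser import HTMLParser


class MovedFinder(HTMLParser):
    """Remember the CONTENT of a <META NAME="MOVED" ...> tag, if present."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").upper() == "MOVED" and attrs.get("content"):
            self.target = attrs["content"]


def moved_target(html):
    finder = MovedFinder()
    finder.feed(html)
    return finder.target


def resolve(url, fetch, max_hops=5):
    """Follow MOVED placeholders from url and return the final location."""
    seen = {url}
    for _ in range(max_hops + 1):
        target = moved_target(fetch(url))
        if target is None:
            return url                      # a real document, not a placeholder
        if target in seen:
            raise ValueError("MOVED loop at " + target)
        seen.add(target)
        url = target
    raise ValueError("too many MOVED hops")


# Hypothetical pages standing in for real servers.
pages = {
    "http://old.example/doc.html":
        '<HEAD><META NAME="MOVED" CONTENT="http://new.example/doc.html"></HEAD>',
    "http://new.example/doc.html": "<HEAD><TITLE>The document</TITLE></HEAD>",
}
print(resolve("http://old.example/doc.html", pages.__getitem__))
# http://new.example/doc.html
```

The returned URL is what a database would store in place of the old pointer,
and what a browser would write back into the bookmark.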
Final remarks
I believe that the http servers should also provide two tables for the
crawlers, keeping track, on a daily schedule, of what appears on and
disappears from the site (do they do that already?). This will at least
filter out some noise related to changes in old documents or in documents not
following this recommendation. Still, these tables will not be able to suggest
the removal of outdated information when a file remains 'forgotten' on the
site, and they will not guarantee a 'right on time' update of the databases
when the scheduled document update takes place.
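
The two tables could be produced by comparing yesterday's list of documents
with today's. A rough Python sketch, assuming the server can list its own
document paths (the paths and the function name are invented for the example):

```python
def daily_changes(yesterday, today):
    """Return (appeared, disappeared) document paths between two daily listings."""
    old, new = set(yesterday), set(today)
    return sorted(new - old), sorted(old - new)


appeared, disappeared = daily_changes(
    ["/index.html", "/program-may.html"],
    ["/index.html", "/program-june.html"],
)
print(appeared)      # ['/program-june.html']
print(disappeared)   # ['/program-may.html']
```

A comparison of names alone does not notice a file whose contents changed
under the same name, so it cannot stand in for the META tags proposed here.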
What if the update does not take place as scheduled?
Well, I can generate more ideas to trap oddities and codify the behaviour, but
I think I've already written too much. If this thread goes on, everything will
eventually be ironed out.
Please consider that, while these problems and their consequences are for now
annoying but still manageable, the number of Web documents is growing
'exponentially', so we are going to face an incredible amount of junk information.
Thanks for your attention, yours faithfully,
Michele Bassan
Via XXIV maggio, 10
35010 Vigonza - Padova - Italy
[email protected]
[email protected]
http://intercity.shiny.it/i3