In reply to Adam Hodgkin,
Why should we explicitly exclude XML and CML? Surely if they have a role to play it is a mistake to try and cover the same ground in HTML 4? Is HTML 4 coming from a different set of concerns than is served by XML?
I did not mean to imply that. What I meant to say was that HTML and XML/CML throw up slightly different issues, and in an over-long email I wanted to leave XML to a separate posting. Relating to HTML, however, there do appear to be some 2 million "chemical" documents out there in that form, but only about 0.5% of that number of "high value" legacy molecular descriptors such as molfiles. One presumes that 99.5% of these chemical documents either contain little "marked-up" chemistry, or it has all disappeared into "unindexable" bitmap images. If we managed to replace those 2 million HTML documents, and any images they point to, with equivalent XML/CML, then the value could go up enormously. For that to happen, the community will have to develop a lot of high-quality, low-cost tools, and the tangible benefits will have to be ardently sold to everyone.
There is a difference between chemistry on the page (explicitly encoded in HTML) and chemistry available through the www (in databases which may be accessed through the www) -- a suitably designed robot could rely on the search capabilities of the databases it was pointed at. Should a chemistry robot trawl for chemically interesting HTML or offer multiple searches of www sites with publicly available databases? Or both? If both, there is likely to be a trade-off between consistency and completeness. Well, it won't be the first time.
This is an interesting key point. The model followed for the last 100 years is for one or two organisations to collect chemistry on our behalf, consolidate it behind comprehensive, high-quality, but largely closed databases, and then sell the "added value" so accrued back to us. Thus accessing either CAS or Beilstein is not cheap. The cost has historically come from the need to employ literally thousands of human abstracters retrieving chemical information in a largely manual manner from paper sources. Nowadays, it is still the case that almost all "collections" of molecules are hidden behind a variety of custom databases. Some, such as ChemFinder, are nevertheless freely available, but the really large ones are only available if you are well-resourced. Thus the issue of whether HTML or XML/CML should carry parsable chemical information, or whether most of that information should be closed from view in a proprietary database, must surely be one of the key debates of the next few years. There is no doubt, for example, that journals are rapidly evolving into large and mostly proprietary databases. The tagged chemical content of these databases is still relatively low, but in one scenario a few very large publishers might well control almost all access to such databases, and one might imagine that, in that scenario, as the "quality" of the chemical access goes up, so will the costs.
There might also be a "third way" (a term currently popular in British politics!). Working with XML/CML teaches us the value of molecular components, i.e. well-defined chemical objects. Such object collections clearly need quality control mechanisms, and a means to create a distributed, decentralised system. Whilst it is possible to envisage the Web as providing that infrastructure, we found it interesting to investigate the "object store" as a "third way", and the possibilities of creating a globally distributed "chemical object store".
If you want an (approximate) analogy, ponder the amazing infrastructure that has been built up on your behalf to direct you in about 1-2 seconds to the resource behind a simple URL, i.e. the distributed Domain Name System (DNS). Our experiments in "chemical object stores" or "COS" are to be found at http://www.ch.ic.ac.uk/vchemlab/cos/
Dr Henry Rzepa, Dept. Chemistry, Imperial College, LONDON SW7 2AY; mailto:rzepa@ic.ac.uk; Tel (44) 171 594 5774; Fax: (44) 171 594 5804. URL: http://www.ch.ic.ac.uk/rzepa/
chemweb: A list for Chemical Applications of the Internet.
To post to list: mailto:chemweb@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/chemweb/
To (un)subscribe, mailto:majordomo@ic.ac.uk the following message;
(un)subscribe chemweb
List coordinator, Henry Rzepa (mailto:rzepa@ic.ac.uk)