Re: HTML 4 and Chemistry

27 May 1998

      In reply to  Adam Hodgkin,
...
Why should we explicitly exclude XML and CML? Surely if they have a role to
play it is a mistake to try and cover the same ground in HTML 4? Is HTML 4
coming from a different set of concerns than is served by XML?
I did not mean to imply that. What I meant to say was that HTML and XML/CML
throw up slightly different issues and, in an over-long email, I wanted to
leave XML to a separate posting. Relating to HTML, however, there do appear
to be some 2 million "chemical" documents out there in that form, but only
about 0.5% of that number (roughly 10,000) of "high value" legacy molecular
descriptors such as molfiles etc. One presumes that 99.5% of these chemical
documents either contain little "marked-up" chemistry, or it has all
disappeared into "unindexable" bitmap images. If we managed to replace
those 2 million HTML documents, and any images they point to, with
equivalent XML/CML, then the value could go up enormously. For that to
happen, the community will have to develop a lot of high quality, low cost
tools, and the tangible benefits will have to be ardently sold to everyone.
...
There is a difference between chemistry on the page (explicitly encoded in
HTML) and chemistry available through the www (in databases which may be
accessed through the www) -- a suitably designed robot could rely on the
search capabilities of the databases it was pointed at. Should a chemistry
robot trawl for chemically interesting HTML or offer multiple searches of
www sites with publicly available databases? Or both? If both, there is
likely to be a trade-off between consistency and completeness. Well, it won't
be the first time.
This is an interesting key point. The model followed for the last 100 years
is for one or two organisations to collect chemistry on our behalf,
consolidate it behind comprehensive, high quality, but largely closed
databases, and then sell the "added value" so accrued back to us. Thus
accessing either CAS or Beilstein is not cheap. The cost has historically
come from the need to employ literally thousands of human abstracters
retrieving chemical information in a largely manual manner from paper
sources. Nowadays, it is still the case that almost all "collections" of
molecules are hidden behind a variety of custom databases. Some, such as
ChemFinder, are nevertheless freely available, but the really large ones
are only available if you are well-resourced. Thus the issue of whether
HTML or XML/CML should carry parsable chemical information, or whether most
of that information should be closed from view in a proprietary database,
must surely be one of the key debates of the next few years. There is no
doubt, for example, that journals are rapidly evolving into large, and
mostly proprietary, databases. The tagged chemical content of these
databases is still relatively low, but in one scenario a few very large
publishers might well control almost all access to such databases, and one
might imagine that, in that scenario, as the "quality" of the chemical
access goes up, so will the costs.

There might also be a "third way" (a term currently popular in British
politics!). Working with XML/CML teaches us the value of molecular
components, i.e. well defined chemical objects. Such object collections
clearly need quality control mechanisms, and a means to create a
distributed, decentralised system. Whilst it is possible to envisage the Web
as providing that infrastructure, we found it interesting to investigate
the "object store" as a "third way", and the possibilities of creating a
globally distributed "chemical object store". If you want an (approximate)
analogy, ponder the amazing infrastructure that has been built up on your
behalf to direct you in about 1-2 seconds to the resource behind the simple
URL, i.e. the distributed Domain Name System (DNS).

Our experiments in "chemical object stores" or "COS" are to be found at

http://www.ch.ic.ac.uk/vchemlab/cos/

Dr Henry Rzepa,  Dept. Chemistry,  Imperial College,  LONDON SW7 2AY;
mailto:rzepa@ic.ac.uk; Tel  (44) 171 594 5774; Fax: (44) 171 594 5804.
URL: http://www.ch.ic.ac.uk/rzepa/ 

chemweb: A list for Chemical Applications of the Internet.
To post to list:  mailto:chemweb@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/chemweb/
To (un)subscribe, mailto:majordomo@ic.ac.uk the following message;
(un)subscribe chemweb
List coordinator, Henry Rzepa (mailto:rzepa@ic.ac.uk)
