The search engine www.hotbot.com has, in the last day, introduced a facility to search for all documents that reference a file of a given type, e.g. a .pdb file. This returns 4882 results for .pdb, 966 for .xyz, 840 for .mol, 463 for .csml, etc. Such searches can be combined with keyword searches.
It is simple to imagine how a database could be constructed quite quickly from a web robot sent out to retrieve these files.
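As a rough illustration of the first step such a robot would need, the sketch below pulls out of a single HTML page the links whose file extension matches one of the chemical file types mentioned above. It is only a sketch under stated assumptions: the names ChemLinkParser and find_chem_links are hypothetical, the extension list is simply the four types from the search counts, and fetching pages, following links and respecting robots.txt are left out entirely.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Assumption: the four extensions quoted in the search counts above.
CHEM_EXTENSIONS = {".pdb", ".xyz", ".mol", ".csml"}


class ChemLinkParser(HTMLParser):
    """Collect absolute URLs of <a href=...> links to chemical files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)
                # Look only at the path, so query strings are ignored.
                path = urlparse(absolute).path
                extension = posixpath.splitext(path)[1].lower()
                if extension in CHEM_EXTENSIONS:
                    self.links.append(absolute)


def find_chem_links(html, base_url):
    """Return the chemical-file links found in one page of HTML."""
    parser = ChemLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Matching on the extension alone says nothing about whether the file behind the link is really a chemical structure file, so anything collected this way would still need to be checked by reading it.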
Now, the only problem is to assure ourselves that no other discipline has also used this namespace, not to mention whether everyone assumes the same structure for e.g. .xyz or .mol files.
This has been done already. See http://schiele.organik.uni-erlangen.de/services/webmol.html for a discussion of these sorts of problems. The short answer is that other disciplines *have* used these namespaces, but that it is safe to assume that any file that does not generate an error in a good PDB reader is a valid PDB file.

It is important to keep in mind, however, exactly what this sort of search reveals. It will give a list of files that are *individually referenced* on other WWW pages. It will not give any indication of files that are themselves accessible only through another search engine. For example, I think the PDB itself has more than 4500 entries. These files are not included in the numbers above. With ~11,000 structures, http://chemfinder.camsoft.com is probably in the top handful of sites that have chemical structures, but it's not the largest (quality is a different matter; I'm just talking size). I don't think any of these are indexed at hotbot or anywhere else.

It is possible to create an index of chemical information on the WWW; that is what we are doing at chemfinder.camsoft.com. It is not possible, however, to automate the process fully. A discussion of some of the roadblocks to complete automation is on the web at http://www.camsoft.com/chemfinder/errorsfound.html

Jonathan Brecher
CambridgeSoft Corporation
jsb@camsoft.com

-----
chemweb: A list for Chemical Applications of the Internet.
Archived as: http://www.ch.ic.ac.uk/hypermail/chemweb/
To unsubscribe, send to listserver@ic.ac.uk the following message;
unsubscribe chemweb
List coordinator, Henry Rzepa (rzepa@ic.ac.uk)