Fraunhofer Institute for Algorithms and Scientific Computing SCAI
chemoCR - Tool for Chemical Compound Reconstruction
chemoCR™ makes the chemical information contained in depictions of chemical structures accessible to computer programs as a connection table.
In order to solve the problem of recognizing and translating chemical structures in image documents, our chemoCR™ system combines pattern recognition techniques with a chemical rule-based expert system. The method is based on the idea of identifying the most significant fragments of small molecules from depictions. The workflow consists of three phases: image vectorization, chemical entity extraction, and molecule reconstruction.
- Application Fields
- Scientific Background
- Technical Specification
Chemical entities can appear in scientific texts as trivial names, brand names, assigned catalog names, or IUPAC names. However, the preferred representation of chemical entities is often a two-dimensional depiction of the chemical structure. Depictions can be found as images in nearly all electronic sources of chemical information (e.g. journals, reports, patents, and web interfaces of chemical databases).
Nowadays these images are generated with special drawing programs, either automatically from connection table file formats or by the chemist through a graphical user interface. Although drawing programs can produce and store the information in a computer-readable format, chemical structure depictions are published as bitmap images (e.g. GIF for web interfaces or BMP for text documents). As a consequence, the structure information can no longer be used as input to chemical analysis software packages. To make published chemical structure information available in a computer-readable format, images representing chemical structures have to be manually converted by redrawing every structure. This is a time-consuming and error-prone process.
Have some fun redrawing it ...
a nicely drawn molecule ;-)
SD file for the depicted molecule, Download (SDF, 16.0 K)
The majority of chemical structure information in the literature (including patents) is present as two-dimensional graphical representations. These images can be interpreted very easily by the chemist, but pose a large problem to the computer. For example, the following figure shows the same molecule drawn with two different tools. So far the computer cannot perceive this equivalence from the picture itself. Therefore you cannot search for molecules in pictures or index documents by the pictures they contain. For example, try to search the Chemical Structure Lookup Service (CSLS) with an image:
Two depictions of Azithromycin.
On the other hand, once the picture is converted into a connection table, several chemoinformatics algorithms exist to solve this problem. After the conversion process, a lot of information on the molecule can be directly computed or retrieved from chemical databases. So why not use the corresponding SMILES:
So chemoCR™ is for
- property prediction
of chemical molecules in depictions.
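The equivalence problem described above becomes a plain graph comparison once both depictions are available as connection tables. The following is a minimal sketch of that idea, not chemoCR's own method: the function name `same_molecule` and the toy data layout (a list of element symbols plus a dictionary of bonds) are assumptions made for illustration, and the brute-force search is only practical for very small molecules. Production toolkits typically compare canonical SMILES or InChI strings instead.

```python
from itertools import permutations


def same_molecule(atoms_a, bonds_a, atoms_b, bonds_b):
    """Brute-force test whether two small connection tables describe the same graph.

    atoms_*: list of element symbols, indexed by atom number (hypothetical layout).
    bonds_*: dict mapping frozenset({i, j}) to the bond order.
    """
    if sorted(atoms_a) != sorted(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    n = len(atoms_a)
    for perm in permutations(range(n)):
        # perm[i] is the atom of B that atom i of A is mapped onto.
        if any(atoms_a[i] != atoms_b[perm[i]] for i in range(n)):
            continue
        mapped = {
            frozenset(perm[i] for i in pair): order
            for pair, order in bonds_a.items()
        }
        if mapped == bonds_b:
            return True
    return False


# Ethanol written down twice with different atom numbering (hydrogens implicit).
ethanol_1 = (["C", "C", "O"], {frozenset({0, 1}): 1, frozenset({1, 2}): 1})
ethanol_2 = (["O", "C", "C"], {frozenset({0, 1}): 1, frozenset({1, 2}): 1})
# Dimethyl ether has the same atoms but a different connectivity.
dimethyl_ether = (["C", "O", "C"], {frozenset({0, 1}): 1, frozenset({1, 2}): 1})

print(same_molecule(*ethanol_1, *ethanol_2))
print(same_molecule(*ethanol_1, *dimethyl_ether))
```

The two ethanol tables compare as equal although their atoms are numbered differently, while dimethyl ether is rejected despite having the same element counts. No pixel-level comparison of two drawings can make this distinction.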
In this highly interdisciplinary domain, interesting information is often presented as a combination of text and graphics. Combining textual information extraction methods with chemoCR™ for the multimodal information extraction of Markush structures from patents and from QSAR tables has not been addressed yet.
An example patent page showing a Markush structure is shown on the left. This functionality will be part of future work.
At the moment we are looking into reaction schemes.
Please contact us if you want to start a collaboration on these topics.
In order to solve the problem of recognizing and learning chemical structures in image documents, our chemoCR™ system combines pattern recognition techniques with supervised machine-learning concepts. The method is based on the idea of identifying the most significant semantic entities (e.g. chiral bonds, superatoms, reaction arrows) from depictions. The workflow consists of three phases: image preprocessing, semantic entity recognition, and molecule reconstruction plus validation of the result. All steps of the process make use of chemical knowledge in order to detect and fix errors. The system can be adapted to different sets of input images.
SCAI and its partners have developed
- a new vectorization algorithm based on textures
- a new OCR tool for chemical characters using machine learning
- a new expert system for the extraction of chemical entities by combining graphical primitives and chemical knowledge
- a scoring module for the reconstruction validation
The validation module computes several reconstruction scores and highlights parts of the molecule where errors could have occurred.
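As an illustration of how chemical knowledge can point to likely reconstruction errors, the sketch below flags atoms whose total bond order exceeds a typical valence. This is an assumption-laden toy and not the chemoCR scoring scheme: the `MAX_VALENCE` table, the function name `flag_suspicious_atoms`, and the input layout are all hypothetical.

```python
# Assumed upper limits for neutral atoms; charged or hypervalent species are ignored.
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


def flag_suspicious_atoms(atoms, bonds):
    """Return indices of atoms whose summed bond order exceeds the assumed limit.

    atoms: list of element symbols.
    bonds: list of ((i, j), order) tuples.
    """
    used = [0] * len(atoms)
    for (i, j), order in bonds:
        used[i] += order
        used[j] += order
    # Elements missing from the table get a limit of 8, so they are never flagged here.
    return [
        k
        for k, (symbol, total) in enumerate(zip(atoms, used))
        if total > MAX_VALENCE.get(symbol, 8)
    ]


# A reconstruction in which a line was misread as a double bond:
# carbon 0 ends up with three single bonds plus one double bond.
atoms = ["C", "C", "C", "C", "O"]
bonds = [((0, 1), 1), ((0, 2), 1), ((0, 3), 1), ((0, 4), 2)]
flagged = flag_suspicious_atoms(atoms, bonds)
print(flagged)
```

Atoms returned by such a check are natural candidates for highlighting in a curation view, since a five-valent neutral carbon almost always indicates a misread bond.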
Snapshot of the Graphical User Interface (GUI) of chemoCR™. In the left panel the input bitmap can be seen. In the right panel the reconstructed molecules are shown. Expanded superatom groups are highlighted by colors. The bridged rings have a small error, but we are working on it.
- conversion of various bitmap images (e.g. BMP, GIF, PNG, multi-page TIF) into chemical file formats (e.g. SMILES, SDF)
- GUI for manual curation (cf. Figure below)
- PDF document processing
- depictions with multiple molecules can be handled
- chemical page segmentation of full page scans
- fully automatic batch processing mode (can be distributed over a cluster)
- reconstruction of the full bond information (single, double, triple, chiral bonds)
- recognition of superatoms and their conversion into structural representation
- scoring scheme for the reconstruction process based on known chemical scaffolds
- training ability for the OCR process (e.g. fused letters) and teaching new superatoms
- customization via easy manipulation of ASCII parameter files
- chemical intelligence (e.g. filling free valences)
- recognition of R-groups (Markush structures and bridged ring systems are not included)
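The "filling free valences" item in the list above can be pictured with a small sketch: hydrogens are rarely drawn on carbon atoms, so a reconstruction can add them by subtracting the bond orders already used from a default valence. This is only an illustration under simple assumptions (neutral atoms, the hypothetical `DEFAULT_VALENCE` table and `implicit_hydrogens` helper below), not the rule set implemented in chemoCR.

```python
# Assumed default valences for neutral atoms (illustrative, not exhaustive).
DEFAULT_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


def implicit_hydrogens(atoms, bonds):
    """Return, per atom, the number of hydrogens needed to fill its free valences.

    atoms: list of element symbols for the explicitly drawn atoms.
    bonds: dict mapping (i, j) index pairs to bond orders.
    """
    used = [0] * len(atoms)
    for (i, j), order in bonds.items():
        used[i] += order
        used[j] += order
    # Never return a negative count; over-valent atoms simply get zero hydrogens.
    return [max(DEFAULT_VALENCE[s] - u, 0) for s, u in zip(atoms, used)]


# Acetic acid as typically drawn: C0-C1, C1=O2, C1-O3, no hydrogens shown.
atoms = ["C", "C", "O", "O"]
bonds = {(0, 1): 1, (1, 2): 2, (1, 3): 1}
hydrogens = implicit_hydrogens(atoms, bonds)
print(hydrogens)
```

For the acetic acid skeleton this yields three hydrogens on the methyl carbon and one on the hydroxyl oxygen, which is the information a connection table needs before properties such as molecular weight can be computed.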
Please contact us if you want to start a collaboration on new features.
The chemoCR™ core functionality is based on platform-independent Java libraries, although the licensing mechanism relies on operating-system-dependent libraries. We added some prototype interfaces to external commercial software, which can be installed optionally.
It has been extensively tested on UNIX™ operating systems (Fedora Linux, Sun Solaris) and on Windows XP™. Users may apply our software interactively through a graphical user interface or run it distributed in batch processing mode in a grid-enabled hardware environment.
The following configurations have been tested, but the software should not be limited to them:
- i686 or SPARC architecture
- Linux (Fedora 3-6, SuSE 10.0), Windows XP, Solaris 9
- Java Runtime Environment (JRE) v1.5 and newer by Sun
Cluster Configuration (additionally)
- Job scheduling system (e.g. SGE v6.0u3)
- FLEXlm server v10.8 by Macrovision
Software which has been interfaced
- Java Advanced Imaging by Sun
- MarvinBeans by ChemAxon
- LexiChem (mol2name) by OpenEye (http://www.eyesopen.com)
- International Chemical Identifier (InChI™) by IUPAC
Status and Availability
Some conclusions can be drawn from the first evaluations:
- a generic chemoCR framework has been established
- there is not, and there will not be, a one-size-fits-all solution
- chemoCR can be adapted and optimized (parameters, error models, image preprocessing)
- although we have looked into many examples, we have not yet seen all kinds of image sources (e.g. legacy documents)
- we will continuously improve our methods as new challenges come along
Getting hands on chemoCR:
- you can get hands-on experience with chemoCR in an evaluation project
- you can visit our institute for a live demo
- SCAI provides training, installation support, bug fixing, and fitting of chemoCR to your data
- SCAI has a long term research agenda
To get in contact, please send us some typical example images.
SCAI Bioinformatics has long-standing experience in the automated extraction of information from biomedical literature. Based on our experience in the field of biological information extraction (cf. BER and ProMiner), we recently extended the scope of our research towards chemical entity recognition.
Here are some references on this topic:
Chemical Structure Recognition:
- Framework for Extracting Rotation Invariant Features for Image Classification and an Application using Haar Wavelets; S. Akle, M.-E. Algorri and M. Zimmermann; WSEAS Intern. Conferences, Univ. of Cambridge, February 2008.
- Chemical Structure Recognition via an expert system guided graph exploration; Peter Kral; Diploma Thesis; Ludwig-Maximilians-Universitaet, Muenchen, 2007. Download (PDF.ZIP 2.8 M)
- Combating Illiteracy in Chemistry: Towards Computer-Based Chemical Structure Reconstruction; M. Zimmermann, C. M. Friedrich and M. E. Algorri; 1st German Conference on Chemoinformatics, (2005), Goslar.
- CSR - Chemical Structure Recognition from Images; Y. Wang, L. T. Bui Thi, C. M. Friedrich, M. Zimmermann, M. Algorri, H. Noltemeier, M. Hofmann; German Conference on Bioinformatics (GCB 2005), Hamburg
- Combating Illiteracy in Chemistry: Towards Computer-Based Chemical Structure Reconstruction; M. Zimmermann, Le T. Bui Thi, M. Hofmann; ERCIM News No. 60, (2005) 40-41. Download (PDF, 217 K)
- Graph-Rekonstruktion im Rahmen chemischer Strukturrepraesentationen; Le Thuy Bui Thi; Diplomarbeit; Bayerische Julius-Maximilians-Universitaet, Wuerzburg, 2005. Download (PDF, 1.2 M)
- Kekule: Ocr-optical chemical (structure) recognition; R. McDaniel and Jason R. Balmuth. J. Chem. Inf. Comput. Sci., 32(4):373-378, 1992.
- Chemical Literature Data Extraction: The CLiDE Project; P. Ibison, M. Jacquot, F. Kam, A. G. Neville, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson, J. Chem. Inf. Comput. Sci., vol. 33(3): 338-344, 1993.
- Optical recognition of chemical graphics; S. Boyer, Document Analysis and Recognition, Proceedings of the Second International Conference on Publication, 627-631, 20-22 Oct 1993.
- Identification of New Drug Classification Terms in Textual Resources; C. Kolarik, M. Hofmann-Apitius, M. Zimmermann and J. Fluck; ISMB/ECCB, (2007), Vienna.
- Mining chemical structural information from the drug literature; D.L. Banville, DDT(2006), 11(1):35-42, 2006.
- Information extraction in the life sciences: perspectives for medicinal chemistry, pharmacology and toxicology; M. Zimmermann, J. Fluck, Le T. Bui Thi, C. Kolarik, K. Kumpf, M. Hofmann; Curr Top Med Chem. (2005); 5(8):785-96. PMID: 16101418 Download (PDF, 165 K)
- Information Extraction Technologies for the Life Science Industry; J. Fluck, M. Zimmermann, G. Kurapkat, M. Hofmann; Drug Discovery Today-Technologies (2005), 2(3):217-24. Download (PDF, 278 K)
- Fraunhofer-Symposium for Text Mining 2008, Download (PDF, 900 K)
- The International Conference in Trends for Scientific Information Professionals 2007, Download (PDF, 1.6M)
- Discovery Knowledge & Informatics 2007, Download (PDF, 2.3 M)
- Discovery Knowledge & Informatics 2006, Download (PDF, 1.6 M)
- Fraunhofer-Symposium for Text Mining 2006, Download (PDF, 1.6 M)
- Fraunhofer-Symposium for Text Mining 2005, Download (PDF, 1.6 M)