chemoCR

Tool for Chemical Compound Reconstruction

chemoCR™ makes the chemical information contained in depictions of chemical structures accessible to computer programs as a connection table.

To solve the problem of recognizing and translating chemical structures in image documents, our chemoCR™ system combines pattern recognition techniques with a chemical rule-based expert system. The method is based on the idea of identifying the most significant fragments of small molecules from depictions. The workflow consists of three phases: image vectorization, chemical entity extraction, and molecule reconstruction.

 

To solve the problem of recognizing and learning chemical structures in image documents, our chemoCR™ system combines pattern recognition techniques with supervised machine-learning concepts. The method is based on the idea of identifying the most significant semantic entities (e.g. chiral bonds, superatoms, reaction arrows) from depictions. The workflow consists of three phases: image preprocessing, semantic entity recognition, and molecule reconstruction plus validation of the result. All steps of the process make use of chemical knowledge in order to detect and fix errors. The system can be adapted to different sets of input images.

SCAI and its partners have developed

  • a new vectorization algorithm based on textures
  • a new OCR tool for chemical characters using machine learning
  • a new expert system for the extraction of chemical entities by combining graphical primitives and chemical knowledge
  • a scoring module for the reconstruction validation

(cf. the references below)

The validation module computes several reconstruction scores and highlights parts of the molecule where errors could have occurred.

The chemoCR™ core functionality is based on platform-independent Java libraries, although the licensing mechanism relies on operating-system-dependent libraries. We have added some prototype interfaces to external commercial software, which can be installed optionally.

It has been extensively tested on UNIX™ operating systems (Fedora Linux, Sun Solaris) and on Windows XP™. Users may apply our software interactively through a graphical user interface or run it in distributed batch-processing mode in a grid-enabled hardware environment.

The following configurations have been tested, but the software should not be limited to them:

Workstation Configuration

  • i686 or SPARC architecture
  • Linux (Fedora 3-6, SuSE 10.0), Windows XP, Solaris 9
  • Java Runtime Environment (JRE) v1.5 and newer by SUN

Cluster Configuration (additionally)

  • Job scheduling system (e.g. SGE v6.0u3)
  • FLEXlm server v10.8 by Macrovision

Software which has been interfaced

  • Java Advanced Imaging by SUN
  • MarvinBeans by ChemAxon
  • LexiChem (mol2name) by OpenEye (http://www.eyesopen.com)
  • International Chemical Identifier (InChI™) by IUPAC

Snapshot of the Graphical User Interface (GUI) of chemoCR™. The left panel shows the input bitmap; the right panel shows the reconstructed molecules. Expanded superatom groups are highlighted in color. The reconstruction of the bridged rings still contains a small error, which we are working on.

  • conversion of various bitmap images (e.g. BMP, GIF, PNG, multi page TIF) into chemical file formats (e.g. SMILES, SDF)
  • GUI for manual curation (cf. Figure below)
  • PDF document processing
  • depictions with multiple molecules can be handled
  • chemical page segmentation of full page scans
  • fully automatic batch processing mode (can be distributed over a cluster)
  • reconstruction of the full bond information (single, double, triple, chiral bonds)
  • recognition of superatoms and their conversion into structural representation
  • scoring scheme for the reconstruction process based on known chemical scaffolds
  • training ability for the OCR process (e.g. fused letters) and for teaching new superatoms
  • customization via easy manipulation of ASCII parameter files
  • chemical intelligence (e.g. filling free valences)
  • recognition of R-groups (Markush structures and bridged ring systems are not yet covered)
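
The "chemical intelligence" item above (filling free valences) can be illustrated with a minimal Python sketch. It is not chemoCR's implementation: the function name `fill_free_valences`, the `DEFAULT_VALENCE` table, and the tuple-based connection table are assumptions made for this example, and only neutral C, N and O atoms with standard valences are handled.

```python
# Hypothetical sketch: derive implicit hydrogens from a connection table.
# Assumption: neutral atoms with the standard valences listed here.
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}


def fill_free_valences(atoms, bonds):
    """Return the number of hydrogens needed to saturate each atom.

    atoms: list of element symbols, e.g. ["C", "O"]
    bonds: list of (atom_index, atom_index, bond_order) tuples
    """
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order

    hydrogens = []
    for index, (symbol, valence_used) in enumerate(zip(atoms, used)):
        free = DEFAULT_VALENCE[symbol] - valence_used
        if free < 0:
            # An over-saturated atom hints at a reconstruction error.
            raise ValueError(f"atom {index} ({symbol}) exceeds its valence")
        hydrogens.append(free)
    return hydrogens


# Acetic acid heavy-atom skeleton: C-C(=O)-O
acetic_acid_atoms = ["C", "C", "O", "O"]
acetic_acid_bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
print(fill_free_valences(acetic_acid_atoms, acetic_acid_bonds))
```

The same valence bookkeeping doubles as a plausibility check: an atom with more bonds than its valence allows points to a misread bond or character in the depiction.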

Please contact us if you want to start a collaboration on new features.

Documentation

Productsheet, Download [PDF, 450 KB]
Chemical Page Segmentation Poster, Download [PDF, 1.2 MB]
Overview Poster, Download [PDF, 1.6 MB]

The majority of chemical structure information in the literature (including patents) is present as two-dimensional graphical representations. These images can be interpreted very easily by a chemist, but pose a large problem for the computer. For example, the following figure shows the same molecule drawn with two different tools. So far the computer cannot perceive this equivalence from the picture itself. Therefore you cannot search for molecules in pictures or index documents by the pictures they contain. For example, try searching the Chemical Structure Lookup Service (CSLS) with an image:

Two depictions of Azithromycin.

On the other hand, once the picture has been converted into a connection table, several chemoinformatics algorithms exist that solve this problem. After the conversion, a lot of information on the molecule can be computed directly or retrieved from chemical databases. So why not use the corresponding SMILES:

CN1C(C(C(C)(C(OC(C(C(C(C(C(CC(C1)C)(C)O)OC1C(O)C(N(C)C)CC(C)O1)C)OC1CC(OC)(C)C(O)C(C)O1)C)=O)CC)O)O)C

So chemoCRTM is for

  • retrieval
  • indexing
  • property prediction

of chemical molecules in depictions.
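
To illustrate why a connection table makes retrieval and indexing tractable, the sketch below parses the azithromycin SMILES shown above into atoms and bonds and derives a key that does not depend on how the atoms are numbered. This is only an illustration under stated assumptions: `parse_smiles` handles just the small SMILES subset used on this page (C, N, O, branches, single-digit ring closures, `=` and `#`), and `structure_key` is a Weisfeiler-Lehman style label refinement, which is a heuristic and not a complete graph-isomorphism test. Both function names are hypothetical; production systems typically rely on canonical SMILES or InChI instead.

```python
import hashlib

ELEMENTS = {"C", "N", "O"}  # assumption: the only atoms in this SMILES subset


def parse_smiles(smiles):
    """Parse a small SMILES subset into a connection table (atoms, bonds)."""
    atoms, bonds = [], []      # element symbols; (i, j, bond_order) tuples
    prev, order = None, 1      # atom the next atom attaches to; pending bond order
    branches, rings = [], {}   # open branch points; open ring-closure digits
    for ch in smiles:
        if ch in ELEMENTS:
            atoms.append(ch)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, order))
            prev, order = len(atoms) - 1, 1
        elif ch == "(":
            branches.append(prev)
        elif ch == ")":
            prev = branches.pop()
        elif ch in "=#":
            order = 2 if ch == "=" else 3
        elif ch.isdigit():
            if ch in rings:
                bonds.append((rings.pop(ch), prev, order))
            else:
                rings[ch] = prev
            order = 1
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


def structure_key(atoms, bonds, rounds=3):
    """Numbering-independent key from iterative neighbourhood relabelling."""
    neighbours = [[] for _ in atoms]
    for i, j, order in bonds:
        neighbours[i].append((order, j))
        neighbours[j].append((order, i))
    labels = list(atoms)  # start from the element symbols
    for _ in range(rounds):
        labels = [
            hashlib.sha1(
                repr((labels[i], sorted((o, labels[j]) for o, j in neighbours[i]))).encode()
            ).hexdigest()
            for i in range(len(atoms))
        ]
    return hashlib.sha1("".join(sorted(labels)).encode()).hexdigest()


AZITHROMYCIN = (
    "CN1C(C(C(C)(C(OC(C(C(C(C(C(CC(C1)C)(C)O)OC1C(O)C(N(C)C)CC(C)O1)C)"
    "OC1CC(OC)(C)C(O)C(C)O1)C)=O)CC)O)O)C"
)

atoms, bonds = parse_smiles(AZITHROMYCIN)
print(len(atoms), "heavy atoms,", len(bonds), "bonds")

# Ethanol written in three different atom orders yields one key;
# dimethyl ether (same atoms, different connectivity) yields another.
ethanol_keys = {structure_key(*parse_smiles(s)) for s in ("CCO", "OCC", "C(O)C")}
print(len(ethanol_keys) == 1)
print(structure_key(*parse_smiles("COC")) in ethanol_keys)
```

A key of this kind can serve as a lookup index for retrieval, whereas two bitmap depictions of the same molecule share no comparable representation.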

In this highly interdisciplinary domain, interesting information is often presented as a combination of text and graphics. Combining textual information extraction methods with chemoCRTM for the multimodal information extraction of Markush structures from patents and from QSAR tables has not been addressed yet.

An example patent page showing a Markush structure is displayed to the left. This functionality will be part of future work.

At the moment we are looking into reaction schemes.

 

 

Please contact us if you want to start a collaboration on these topics.

Problem Description

Chemical entities can appear in scientific texts as trivial and brand names, assigned catalog names, or IUPAC names. However, the preferred representation of chemical entities is often a two-dimensional depiction of the chemical structure. Depictions can be found as images in nearly all electronic sources of chemical information (e.g. journals, reports, patents, and web interfaces of chemical databases).

Nowadays these images are generated with special drawing programs, either automatically from connection table file formats or by the chemist through a graphical user interface. Although drawing programs can produce and store the information in a computer-readable format, chemical structure depictions are published as bitmap images (e.g. GIF for web interfaces or BMP for text documents). As a consequence, the structure information can no longer be used as input to chemical analysis software packages. To make published chemical structure information available in a computer-readable format, images representing chemical structures have to be manually converted by redrawing every structure. This is a time-consuming and error-prone process.

Have some fun redrawing it ...

SCAI Bioinformatics has long-standing experience in the automated extraction of information from biomedical literature. Based on our experience in the field of biological information extraction (cf. BER and ProMiner), we recently extended the scope of our research towards chemical entity recognition.

Here are some references on this topic:


Chemical Structure Recognition:

  • Framework for Extracting Rotation Invariant Features for Image Classification and an Application using Haar Wavelets; S. Akle, M.-E. Algorri and M. Zimmermann; WSEAS Intern. Conferences, Univ. of Cambridge, February 2008.
  • Chemical Structure Recognition via an expert system guided graph exploration; Peter Kral; Diploma Thesis; Ludwig-Maximilians-Universitaet, Muenchen, 2007. Download [PDF, ZIP 2.8 MB]
  • Combating Illiteracy in Chemistry: Towards Computer-Based Chemical Structure Reconstruction; M. Zimmermann, C. M. Friedrich and M. E. Algorri; 1st German Conference on Chemoinformatics, (2005), Goslar.
  • CSR - Chemical Structure Recognition from Images; Y. Wang, L. T. Bui Thi, C. M. Friedrich, M. Zimmermann, M. Algorri, H. Noltemeier, M. Hofmann; German Conference on Bioinformatics (GCB 2005), Hamburg
  • Combating Illiteracy in Chemistry: Towards Computer-Based Chemical Structure Reconstruction; M. Zimmermann, Le T. Bui Thi, M. Hofmann; ERCIM News No. 60, (2005) 40-41. Download [PDF, 217 KB]
  • Graph-Rekonstruktion im Rahmen chemischer Strukturrepraesentationen; Le Thuy Bui Thi; Diplomarbeit; Bayerische Julius-Maximilians-Universitaet, Wuerzburg, 2005. Download [PDF, 1.2 MB]
  • Kekule: OCR-optical chemical (structure) recognition; R. McDaniel and Jason R. Balmuth; J. Chem. Inf. Comput. Sci., 32(4):373-378, 1992.
  • Chemical Literature Data Extraction: The CLiDE Project; P. Ibison, M. Jacquot, F. Kam, A. G. Neville, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson, J. Chem. Inf. Comput. Sci., vol. 33(3): 338-344, 1993.
  • Optical recognition of chemical graphics; S. Boyer, Document Analysis and Recognition, Proceedings of the Second International Conference on Publication, 627-631, 20-22 Oct 1993.


Information Extraction:

  • Identification of New Drug Classification Terms in Textual Resources; C. Kolarik, M. Hofmann-Apitius, M. Zimmermann and J. Fluck; ISMB/ECCB, (2007), Vienna.
  • Mining chemical structural information from the drug literature; D. L. Banville; DDT (2006), 11(1):35-42.
  • Information extraction in the life sciences: perspectives for medicinal chemistry, pharmacology and toxicology; M. Zimmermann, J. Fluck, L. T. Bui Thi, C. Kolarik, K. Kumpf, M. Hofmann; Curr Top Med Chem. (2005); 5(8):785-96. PMID: 16101418 Download [PDF, 165 KB]
  • Information Extraction Technologies for the Life Science Industry; J. Fluck, M. Zimmermann, G. Kurapkat, M. Hofmann; Drug Discovery Today-Technologies (2005), 2(3):217-24. Download [PDF, 278 KB]


Selected Talks: