ChemoCR

Benefits:

  • vectorization algorithm based on textures
  • OCR tool for chemical characters using machine learning
  • expert system for the extraction of chemical entities by combining graphical primitives and chemical knowledge
  • scoring module for the reconstruction validation

Software chemoCR

chemoCR™ makes the chemical information contained in depictions of chemical structures accessible to computer programs as a connection table.


Product information

To solve the problem of recognizing and translating chemical structures in image documents, our chemoCR™ system combines pattern recognition techniques with a chemical rule-based expert system. The method is based on the idea of identifying the most significant fragments of small molecules in depictions. The workflow consists of three phases: image vectorization, chemical entity extraction, and molecule reconstruction.

 


Scientific Background

To solve the problem of recognizing and learning chemical structures in image documents, our chemoCR™ system combines pattern recognition techniques with supervised machine-learning concepts. The method is based on the idea of identifying the most significant semantic entities (e.g. chiral bonds, superatoms, reaction arrows) in depictions. The workflow consists of three phases: image preprocessing, semantic entity recognition, and molecule reconstruction followed by validation of the result. All steps of the process make use of chemical knowledge in order to detect and fix errors. The system can be adapted to different sets of input images.

The validation module computes several reconstruction scores and highlights parts of the molecule where errors could have occurred.
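To make the idea of a reconstruction score concrete, here is a minimal sketch of one plausible check: flagging atoms whose total bond order exceeds a common valence. This is an illustration only and not chemoCR's actual scoring scheme. The names `MAX_VALENCE`, `flag_valence_errors`, and `reconstruction_score` are hypothetical, and the valence table is a deliberate simplification that ignores charges and hypervalent atoms.

```python
from collections import defaultdict

# Simplified, assumed valence limits for neutral atoms (illustrative only).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1}


def flag_valence_errors(atoms, bonds):
    """Return the indices of atoms whose bond order sum exceeds the valence limit.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (atom_index_a, atom_index_b, bond_order) tuples
    """
    used = defaultdict(int)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    # Elements missing from the table are not judged at all.
    return [i for i, symbol in enumerate(atoms)
            if symbol in MAX_VALENCE and used[i] > MAX_VALENCE[symbol]]


def reconstruction_score(atoms, bonds):
    """Fraction of atoms that pass the valence check (1.0 means no flagged atom)."""
    if not atoms:
        return 0.0
    return 1.0 - len(flag_valence_errors(atoms, bonds)) / len(atoms)


# Acetaldehyde skeleton where the C-C single bond was misread as a triple bond.
atoms = ["C", "C", "O"]
misread_bonds = [(0, 1, 3), (1, 2, 2)]
print(flag_valence_errors(atoms, misread_bonds))  # → [1]
```

The flagged indices are the kind of information a curation interface could highlight, while the single number gives a quick way to rank reconstructions in a batch run.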

 

ChemoCR Errordisplay


Features

  • conversion of various bitmap images (e.g. BMP, GIF, PNG, multi page TIF) into chemical file formats (e.g. SMILES, SDF)
  • GUI for manual curation (cf. Figure below)
  • PDF document processing
  • depictions with multiple molecules can be handled
  • chemical page segmentation of full page scans
  • fully automatic batch processing mode (can be distributed over a cluster)
  • reconstruction of the full bond information (single, double, triple, chiral bonds)
  • recognition of superatoms and their conversion into structural representation
  • scoring scheme for the reconstruction process based on known chemical scaffolds
  • training ability for the OCR process (e.g. fused letters) and for teaching new superatoms
  • customization via easy manipulation of ASCII parameter files
  • chemical intelligence (e.g. filling free valences)
  • recognition of R-groups (Markush structures and bridged ring systems are not included)
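The conversion of superatoms into a structural representation can be pictured as a dictionary lookup followed by a graph edit. The sketch below is a hypothetical illustration, not chemoCR's implementation: the `SUPERATOMS` table, its labels, and the function `expand_superatoms` are assumptions, and the first atom of each fragment is assumed to be the attachment point.

```python
# Hypothetical superatom dictionary: label -> (fragment atoms, fragment bonds).
# Atom 0 of each fragment is assumed to be the attachment point.
SUPERATOMS = {
    "Me": (["C"], []),
    "Et": (["C", "C"], [(0, 1, 1)]),
    "OMe": (["O", "C"], [(0, 1, 1)]),
    "COOH": (["C", "O", "O"], [(0, 1, 2), (0, 2, 1)]),
}


def expand_superatoms(atoms, bonds):
    """Replace superatom labels in a connection table by explicit atoms and bonds.

    atoms: list of element symbols or superatom labels
    bonds: list of (atom_index_a, atom_index_b, bond_order) tuples
    Returns a new (atoms, bonds) pair; the inputs are left unchanged.
    """
    new_atoms = list(atoms)
    new_bonds = list(bonds)
    for index, label in enumerate(atoms):
        if label not in SUPERATOMS:
            continue
        frag_atoms, frag_bonds = SUPERATOMS[label]
        # The attachment atom reuses the label's index, so existing bonds stay valid;
        # the remaining fragment atoms are appended at the end of the atom list.
        first_new = len(new_atoms)
        mapping = [index] + list(range(first_new, first_new + len(frag_atoms) - 1))
        new_atoms[index] = frag_atoms[0]
        new_atoms.extend(frag_atoms[1:])
        new_bonds.extend((mapping[a], mapping[b], order) for a, b, order in frag_bonds)
    return new_atoms, new_bonds


# A methyl carbon bonded to a "COOH" label becomes an explicit acetic acid skeleton.
print(expand_superatoms(["C", "COOH"], [(0, 1, 1)]))
# → (['C', 'C', 'O', 'O'], [(0, 1, 1), (1, 2, 2), (1, 3, 1)])
```

Keeping the table as plain data means that teaching a new superatom amounts to adding one entry.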

Please contact us if you want to start a collaboration on new features.

Application Fields

The majority of chemical structure information in the literature (including patents) is present as two-dimensional graphical representations. These images are easily interpreted by a chemist but pose a major problem for a computer. For example, the following figure shows the same molecule drawn with two different tools. So far, a computer cannot perceive this equivalence from the pictures themselves. As a result, you cannot search for molecules in pictures or index documents that contain them. For instance, try searching the Chemical Structure Lookup Service (CSLS) with an image:

Two depictions of Azithromycin.

On the other hand, once the picture has been converted into a connection table, several chemoinformatics algorithms exist to solve this problem. After the conversion, much information about the molecule can be computed directly or retrieved from chemical databases. So why not use the corresponding SMILES:

CN1C(C(C(C)(C(OC(C(C(C(C(C(CC(C1)C)(C)O)OC1C(O)C(N(C)C)CC(C)O1)C)OC1CC(OC)(C)C(O)C(C)O1)C)=O)CC)O)O)C
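As an illustration of what can be computed once a depiction is available as a connection table, the sketch below parses the SMILES above into atoms and bonds, fills the free valences with hydrogens, and derives the molecular formula. It is a deliberately tiny parser written for this example, not chemoCR code and not a general SMILES reader: it assumes only neutral C, N, and O atoms, branches, and single-digit ring closures. The names `parse_smiles` and `molecular_formula` are hypothetical.

```python
from collections import Counter

VALENCE = {"C": 4, "N": 3, "O": 2}      # neutral atoms only (simplifying assumption)
BOND_ORDER = {"-": 1, "=": 2, "#": 3}

AZITHROMYCIN = ("CN1C(C(C(C)(C(OC(C(C(C(C(C(CC(C1)C)(C)O)OC1C(O)C(N(C)C)CC(C)O1)C)"
                "OC1CC(OC)(C)C(O)C(C)O1)C)=O)CC)O)O)C")


def parse_smiles(smiles):
    """Parse a restricted SMILES string into a connection table (atoms, bonds)."""
    atoms, bonds = [], []
    branch_stack = []   # atoms to return to when a branch closes
    open_rings = {}     # ring digit -> (atom index, bond order at opening)
    previous = None     # index of the atom the next atom is bonded to
    pending = 1         # order of the next bond
    for ch in smiles:
        if ch in VALENCE:
            atoms.append(ch)
            current = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, current, pending))
            previous, pending = current, 1
        elif ch in BOND_ORDER:
            pending = BOND_ORDER[ch]
        elif ch == "(":
            branch_stack.append(previous)
        elif ch == ")":
            previous = branch_stack.pop()
        elif ch.isdigit():
            if ch in open_rings:
                start, order = open_rings.pop(ch)
                bonds.append((start, previous, max(order, pending)))
            else:
                open_rings[ch] = (previous, pending)
            pending = 1
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


def molecular_formula(atoms, bonds):
    """Fill free valences with hydrogens and return the formula (C, H, then A-Z)."""
    used = Counter()
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    counts = Counter(atoms)
    counts["H"] = sum(VALENCE[symbol] - used[i] for i, symbol in enumerate(atoms))
    elements = ["C", "H"] + sorted(el for el in counts if el not in ("C", "H"))
    return "".join(f"{el}{counts[el] if counts[el] > 1 else ''}"
                   for el in elements if counts[el] > 0)


atoms, bonds = parse_smiles(AZITHROMYCIN)
print(len(atoms), len(bonds))            # → 52 54
print(molecular_formula(atoms, bonds))   # → C38H72N2O12
```

The same atom and bond lists could equally be written out as an SD file or used for substructure search, which is why the connection table is the useful target of the image conversion.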

 

In this highly interdisciplinary domain, interesting information is often presented as a combination of text and graphics. Combining textual information extraction methods with chemoCR™ for the multimodal extraction of Markush structures from patents and QSAR tables has not been addressed yet.

The example patent page on the left shows a Markush structure. Support for this kind of structure will be part of future work.

We are currently looking into reaction schemes.

Please contact us if you want to start a collaboration on these topics.

Challenge

Chemical entities can appear in scientific texts as trivial names, brand names, assigned catalog names, or IUPAC names. However, the preferred representation of chemical entities is often a two-dimensional depiction of the chemical structure. Depictions can be found as images in nearly all electronic sources of chemical information (e.g. journals, reports, patents, and web interfaces of chemical databases).

Nowadays these images are generated with special drawing programs, either automatically from connection table file formats or by the chemist through a graphical user interface. Although drawing programs can produce and store the information in a computer-readable format, chemical structure depictions are published as bitmap images (e.g. GIF for web interfaces or BMP for text documents). As a consequence, the structure information can no longer be used as input to chemical analysis software packages. To make published chemical structure information available in a computer-readable format, images representing chemical structures have to be manually converted by redrawing every structure. This is a time-consuming and error-prone process.

Have some fun redrawing it ...

 

SD file (in ZIP) for the depicted molecule, Download [ZIP, 16.0 KB]

a nicely drawn molecule


Technical Specification

The chemoCR™ core functionality is based on platform-independent Java libraries, although the licensing mechanism relies on operating-system-dependent libraries. We have added some prototype interfaces to external commercial software, which can be installed optionally.

It has been extensively tested on UNIX™ operating systems (Fedora Linux, Sun Solaris) and on Windows XP™. Users can run the software interactively through a graphical user interface or distributed in batch processing mode in a grid-enabled hardware environment.

The following configurations have been tested, but the software should not be limited to them:

Workstation Configuration

  • i686 or SPARC architecture
  • Linux (Fedora 3-6, SuSE 10.0), Windows XP, Solaris 9
  • Java Runtime Environment (JRE) v1.5 or newer from Sun