ProMiner
Fraunhofer Institute for Algorithms and Scientific Computing SCAI
Recognition and normalization of named entities in scientific text

- Highlighting in text corresponding to the entity classes.
Up-to-date information about biomedical entities such as genes, proteins, diseases or drugs is often found not in structured databases but in scientific text. For specific information retrieval or information extraction, recognising these terms and normalising them to database entries (e.g. gene names to Entrez Gene) or to structured vocabularies and ontologies (e.g. GO/MeSH/UMLS) is a prerequisite. The need for normalisation implies the use of dictionaries generated from these sources and the inclusion of direct mappings. As databases and ontologies evolve rapidly, automated updating and processing is needed to generate comprehensive and specific dictionaries. The high ambiguity of terms and acronyms used in the Life Science domain further complicates precise recognition.
- Challenges
- Proof of Performance
- Application Fields
- Available dictionaries
- Dictionary independent recognition
- Text format
- References
Challenges
Challenges for named entity recognition in biomedical text
Scientific publications found in abstract databases, full-text journals or patents are the main and most up-to-date information source, but the amount of text is overwhelming for most Life Science areas. Recognition of Life Science terminology is a key prerequisite for performing automatic information retrieval and information extraction. Huge and complex terminologies with high numbers of synonymous expressions, ambiguous terms and a constant stream of newly coined names and classes make named entity recognition a real challenge. ProMiner is a tool for specific terminology recognition and addresses several fundamental issues in named entity recognition in the field of Life Sciences:
- ProMiner can handle voluminous dictionaries, complex thesauri and large controlled vocabularies derived from ontologies
- Regularly updated dictionaries through automatic curation followed by a manual evaluation process
- Mapping of synonyms to reference names and data sources
- Context-dependent disambiguation of biomedical terms and resolution of acronyms
- Specific handling of synonyms that are also common English words
- Spelling variants of expressions in the source dictionary can be recognized
- High speed tagging and parallel workflow for multiple dictionaries
- Incorporation of regular expressions (e.g. for the recognition of SNP rs numbers)
- Full text annotation in XML, HTML or PDF format
- Patent annotation
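The mapping of synonyms and spelling variants to reference entries listed above can be illustrated with a minimal sketch. It is not ProMiner's implementation: the toy dictionary, the `GENE:…` identifiers, the function names and the normalisation rule (ignore case, whitespace and hyphens) are all assumptions made for this example.

```python
import re

# Hypothetical toy dictionary: reference ID -> synonyms. The IDs and entries
# are made up for illustration; they are not taken from a ProMiner dictionary.
DICTIONARY = {
    "GENE:0001": ["interleukin 2", "IL-2"],
    "GENE:0002": ["tumor necrosis factor alpha", "TNF-alpha"],
}

# A token is a run of letters/digits, optionally joined by internal hyphens.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def normalize(term):
    """Lower-case and drop whitespace and hyphens so simple variants match."""
    return re.sub(r"[\s\-]+", "", term.lower())


def build_index(dictionary):
    """Map every normalized synonym to its reference ID."""
    return {
        normalize(synonym): ref_id
        for ref_id, synonyms in dictionary.items()
        for synonym in synonyms
    }


def tag(text, index, max_tokens=4):
    """Greedy longest-match lookup over windows of up to max_tokens tokens."""
    spans = [m.span() for m in TOKEN.finditer(text)]
    hits = []
    i = 0
    while i < len(spans):
        for n in range(min(max_tokens, len(spans) - i), 0, -1):
            start, end = spans[i][0], spans[i + n - 1][1]
            ref_id = index.get(normalize(text[start:end]))
            if ref_id is not None:
                hits.append((start, end, text[start:end], ref_id))
                i += n  # continue after the matched window
                break
        else:
            i += 1  # no window starting here matched
    return hits


index = build_index(DICTIONARY)
for hit in tag("Interleukin-2 (IL 2) induces TNF alpha.", index):
    print(hit)
```

In this sketch the variants "Interleukin-2", "IL 2" and "TNF alpha" are all resolved to their reference IDs even though none of them appears verbatim in the dictionary. Context-dependent disambiguation and acronym resolution are deliberately left out.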

- Found entities can be indexed, ranked and linked to other data.
Proof of Performance

- Results in the international “critical assessment of text mining in biology” (BioCreAtIvE I and II).
ProMiner's performance in recognizing gene and protein names was tested in
the international “critical assessment of text mining in biology” (BioCreAtIvE I and
BioCreAtIvE II), where ProMiner was benchmarked against other industrial and academic
named entity recognition tools. Updated and newly generated dictionaries
are continually evaluated in industrial applications.
Application Fields
Indexing machinery for fast indexing of huge document resources
Customer feedback on ProMiner:
- “We are amazed about its speed and ability to work with large input files”.
- “…impressive to get the combination of information in text with enriched background and experimental data”.
Module for named entity recognition in a larger workflow for information extraction
- Java module with defined input and output streams
- An annotator service for named entities in the Unstructured Information Management Architecture (UIMA) framework
- Already integrated in the TEMIS - BER information extraction environment software
Content generation for the interpretation of large scale experimental data
- Simple output file to fill/supplement database content
- Linkage to other data is easily possible through the provided mapping to databases or controlled vocabulary

- Relation to experimental data, interaction databases or proprietary data through the provided mapping.
Available dictionaries
- Gene and protein name dictionaries for various organisms:
  - Human
  - Mouse
  - Arabidopsis
  - On request: Yeast, Fly, Rat,
- Gene Ontology dictionary
- MeSH disease dictionary
- Organism name dictionary
- Drug name/metabolite dictionary
Dictionary independent recognition
While part of the Life Science terminology in some entity classes can be found with the help of dictionaries, it is not possible to enumerate all names. Examples are IUPAC names or SNP rs numbers. For these cases, we offer other techniques integrated as plugins into ProMiner:
- Machine learning based
  - IUPAC recognition
  - SNP recognition
- Regular expression based
  - rs number recognition
  - chromosomal location
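As an illustration of the regular expression based approach, the following minimal sketch finds SNP rs numbers and cytogenetic chromosomal locations such as 19q13.32. The patterns, labels and function name are assumptions made for this example and are not the expressions shipped with ProMiner. The chromosomal pattern is restricted to human chromosomes (1-22, X, Y) in band notation.

```python
import re

# Assumed patterns for illustration only.
PATTERNS = {
    # dbSNP reference SNP identifiers: "rs" followed by digits.
    "snp_rs": re.compile(r"\brs\d+\b"),
    # Cytogenetic band: chromosome, arm (p/q), band, optional sub-band.
    "chrom_loc": re.compile(
        r"\b(?:1\d|2[0-2]|[1-9]|X|Y)[pq]\d{1,2}(?:\.\d{1,2})?\b"
    ),
}


def find_pattern_entities(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label, match.group()))
    return sorted(hits)


sample = "rs429358 and rs7412 map to 19q13.32; BRCA1 lies on 17q21."
for hit in find_pattern_entities(sample):
    print(hit)
```

Because names of this kind follow a fixed syntax, a pattern covers identifiers that no dictionary could list in advance. The trade-off is that a pattern cannot map a match to a curated database entry on its own.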
Technical Specification and Parameter Settings
- ProMiner is available for UNIX/Linux and Microsoft Windows
- A scheduler for
Text format
- ProMiner supports ASCII text, MEDLINE format, XML, HTML and PDF full text
- Output formats: meta-information, XML-tagged text and HTML
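To give an idea of what XML-tagged text output can look like, the sketch below wraps recognized spans in inline elements. The element name `entity`, the attributes `class` and `id`, and the function name are hypothetical choices for this example. ProMiner's actual output schema is not described here.

```python
from xml.sax.saxutils import escape, quoteattr


def to_xml(text, annotations):
    """Wrap annotated spans in inline <entity> elements.

    annotations: non-overlapping (start, end, entity_class, ref_id) tuples
    with character offsets into text. Element and attribute names are
    assumptions for this sketch.
    """
    parts = []
    pos = 0
    for start, end, entity_class, ref_id in sorted(annotations):
        parts.append(escape(text[pos:start]))  # untouched text before the span
        parts.append(
            "<entity class=%s id=%s>%s</entity>"
            % (quoteattr(entity_class), quoteattr(ref_id), escape(text[start:end]))
        )
        pos = end
    parts.append(escape(text[pos:]))  # remaining text after the last span
    return "".join(parts)


print(to_xml("IL-2 & TNF", [(0, 4, "gene", "GENE:0001"), (7, 10, "gene", "GENE:0002")]))
```

Escaping the surrounding text and the attribute values keeps the result well-formed even when the source text contains characters such as `&` or `<`.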
Annotation of PDF
The increasing number of electronically available full-text publications makes it
possible to process these documents and annotate the knowledge stored in them.
Integrated in the ProMiner software, we offer a special PDF plugin for the
annotation of PDF documents. Here the annotations are written directly into the
PDF output.
For semantic search and visualization, we offer the semantic search engine SCAIView.
Further information about SCAIView: www.scai.fraunhofer.de/scaiview.
References
Fluck, J., Mevissen T.H., Dach, H., Oster M., Hofmann-Apitius M. ProMiner: Recognition of Human Gene and Protein Names using regularly updated Dictionaries. Second BioCreAtIvE Challenge Workshop 2006, Critical Assessment of Information Extraction in Molecular Biology, Madrid, Spain.
Karopka T, Fluck J, Mevissen HT, Glass A. The Autoimmune Disease Database: a dynamically compiled literature-derived database. BMC Bioinformatics. 2006 Jun 27;7:325. http://www.biomedcentral.com/1471-2105/7/325
Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer and Juliane Fluck. ProMiner: Rule based protein and gene entity recognition. BMC Bioinformatics 2005, 6(Suppl 1):S14. http://www.biomedcentral.com/1471-2105/6/S1/S14
Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer and Juliane Fluck. ProMiner: Organism specific protein name detection using approximate string matching. EMBO Workshop: A critical assessment of text mining methods in molecular biology, Granada, Spain, March 28 - March 31, 2004.
Hanisch D, Fluck J, Mevissen HT, Zimmer R. Playing biology's name game: identifying protein names in scientific text. Pac Symp Biocomput. 2003:403-14.
