ProMiner

© Photo Fraunhofer SCAI

Results in the international “Critical Assessment of Text Mining in Biology” (BioCreAtIvE I and II).

ProMiner's recognition of gene and protein names was evaluated in the international “Critical Assessment of Text Mining in Biology” challenges (BioCreAtIvE I and BioCreAtIvE II), where it was benchmarked against other industrial and academic named entity recognition tools. Updated and newly generated dictionaries are continually evaluated in industrial applications.

Challenges for named entity recognition in biomedical text

Scientific publications in abstract databases, full text journals and patents are the main and most up-to-date source of information, but in most life science areas the amount of text is overwhelming. Recognizing life science terminology is a key prerequisite for automatic information retrieval and information extraction. Named entity recognition is challenged by huge and complex terminologies with many synonymous expressions, by ambiguous terms, and by the constant creation of new names and classes. ProMiner is a tool for recognizing specific terminology and addresses several fundamental issues of named entity recognition in the life sciences:

  • handling of voluminous dictionaries, complex thesauri and large controlled vocabularies derived from ontologies
  • regularly updated dictionaries through automatic curation followed by a manual evaluation process
  • mapping of synonyms to reference names and data sources
  • context-dependent disambiguation of biomedical terms and resolution of acronyms
  • specific handling of synonyms that are also common English words
  • recognition of spelling variants of expressions in the source dictionary
  • high-speed tagging and parallel workflow for multiple dictionaries
  • incorporation of regular expressions (e.g. for the recognition of SNP rs numbers)
  • full text annotation in XML, HTML or PDF format
  • patent annotation
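
To make the ideas of synonym mapping and spelling-variant tolerance concrete, the following is a minimal Python sketch of dictionary-based tagging. It is not ProMiner's implementation: the toy dictionary, the identifiers such as GENE:0001, the function names and the normalization rule (lowercasing and dropping hyphens and whitespace) are all assumptions chosen purely for illustration.

```python
import re

# Hypothetical toy dictionary: reference identifier -> synonyms.
# Identifiers and entries are illustrative, not ProMiner data.
DICTIONARY = {
    "GENE:0001": ["interleukin 2", "IL-2", "IL2"],
    "GENE:0002": ["tumor necrosis factor", "TNF-alpha", "TNF"],
}


def normalize(term):
    """Lowercase and drop hyphens/whitespace so 'IL-2', 'IL 2' and 'il2' coincide."""
    return re.sub(r"[\s\-]+", "", term.lower())


def build_index(dictionary):
    """Map every normalized synonym to its reference identifier."""
    index = {}
    for ref_id, synonyms in dictionary.items():
        for synonym in synonyms:
            index[normalize(synonym)] = ref_id
    return index


def tag(text, index, max_tokens=3):
    """Greedy longest-match lookup; returns (start, end, surface, ref_id) tuples."""
    tokens = [(m.start(), m.end()) for m in re.finditer(r"[A-Za-z0-9\-]+", text)]
    hits = []
    i = 0
    while i < len(tokens):
        step = 1
        # Try the longest token window first, then shrink it.
        for n in range(min(max_tokens, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            surface = text[start:end]
            ref_id = index.get(normalize(surface))
            if ref_id is not None:
                hits.append((start, end, surface, ref_id))
                step = n
                break
        i += step
    return hits


if __name__ == "__main__":
    index = build_index(DICTIONARY)
    for hit in tag("Il 2 and TNF-alpha levels rose.", index):
        print(hit)
```

A sketch like this does not cover context-dependent disambiguation, acronym resolution or the handling of common English words; those steps need information beyond a plain lookup table.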

Found entities can be indexed, ranked and linked to other data.

Indexing machinery for fast indexing of huge document resources

Customer feedback on ProMiner:

  • “We are amazed about its speed and ability to work with large input files.”
  • “…impressive to get the combination of information in text with enriched background and experimental data.”
     

Module for named entity recognition in a larger workflow for information extraction

  • Java module with defined input and output streams
  • an annotator service for named entities in the Unstructured Information Management Architecture (UIMA) framework
  • already integrated in the TEMIS - BER information extraction environment software
     

Content generation for the interpretation of large scale experimental data

  • simple output file to fill or supplement database content
  • easy linkage to other data through the provided mapping to databases or controlled vocabularies

Relation to experimental data, interaction databases or proprietary data through the provided mapping.

  • gene and protein name dictionaries for various organisms:
    • human
    • mouse
    • Arabidopsis
    • on request: yeast, fly, rat
  • Gene Ontology dictionary
  • MeSH disease dictionary
  • organism name dictionary
  • drug name/metabolite dictionary

While part of the life science terminology in some entity classes can be found with the help of dictionaries, it is not possible to enumerate all names. Examples are IUPAC names or SNP rs numbers. For these cases, we offer other techniques integrated into ProMiner as plugins:

  • machine learning based:
    • IUPAC recognition
    • SNP recognition
  • regular expression based:
    • rs number recognition
    • chromosomal location
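
As an illustration of the regular expression based approach, here is a minimal Python sketch for spotting SNP rs numbers in text. The pattern and the function name are assumptions made for this example and are not taken from ProMiner; the sketch only relies on the convention that an rs number is the lowercase prefix "rs" followed by digits.

```python
import re

# Assumed pattern: "rs" followed by a number without leading zeros,
# delimited by word boundaries so that words like "rsync" are ignored.
RS_PATTERN = re.compile(r"\brs[1-9][0-9]*\b")


def find_rs_numbers(text):
    """Return (start, end, identifier) for every rs number found in the text."""
    return [(m.start(), m.end(), m.group()) for m in RS_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = "Variants rs1801133 and rs429358 were genotyped; rsync is not a SNP."
    for match in find_rs_numbers(sample):
        print(match)
```

Identifiers of this kind cannot be listed exhaustively in a dictionary, which is why a pattern is the more natural fit here.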

Technical Specification and Parameter settings

  • ProMiner is available for UNIX/Linux and Microsoft Windows
  • a scheduler for

  • ProMiner supports ASCII text, MEDLINE format, XML, HTML and PDF full text
  • output formats: meta-information, XML-tagged text and HTML

Annotation of PDF

The increasing number of electronically available full text publications makes it possible to process these documents and annotate the knowledge stored in them.
ProMiner includes a dedicated PDF plugin for annotating PDF documents; the annotations are written directly into the PDF output.
For semantic search and visualization, we offer the semantic search engine SCAIView.
Further information about SCAIView: www.scai.fraunhofer.de/scaiview.

Gurulingappa, H., Klinger, R., Hofmann-Apitius, M., and Fluck, J.: An Empirical Evaluation of Resources for the Identification of Diseases and Adverse Effects in Biomedical Literature. In 2nd Workshop on Building and evaluating resources for biomedical text mining (7th edition of the Language Resources and Evaluation Conference), Valletta, Malta, May 2010.

Fluck, Juliane; Mevissen, Heinz Theodor; Dach, Holger; Oster, Marius, and Hofmann-Apitius, Martin: ProMiner: Recognition of Human Gene and Protein Names using regularly updated Dictionaries. Proceedings of the Second BioCreative Challenge Evaluation Workshop, 2007, Critical Assessment of Information Extraction in Molecular Biology, Madrid, Spain.

Karopka T, Fluck J, Mevissen HT, Glass A.: The Autoimmune Disease Database: a dynamically compiled literature-derived database. BMC Bioinformatics. 2006 Jun 27;7:325. http://www.biomedcentral.com/1471-2105/7/325

Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer and Juliane Fluck: ProMiner: Rule based protein and gene entity recognition. BMC Bioinformatics 2005, 6(Suppl 1):S14. http://www.biomedcentral.com/1471-2105/6/S1/S14

Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer and Juliane Fluck: ProMiner: Organism specific protein name detection using approximate string matching. EMBO Workshop: A critical assessment of text mining methods in molecular biology, Granada, Spain, March 28 - March 31, 2004.

Hanisch D, Fluck J, Mevissen HT, Zimmer R.: Playing biology's name game: identifying protein names in scientific text. Pac Symp Biocomput. 2003:403-14.