ProMiner

Benefits:

  • Can handle voluminous dictionaries, complex thesauri and large controlled vocabularies derived from ontologies
  • Maps to database references or ontologies
  • Gene and Protein Name Recognition tested in the international »Critical Assessment of Text Mining in Biology«
 

Software ProMiner

A tool for specific terminology recognition.
It addresses several fundamental issues in named entity recognition in the field of life sciences.

Challenges for named entity recognition in biomedical text

Scientific publications found in abstract databases, full-text journals or patents are the main and most up-to-date source of information, but the volume of text is overwhelming in most areas of the life sciences. Recognizing life science terminology is a key prerequisite for automatic information retrieval and information extraction. Huge and complex terminologies with large numbers of synonyms, ambiguous terms and a steady stream of newly coined names and classes make named entity recognition a real challenge. ProMiner is a tool for specific terminology recognition and addresses several fundamental issues in named entity recognition in the field of life sciences:

  • ProMiner can handle voluminous dictionaries, complex thesauri and large controlled vocabularies derived from ontologies
  • regularly updated dictionaries through automatic curation followed by a manual evaluation process
  • mapping of synonyms to reference names and data sources
  • context-dependent disambiguation of biomedical terms and resolution of acronyms
  • specific handling of common English word synonyms
  • spelling variants of expressions in the source dictionary can be recognized
  • high speed tagging and parallel workflow for multiple dictionaries
  • incorporation of regular expressions (e.g. for the recognition of SNP rs numbers)
  • full text annotation in XML, HTML or PDF format
  • patent annotation

Figure: Found entities can be indexed, ranked and linked to other data. © Fraunhofer SCAI
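
The dictionary-based features listed above (mapping synonyms to reference names, tolerating spelling variants) can be illustrated with a minimal sketch. This is not ProMiner's implementation or API: the toy dictionary, the function names `normalize`, `build_matcher` and `tag`, and the variant rule (case-insensitive matching with an optional space or hyphen between tokens) are all assumptions chosen for illustration. Disambiguation and the handling of common English words are deliberately left out.

```python
import re

# Hypothetical toy dictionary: reference ID -> synonyms.
# The IDs and layout are for illustration only, not ProMiner's format.
DICTIONARY = {
    "HGNC:11998": ["TP53", "p53", "tumor protein p53"],
    "HGNC:1100": ["BRCA1", "breast cancer 1"],
}

SEPARATORS = r"[\s-]"


def normalize(term):
    """Lowercase a term and drop spaces and hyphens."""
    return re.sub(SEPARATORS + "+", "", term).lower()


def build_matcher(dictionary):
    """Return (compiled pattern, normalized synonym -> reference ID)."""
    lookup = {}
    variants = []
    for ref_id, synonyms in dictionary.items():
        for synonym in synonyms:
            lookup[normalize(synonym)] = ref_id
            tokens = re.split(SEPARATORS + "+", synonym)
            # Allow an optional space or hyphen between tokens.
            variants.append((SEPARATORS + "?").join(map(re.escape, tokens)))
    # Longest variants first so multi-word names win over their substrings.
    variants.sort(key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(variants) + r")\b", re.IGNORECASE)
    return pattern, lookup


def tag(text, pattern, lookup):
    """Return (start, end, matched text, reference ID) for each hit."""
    return [
        (m.start(), m.end(), m.group(), lookup[normalize(m.group())])
        for m in pattern.finditer(text)
    ]


PATTERN, LOOKUP = build_matcher(DICTIONARY)
for hit in tag("Loss of tumor-protein P53 and brca1 was observed.", PATTERN, LOOKUP):
    print(hit)
```

In this sketch both "tumor-protein P53" and "brca1" resolve to their reference IDs even though neither spelling appears literally in the dictionary, which is the effect the synonym mapping and spelling-variant features aim for.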

Available Dictionaries

Gene and protein name dictionaries for various organisms:

  • human
  • mouse
  • arabidopsis
  • on request: yeast, fly, rat

Chemical dictionaries:

  • ChEMBL
  • ChEBI
  • DrugBank

Other biological entities:

  • miRNAs
  • SNPs
  • Diseases
  • Biological processes

Meta Data Terminologies:

  • Taxonomy
  • Uberon (Anatomy)
  • Cell ontology
  • Cell line ontology
  • Cell structure

Other dictionaries can be provided on demand or integrated by customers themselves.

Dictionary independent recognition:

  • machine learning based
  • IUPAC recognition
  • SNP recognition
  • regular expression based
  • rs number recognition
  • chromosomal location
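
Regular-expression-based recognition of the kind listed above can be sketched as follows. The two patterns are assumptions written for this example (dbSNP-style rs numbers, and human cytogenetic bands such as 17p13.1); they are not the expressions ProMiner ships, and `find_entities` is a hypothetical helper name.

```python
import re

# Assumed patterns, written for this sketch only.
PATTERNS = {
    # dbSNP reference SNP identifiers: "rs" followed by digits.
    "SNP_RS_NUMBER": re.compile(r"\brs\d+\b"),
    # Human cytogenetic bands such as 17p13.1 or Xq28.
    "CHROMOSOMAL_LOCATION": re.compile(
        r"\b(?:[1-9]|1\d|2[0-2]|X|Y)[pq]\d{1,2}(?:\.\d{1,2})?\b"
    ),
}


def find_entities(text):
    """Return (start offset, label, matched text) tuples sorted by offset."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), label, m.group()))
    return sorted(hits)


for hit in find_entities("rs429358 lies on 19q13.32, while TP53 maps to 17p13.1."):
    print(hit)
```

Because no dictionary is involved, patterns like these also catch identifiers that were never listed anywhere, at the cost of needing careful boundaries to avoid matching inside longer tokens.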

Indexing machinery for fast indexing of huge document resources

Module for named entity recognition in a larger workflow for information extraction

  • Java module with defined input and output streams
  • an annotator service for named entities in the Unstructured Information Management Architecture (UIMA) framework
  • already integrated in the TEMIS - BER information extraction environment software
     

Content generation for the interpretation of large scale experimental data

  • simple output file to fill or supplement database content
  • linkage to other data is straightforward through the provided mapping to databases or controlled vocabularies

Figure (ProMiner example): Relation to experimental data, interaction databases or proprietary data through the provided mapping. © Fraunhofer SCAI
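
How a shared reference identifier makes this linkage possible can be shown with a small sketch. Everything here is hypothetical: the column names, the document IDs, the measurement values and the function `write_linked_tsv` are invented for illustration and do not describe ProMiner's actual output format.

```python
import csv
import io


def write_linked_tsv(annotations, measurements, out):
    """Write one tab-separated row per annotation, joined to experimental
    data through the shared reference ID (empty field if no data exists)."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["doc_id", "matched_text", "reference_id", "measurement"])
    for doc_id, matched_text, ref_id in annotations:
        writer.writerow([doc_id, matched_text, ref_id, measurements.get(ref_id, "")])


# Hypothetical tagger output: (document ID, text as found, mapped reference ID).
annotations = [("doc1", "p53", "HGNC:11998"), ("doc2", "brca1", "HGNC:1100")]
# Hypothetical experimental values keyed by the same reference IDs.
measurements = {"HGNC:11998": 2.4}

buffer = io.StringIO()
write_linked_tsv(annotations, measurements, buffer)
print(buffer.getvalue())
```

The join works only because the text annotations and the experimental data use the same identifier scheme, which is what the mapping to databases or controlled vocabularies provides.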

References

Gurulingappa, H., Klinger, R., Hofmann-Apitius, M., and Fluck, J.: An Empirical Evaluation of Resources for the Identification of Diseases and Adverse Effects in Biomedical Literature. In 2nd Workshop on Building and evaluating resources for biomedical text mining (7th edition of the Language Resources and Evaluation Conference), Valletta, Malta, May 2010

Fluck, Juliane; Mevissen, Heinz Theodor; Dach, Holger; Oster, Marius, and Hofmann-Apitius, Martin: ProMiner: Recognition of Human Gene and Protein Names using regularly updated Dictionaries. Proceedings of the Second BioCreative Challenge Evaluation Workshop, 2007, Critical Assessment of Information Extraction in Molecular Biology, Madrid, Spain.

Karopka T, Fluck J, Mevissen HT, Glass A.: The Autoimmune Disease Database: a dynamically compiled literature-derived database. BMC Bioinformatics. 2006 Jun 27;7:325. http://www.biomedcentral.com/1471-2105/7/325

Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer and Juliane Fluck: ProMiner: Rule based protein and gene entity recognition. BMC Bioinformatics 2005, 6(Suppl 1):S14. http://www.biomedcentral.com/1471-2105/6/S1/S14

Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer and Juliane Fluck: ProMiner: Organism specific protein name detection using approximate string matching. EMBO Workshop: A critical assessment of text mining methods in molecular biology, Granada, Spain, March 28 - March 31, 2004.

Hanisch D, Fluck J, Mevissen HT, Zimmer R.: Playing biology's name game: identifying protein names in scientific text. Pac Symp Biocomput. 2003:403-14.