Corpora for Chemical Entity Recognition

Fraunhofer Institute for Algorithms and Scientific Computing SCAI

Corpora for Named Entity Recognition of Chemical Compounds

The test corpus described in [Kolarik et al. 2008] is provided in the following format:

  • Each entry starts with a # followed by its PMID
  • Each token line has the following columns:
    1. Token
    2. Start index
    3. End index
    4. Full untokenized entity
    5. Class (B-class|I-class|O)
      • B- means: beginning of an entity
      • I- means: continuation of an entity
      • O means: none of the defined entities
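
As an illustration of the format above, the following is a minimal Python sketch of a reader, not an official tool. It rests on assumptions that this page does not confirm: columns are taken to be tab-separated, the fourth column is taken to be empty for O tokens, and a line is treated as an entry header only if it starts with # and has fewer than five columns (so a literal "#" token is not mistaken for a header). The names Token, parse_corpus and bio_spans, the PMID and the class label TRIVIAL in the sample are hypothetical. Whether the end index is inclusive or exclusive is not stated here, so the sketch passes the indices through unchanged.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Tuple


@dataclass
class Token:
    text: str    # column 1: token
    start: int   # column 2: start index
    end: int     # column 3: end index
    entity: str  # column 4: full untokenized entity (assumed empty for O)
    label: str   # column 5: B-class, I-class, or O


def parse_corpus(lines: Iterable[str], sep: str = "\t") -> Iterator[Tuple[str, List[Token]]]:
    """Yield (pmid, tokens) per entry. The separator is an assumption."""
    pmid: Optional[str] = None
    tokens: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        cols = line.split(sep)
        if line.startswith("#") and len(cols) < 5:
            # Entry header: '#' followed by the PMID.
            if pmid is not None:
                yield pmid, tokens
            pmid, tokens = line[1:].strip(), []
            continue
        if len(cols) < 5:
            raise ValueError(f"expected 5 columns, got {len(cols)}: {line!r}")
        text, start, end, entity, label = cols[:5]
        tokens.append(Token(text, int(start), int(end), entity, label))
    if pmid is not None:
        yield pmid, tokens


def bio_spans(tokens: List[Token]) -> List[Tuple[str, int, int]]:
    """Group B-/I- labelled tokens into (class, start, end) spans."""
    spans: List[Tuple[str, int, int]] = []
    current: Optional[list] = None  # [class, start, end] of the open span
    for tok in tokens:
        if tok.label.startswith("B-"):
            if current:
                spans.append(tuple(current))
            current = [tok.label[2:], tok.start, tok.end]
        elif tok.label.startswith("I-") and current and current[0] == tok.label[2:]:
            current[2] = tok.end  # extend the open span
        else:
            # 'O', or an I- tag without a matching open span: close any span.
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans


# Invented sample data for illustration only; not taken from the corpus.
SAMPLE = (
    "#12345678\n"
    "Effects\t0\t7\t\tO\n"
    "of\t8\t10\t\tO\n"
    "acetic\t11\t17\tacetic acid\tB-TRIVIAL\n"
    "acid\t18\t22\tacetic acid\tI-TRIVIAL\n"
)

for sample_pmid, sample_tokens in parse_corpus(SAMPLE.splitlines()):
    print(sample_pmid, bio_spans(sample_tokens))
```

If the downloaded files turn out to use a different delimiter, passing another value for sep should be enough; any other deviation from the assumptions above would need a corresponding change to the header check.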

The corpora from [Klinger et al. 2008] do not include the untokenized entities and have a differently formatted header (also starting with #).

[Kolarik et al. 2008] Corinna Kolářik, Roman Klinger, Christoph M. Friedrich, Martin Hofmann-Apitius, and Juliane Fluck. Chemical Names: Terminological Resources and Corpora Annotation. In Workshop on Building and evaluating resources for biomedical text mining (6th edition of the Language Resources and Evaluation Conference), Marrakech, Morocco, 2008.

[Klinger et al. 2008] Roman Klinger, Corinna Kolářik, Juliane Fluck, Martin Hofmann-Apitius, and Christoph M. Friedrich. Detection of IUPAC and IUPAC-like Chemical Names. Bioinformatics, 24(13):i268-i276, 2008.

To download the corpus, please fill out the form below.