UBO-OPS


The Open PHACTS project aims to deliver a single view across available pharmacological information sources. Several approaches to integrating life science data using Semantic Web technologies have been described in the literature. However, these approaches have largely ignored the vast amount of content available only within the scientific literature. Results from text are vital, as they are often more up to date than curated databases and may contain information that is not included in such databases. A key goal of Open PHACTS is to provide the capability to query over textual and database data together.

Main tasks on Open PHACTS

UBO deals with major aspects of information retrieval (WP 1) and of named entity recognition and text mining in various sources of scientific text (WP 4). The technological focus lies on content analysis and representation in the context of the Semantic Web (identification and evaluation of sources for chemical and pharmaceutical concepts) and on data generation through automated procedures (text mining, image mining, relationship mining). Based on open source technology, a data generation pipeline for semantic content is being developed. Chemical vocabularies and pattern recognition algorithms are adapted and productively used in information retrieval, entity recognition and information extraction. The main contributions of UBO will be in WPs 1, 2 and 4.


Open PHACTS (Open Pharmacological Concepts Triple Store) is a knowledge management project of the Innovative Medicines Initiative (IMI), a unique partnership between the European Community and the European Federation of Pharmaceutical Industries and Associations (EFPIA).

University of Bonn (UBO)

The Bonn-Aachen International Center for Information Technology (B-IT) is a center for research and education jointly established by the University of Bonn, the Technical University of Aachen and the Fraunhofer Society. The Life Science Informatics curriculum at B-IT is led by Martin Hofmann-Apitius, who is at the same time Head of the Department of Bioinformatics at the Fraunhofer Institute for Algorithms and Scientific Computing (SCAI). In the Open PHACTS project consortium, Martin Hofmann-Apitius and his team participate through their affiliation with the University of Bonn. Concepts and results of the Open PHACTS project are presented in the B-IT curriculum.

Overview of UBO's expected contributions and activities

This page summarizes the contributions and activities that the Life Science Informatics Department of the Bonn-Aachen International Center for Information Technology (B-IT) expects to accomplish. It first outlines the role of UBO (the University of Bonn, as the legal organization) within the project, then lists the expected main contributions and names the deliverables to which these contributions are related.

Role of UBO within Open PHACTS

UBO deals with major aspects of information retrieval, named entity recognition and text mining in various sources of scientific text. The focus of the technology development lies on content analysis (identification and evaluation of sources for chemical and pharmaceutical entities) and data generation through automated procedures (text mining, image mining). Based on open source technology (UIMA), a data generation pipeline is established. Chemical vocabularies and pattern recognition algorithms will be developed, derived from the central mapping service (IMS), and productively used in information retrieval, entity recognition and information extraction. The goal is to link text mining results directly with other Semantic Web resources via the Open PHACTS infrastructure (the linked data cache). The main contributions of UBO are in WPs 1 (17.5 PM), 2 (6 PM) and 4 (45 PM).

UBO is dedicating a full-time bioinformatics programmer over the full course of the project. Dr. Zimmermann, an experienced group leader at Fraunhofer SCAI, coordinates the work of UBO. In addition, the entire know-how of the research group of the Department of Bioinformatics at Fraunhofer SCAI is available for the work planned in Open PHACTS. We will bring in our experience in:

  • Multimodal information extraction (text + image, ProMiner+chemoCR)
  • Bio-medical & chemical dictionary creation
  • Document annotation & retrieval (SCAIView)
  • Large scale content production (UIMA, UNICORE)

Expected contributions

The main contribution of UBO is the structuring of unstructured information resources, i.e. harvesting triples from the public literature.

  • WP 1: We use our named entity recognition (NER) methods to harvest new concepts from the literature in order to build and enrich ontologies and dictionaries/thesauri. We use our methods for synonym and acronym identification and disambiguation for cross-mapping of concepts.
  • WP 2: Our main focus in this work package is reviewing, curating and validating triples/nanopublications from unstructured resources, and providing methods for displaying triples on the original literature.
  • WP 4: We will make use of the WP 1 vocabularies for concept mapping and for triple extraction using NER. We set up and integrate an extraction pipeline based on UIMA, which allows further tools and methods from other partners to be integrated. We focus on full-text issues, which require more sophisticated methods (e.g. zoning) to extract relevant triples than abstracts do. We are most interested in user interaction and feedback loops in order to define and check metrics and quality issues for our text mining methods. We hope that through Open PHACTS we can identify and define new business models for publishers, since for now we face serious legal issues concerning automated full-text mining.

Focus for the first six to nine months

A rapid demonstrator can be compiled from our existing MEDLINE index. We could instantiate the first linked data cache (LDC) with about 10 different biomedical semantic classes tagged in the whole of MEDLINE. We could directly compare our curated dictionaries to the existing resources of the IRS (ConceptWiki + BridgeDb), checking for overlap, errors and missing concepts. We can directly compare the retrieval results from the LDC to those of our retrieval system SCAIView. We also have methods for annotating concepts within PDF documents.

Expectations and deliverables

Based on the four objectives, UBO has a strong preference for collaborating on, or taking responsibility for, the following deliverables within WP 4:

  • D4.3.13 Deliver report identifying and prioritizing unstructured data (text) sources that can bridge gaps in the OPS information model. Month 9
  • D4.3.14 Deliver report defining the pipeline for automated text extraction to nanopublications (including provenance/annotation) using existing text mining algorithms allowing third party tagging and extraction tools. Month 9
  • D4.3.15 Pipeline from D4.3.14 integrated as a prototype in the general OPS system. Month 18
  • D4.3.16 Develop model for nanopublication extraction from relevant full text journals via interaction with one or more journal publishers. Month 18
  • D4.3.17 A fully automatic production system supporting extraction, editing, citation, metrics, annotation and validation. Month 36

RDF Guidelines

The RDF guidelines are a "How To" on providing data to the Open PHACTS project. The guidelines are not yet published; a link will be provided here as soon as they are openly available.

SWAT4LS 2012 - Poster

Carina Haupt, Andra Waagmeester, Egon Willighagen, Marc Zimmermann: A use case of text mining and its integration to Open PHACTS linked data cache, Semantic Web Applications and Tools for Life Sciences (SWAT4LS), 2012



ISWC 2011 - Poster

Carina Haupt, Paul Groth and Marc Zimmermann: Representing Text Mining Results for Structured Pharmacological Queries, The 10th International Semantic Web Conference (ISWC), 2011



Technical Specification

The RDF data is created by using the UIMA HPC framework.
 

UIMA HPC

The goal of the research project UIMA-HPC is to automate, and hence speed up, the process of knowledge mining in patents. Multi-threaded analysis engines, developed according to the UIMA (Unstructured Information Management Architecture) standard, process texts and images in thousands of documents in parallel. The workflow control and execution capabilities of UNICORE (UNiform Interface to COmputing Resources) make it possible to dynamically allocate resources for every given task in order to achieve the best CPU-time/real-time ratios in an HPC environment.
 

RDF Pipelines

Three pipelines were developed to annotate documents and transform the results into RDF. Each pipeline consists of a reader, several analysis engines, and a writer, which are all easily exchangeable.

UIMA Workflow

Text Mining 2 RDF

The text mining pipeline reads the documents using one of various readers for different formats (e.g. XML, txt, PDF), preprocesses them if necessary, executes NLP tools such as ProMiner or Peregrine to annotate the text, and finally transforms the results into RDF. For this, a schema developed by UBO is used, which is based on the Annotation Ontology.

Preprocessing of the documents is often necessary, either because a file may not contain exactly one document text, or because not all of the file content is to be processed by the NLP tools. One example are the PubMed documents, which are downloadable as XML files containing about 30,000 documents each. These XML files also contain updates of documents, making it necessary to remove outdated versions during preprocessing. Furthermore, only the title and the abstract contained in the XML files are to be processed by the NLP tools; this, too, is prepared during preprocessing.

Since some of these steps have to be performed for documents from different sources, and since documents from different sources may vary in structure even when the same file format is used, the preprocessing is split up into its most basic functions, which are made highly configurable. Examples are the XMLSplitter and the GenericXMLAnnotator, both of which can be configured by a simple configuration file handed over as a command line parameter.

The figure at the top shows an example implementation of a text mining pipeline. Here, PubMed documents are read, preprocessed, and annotated by several ProMiner instances as well as by a Peregrine instance.

Document 2 RDF

The document pipeline is much simpler and just converts the documents into RDF. It normally uses the same readers as the text mining pipeline, followed by a subset of the preprocessing steps and a specific annotator for further document information.

In our example in the figure at the top, the only differences between the document pipeline and the text mining pipeline are the lack of NLP tools and an extended configuration file for the GenericXMLAnnotator, which in this case is no longer part of the preprocessing but the main annotation step.
 

Concept 2 RDF

The concept pipeline transforms the dictionaries and concept mapping files used by the NLP tools into RDF. For Open PHACTS, a new text mining schema has been developed (cf. RDF Schema, Publications). These files are therefore annotated by individually developed annotation tools.

The UBO schema is based on the Annotation Ontology and consists of three parts: text mining data, document database, and mapping/concept store. Each part contains provenance and license information.

Center: Text Mining Data

The middle part of the schema contains the annotations made by the text mining tools. It is strongly based on the AO schema, extended by the str (nlp2rdf.lod2.eu/schema/string/) and sso (nlp2rdf.lod2.eu/schema/sso/) ontologies from NIF.

The starting point of this part is the ao:exactQualifier, which represents an annotation and can be found in the center. An ao:exactQualifier describes a text mining match of a concept at a specific position in a document. The annotation node is connected to the document node and the concept node, as well as to an annotation context node of type aos:Selector, which stores the offset of the annotation in the document, the matched text, and a confidence value, which is necessary to describe the quality of the automatically generated annotation.
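
As an illustration, annotations can be retrieved with a query along the following lines. This is only a sketch: the class and property names follow the description above and the AO vocabulary, the confidence property of the UBO extension is given under a placeholder namespace, and the exact terms are defined in the published RDF schema.

    PREFIX ao:  <http://purl.org/ao/core/>
    PREFIX aos: <http://purl.org/ao/selectors/>
    PREFIX ubo: <http://example.org/ubo#>    # placeholder for the UBO extension namespace

    SELECT ?document ?concept ?offset ?text ?confidence
    WHERE {
      ?annotation a ao:ExactQualifier ;             # the annotation node
                  ao:annotatesResource ?document ;  # ... connected to the document
                  ao:hasTopic ?concept ;            # ... and to the concept
                  ao:context ?selector .            # ... and to the annotation context
      ?selector a aos:Selector ;
                aos:offset ?offset ;                # position of the match in the document
                aos:exact ?text .                   # the matched text
      # confidence of the automatic annotation (UBO extension; property name assumed)
      OPTIONAL { ?selector ubo:confidence ?confidence . }
    }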

Left: Document database

The document database part stores all document-centric information. It represents the document in which the annotations are found and is connected to several nodes representing its title, abstract, etc. For this, the dc-terms (purl.org/dc/terms/) vocabulary is used. In our case, we use PMIDs as URIs for documents.
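
For example, the title and abstract of a document can be selected as follows (a sketch; the PMID-based URI shown is only illustrative of the pattern described above):

    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT ?title ?abstract
    WHERE {
      # document URI built from the PMID (pattern illustrative)
      <http://www.ncbi.nlm.nih.gov/pubmed/12345678>
          dcterms:title    ?title ;
          dcterms:abstract ?abstract .
    }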
 

Right: Mapping / Concept Store

A concept can be, for example, a specific protein. The concept node describes it only in an abstract way, due to the absence of a unique naming convention. Concepts from different data sources are mapped to each other through the skos:exactMatch relationship. All possible names are stored in skos:altLabel and skos:hiddenLabel nodes; therefore, any specific name among the synonyms can be used in a SPARQL query to reach the other synonyms via the concept. Next to the synonyms, the context of a concept is also stored, using the rdf:type relation; in our example the context would be ‘Protein’. Concept hierarchies can be modeled using the rdfs:subClassOf predicate, which allows us to include all subclasses of a concept in a query, as sketched below.
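
The following query sketch illustrates this: starting from a single synonym, it finds the concept, expands the hierarchy, and returns all names under which the concept and its subclasses occur. The synonym "p53" is a made-up example; the prefixes are listed in the next section.

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT DISTINCT ?concept ?synonym
    WHERE {
      # find the concept via any of its known names
      ?concept skos:altLabel|skos:hiddenLabel "p53" .
      # include the concept itself and all of its subclasses
      ?narrower rdfs:subClassOf* ?concept .
      # return every name these concepts are known under
      ?narrower skos:altLabel|skos:hiddenLabel ?synonym .
    }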

Used Namespaces
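
The namespace table itself is not reproduced on this page. The prefixes used in the schema and in the example queries below are listed here with the URIs under which these vocabularies are commonly published; the UBO extension prefix is a placeholder, and the authoritative list is given in the RDF schema.

    PREFIX ao:      <http://purl.org/ao/core/>                    # Annotation Ontology, core
    PREFIX aos:     <http://purl.org/ao/selectors/>               # Annotation Ontology, selectors
    PREFIX str:     <http://nlp2rdf.lod2.eu/schema/string/>       # NIF string ontology
    PREFIX sso:     <http://nlp2rdf.lod2.eu/schema/sso/>          # NIF structured sentence ontology
    PREFIX dcterms: <http://purl.org/dc/terms/>                   # DCMI Metadata Terms
    PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>        # SKOS
    PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> # RDF
    PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>       # RDF Schema
    PREFIX ubo:     <http://example.org/ubo#>                     # placeholder for the UBO extension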

Here you find some example queries for our SPARQL endpoint. For a better understanding, or to develop your own queries, please have a look at our RDF schema.

Give me all information for a given document ID. (Authors are skipped here, because this is a 1:n relation.)
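
A sketch of how such a query can look (the document URI pattern is illustrative; prefixes as in the namespace list above):

    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT ?property ?value
    WHERE {
      # all direct properties of the document node
      <http://www.ncbi.nlm.nih.gov/pubmed/12345678> ?property ?value .
      # authors are skipped (1:n relation); property name assumed
      FILTER (?property != dcterms:creator)
    }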


Give me all concepts for a given document ID.
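
A sketch for this query, joining the annotations of the document with the concept store (property names as in the schema sketch above; the document URI is again illustrative):

    PREFIX ao:   <http://purl.org/ao/core/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    SELECT DISTINCT ?concept ?label
    WHERE {
      ?annotation ao:annotatesResource <http://www.ncbi.nlm.nih.gov/pubmed/12345678> ;
                  ao:hasTopic ?concept .
      OPTIONAL { ?concept skos:prefLabel ?label . }  # human-readable name, if present
    }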


Demo

SPARQL Endpoint

A SPARQL endpoint containing the RDF representations of a subset of the PubMed abstracts from 2011, as well as the annotations produced by running ProMiner and Peregrine on these abstracts, can be found at:

http://ops-virtuoso.scai.fraunhofer.de:8893/sparql
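
As a quick check that the endpoint is reachable, a plain SPARQL query that needs no schema knowledge can be issued, for example:

    SELECT (COUNT(*) AS ?triples)
    WHERE { ?s ?p ?o . }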

Data of SPARQL endpoint

The RDF representations of the annotations produced by ProMiner and Peregrine on a subset of 595,548 PubMed abstracts from 2011, which are used in our SPARQL endpoint, can be provided on demand.