Text Mining 2 RDF
The text mining pipeline reads the documents using one of several readers for different formats (e.g. XML, TXT, PDF), preprocesses them if necessary, runs NLP tools such as ProMiner or Peregrine to annotate the text, and finally transforms the results into RDF. The RDF output follows a schema developed by UBO, which is based on the Annotation Ontology.
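As a rough illustration of the final step, the following sketch serializes a single text annotation as N-Triples. The namespace `http://example.org/tm#`, the property names, and the `Annotation` class are hypothetical placeholders chosen for this example; they do not reproduce the UBO schema or the vocabulary of the Annotation Ontology.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical namespace; the real schema uses its own vocabulary.
NS = "http://example.org/tm#"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


@dataclass
class Annotation:
    doc_id: str   # identifier of the annotated document
    start: int    # character offset where the match begins
    end: int      # character offset where the match ends (exclusive)
    text: str     # matched surface text
    concept: str  # URI of the concept assigned by the NLP tool


def escape_literal(value: str) -> str:
    """Escape a string so it can be used as an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def annotation_to_ntriples(ann: Annotation) -> List[str]:
    """Return one N-Triples line per property of the annotation."""
    subject = f"<{NS}annotation/{ann.doc_id}/{ann.start}-{ann.end}>"
    return [
        f"{subject} <{NS}inDocument> <{NS}document/{ann.doc_id}> .",
        f'{subject} <{NS}start> "{ann.start}"^^<{XSD_INTEGER}> .',
        f'{subject} <{NS}end> "{ann.end}"^^<{XSD_INTEGER}> .',
        f'{subject} <{NS}exactText> "{escape_literal(ann.text)}" .',
        f"{subject} <{NS}hasTopic> <{ann.concept}> .",
    ]
```

Each annotation produced by an NLP tool would be passed through a function of this kind, and the resulting lines appended to the output file.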
Preprocessing is often necessary, either because a file does not contain exactly one document text or because only part of the file content should be processed by the NLP tools. PubMed documents are one example: they are downloadable as XML files containing about 30,000 documents each. These XML files also contain updated versions of documents, so outdated documents have to be removed during preprocessing. Furthermore, only the title and the abstract contained in the XML files should be processed by the NLP tool; this selection is also prepared during preprocessing.
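The PubMed case can be sketched as follows. This is a minimal example, not the pipeline's actual preprocessing code: it assumes that a later record with the same PMID supersedes an earlier one, and it relies on the element names of the PubMed XML format (`PubmedArticle`, `MedlineCitation`, `PMID`, `ArticleTitle`, `AbstractText`). The function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def latest_title_and_abstract(xml_text: str) -> Dict[str, str]:
    """Map each PMID to the title and abstract of its last occurrence."""
    root = ET.fromstring(xml_text)
    documents: Dict[str, str] = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        if pmid is None:
            continue  # skip records without an identifier
        title_el = article.find("MedlineCitation/Article/ArticleTitle")
        # itertext() also collects text nested in inline markup such as <i>.
        title = "".join(title_el.itertext()) if title_el is not None else ""
        abstract_parts = [
            "".join(part.itertext())
            for part in article.iterfind(
                "MedlineCitation/Article/Abstract/AbstractText"
            )
        ]
        # Assumption: a later record with the same PMID replaces the earlier one.
        documents[pmid] = "\n".join([title] + abstract_parts).strip()
    return documents
```

Only the returned text, one entry per PMID, would then be handed to the NLP tools.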
Since some of these steps have to be performed for documents from different sources, and documents from different sources may vary in structure even if they use the same file format, the preprocessing is split into its most basic functions, which are highly configurable. Examples are the XMLSplitter and the GenericXMLAnnotator, both of which can be configured by a simple configuration file passed as a command line parameter.
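To illustrate the idea of a configurable splitting step, here is a minimal sketch. It is not the XMLSplitter itself: the JSON configuration format, the `document_element` key, the `--config` option, and both function names are assumptions made for this example.

```python
import argparse
import json
import xml.etree.ElementTree as ET
from typing import Iterator, List, Optional


def split_documents(xml_text: str, document_element: str) -> Iterator[str]:
    """Yield each element named document_element as its own XML string."""
    root = ET.fromstring(xml_text)
    for element in root.iter(document_element):
        element.tail = None  # drop whitespace that follows the element
        yield ET.tostring(element, encoding="unicode")


def main(argv: Optional[List[str]] = None) -> None:
    parser = argparse.ArgumentParser(description="Split an XML file into documents")
    parser.add_argument("--config", required=True,
                        help="path to a JSON configuration file")
    parser.add_argument("input", help="path to the XML file to split")
    args = parser.parse_args(argv)

    # Hypothetical configuration format: {"document_element": "PubmedArticle"}
    with open(args.config, encoding="utf-8") as handle:
        config = json.load(handle)
    with open(args.input, encoding="utf-8") as handle:
        xml_text = handle.read()

    for number, document in enumerate(
        split_documents(xml_text, config["document_element"])
    ):
        print(f"--- document {number} ---")
        print(document)
```

Calling `main(["--config", "splitter.json", "pubmed.xml"])` with such a configuration file would print one block per document. Because the element name comes from the configuration rather than from the code, the same step can be reused for sources whose XML structure differs.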
An example implementation of a text mining pipeline is shown at the top. Here, PubMed documents are read, preprocessed, and annotated by several ProMiner instances as well as by a Peregrine instance.