Integrative Text and Data Analytics

Unstructured information sources such as scientific publications, electronic patient records, and patents are available in very large numbers. Analyzing these sources automatically requires substantial compute resources. Scalable information extraction systems must, however, be optimized for HPC environments and, for example, work with the existing middleware that distributes computationally intensive tasks. Fraunhofer SCAI makes complex text mining workflows executable on HPC environments and demonstrates the scientific use of high-performance computers for information extraction. The services offered concentrate on the cost-effective indexing of company archives, with a focus on chemistry, and on making routine clinical data usable for research and for health economics studies.

Solution approach

The basic idea is to break up complex workflows and data centers into small, independent, distributed services (so-called microservices). We have developed a variety of such services and embedded established analysis tools in them. All services can exchange messages with each other. In contrast to rigidly defined workflows, message-based communication allows tasks to be assigned flexibly to individual services and processed efficiently. Microservices can be started on individual workstations, on dedicated servers, or on our large cluster. We have successfully applied this approach to various application scenarios.
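
The message-based assignment of tasks described above can be illustrated with a minimal sketch. It is not the actual SCAI implementation: it assumes an in-process queue in place of real messaging middleware, and the names run_service, count_tokens, and process_documents are hypothetical. Each service instance pulls the next message whenever it is free, so no document is bound to a particular worker in advance.

```python
import queue
import threading

# Sentinel message telling a service instance to shut down (illustrative convention).
STOP = object()


def count_tokens(text):
    """Stand-in for an embedded analysis tool: counts whitespace-separated tokens."""
    return len(text.split())


def run_service(name, handler, inbox, outbox):
    """Hypothetical service loop: take a message, process it, publish the result."""
    while True:
        message = inbox.get()
        if message is STOP:
            break
        outbox.put({
            "doc_id": message["doc_id"],
            "service": name,
            "result": handler(message["text"]),
        })


def process_documents(documents, n_services=3):
    """Distribute documents to whichever service instance is free next."""
    inbox, outbox = queue.Queue(), queue.Queue()
    services = [
        threading.Thread(
            target=run_service,
            args=(f"token-counter-{i}", count_tokens, inbox, outbox),
        )
        for i in range(n_services)
    ]
    for service in services:
        service.start()

    # Tasks are posted as messages; no document is tied to a fixed worker.
    for doc_id, text in documents.items():
        inbox.put({"doc_id": doc_id, "text": text})

    # One STOP per instance, queued after all tasks so every task is handled first.
    for _ in services:
        inbox.put(STOP)
    for service in services:
        service.join()

    results = {}
    while not outbox.empty():
        message = outbox.get()
        results[message["doc_id"]] = message["result"]
    return results


if __name__ == "__main__":
    docs = {
        "pub-1": "Aspirin inhibits cyclooxygenase",
        "pat-7": "A method for producing a compound",
    }
    print(process_documents(docs))
```

In a distributed deployment the in-process queue would be replaced by a message broker, and each service instance would run as a separate process on a workstation, a server, or a cluster node. The service loop itself would stay largely the same.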

Application scenarios

Pharmaceutical research

Processing of confidential information from patient files