Biomedical knowledge graphs play a central role in big data integration. One of their key strengths is bringing unstructured text into a structured, comparable format. As cause-and-effect models, knowledge graphs can potentially facilitate clinical decision making or help drive research towards precision medicine. Data and Knowledge Management, sometimes also called Information Management, is a core topic of Data Science. It is also an interdisciplinary field touching economics (how efficient and expensive is the solution?), psychology (do people use the solution in the way that was intended?) and, of course, computer science. Our aim is to build sustainable data infrastructure for biomedical data, personalized medicine, drug repurposing, reproducible AI and knowledge discovery.
A "knowledge graph" (sometimes also called a semantic network) is a systematic way to connect information and data points into knowledge. It is thus a crucial concept for generating knowledge and wisdom and for searching within data, information and knowledge. However, the power of knowledge graphs critically depends on context information and data integration. Here we provide a novel semantic approach towards a context-enriched biomedical knowledge graph that utilizes data integration with linked data. This graph concept can be used for graph embeddings applied in different approaches, e.g. with a focus on topic detection and knowledge discovery. Connecting knowledge graphs with context is therefore a key feature. In this project we want to establish a novel systematic approach to knowledge discovery using contexts in knowledge graphs. To this end, we enrich the existing graph structures and build a context hypergraph.
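To make the idea of a context hypergraph more concrete, the following is a minimal sketch and not the project's actual implementation. It assumes that a context (for example an author or a topic) can be modeled as a hyperedge over the set of nodes it applies to. The class name ContextHypergraph, its methods and all identifiers such as "doc:1" are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


class ContextHypergraph:
    """Toy hypergraph: every context is one hyperedge over a set of node ids.

    Illustrative assumption only; the real graph structure may differ.
    """

    def __init__(self):
        self.members = defaultdict(set)   # context id -> node ids in that context
        self.contexts = defaultdict(set)  # node id -> context ids it belongs to

    def add(self, context, nodes):
        """Attach all given nodes to one context (one hyperedge)."""
        for node in nodes:
            self.members[context].add(node)
            self.contexts[node].add(context)

    def shared_contexts(self, a, b):
        """Return the contexts that two nodes have in common."""
        return self.contexts.get(a, set()) & self.contexts.get(b, set())

    def neighbours(self, node):
        """Return all nodes that share at least one context with `node`."""
        result = set()
        for context in self.contexts.get(node, set()):
            result |= self.members[context]
        result.discard(node)
        return result


# Hypothetical example data: three documents, one author context, one topic context.
hg = ContextHypergraph()
hg.add("author:a1", ["doc:1", "doc:2"])
hg.add("topic:t1", ["doc:2", "doc:3"])

common = hg.shared_contexts("doc:1", "doc:2")
related = hg.neighbours("doc:2")
```

In this toy model a document is related to every other document that shares a context with it, which is the kind of relation a context-aware search or embedding could build on.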
We create a proof-of-concept giant knowledge graph using labeled property graphs to test graph algorithms and to provide a feasible environment for applying semantic graph embeddings. It runs in a highly scalable cloud-based service environment. This dense, large-scale labeled property graph test system currently holds more than 75M nodes and 960M edges. The basis for generating our large-scale knowledge graph representation is biomedical literature (e.g. from PubMed and PMC). We also integrated bibliographic data and metadata from DBLP (monthly snapshot release of December 2019, see https://dblp.uni-trier.de/). Since the basic data coming from SCAIView is already annotated with different biomedical ontologies, we decided to annotate the DBLP data with CSO. We also enriched our graph with data from the EU Open Data Portal (CORDIS - EU research projects under Horizon 2020, see https://data.europa.eu/euodp/en/data/dataset/cordisH2020projects). This data set is free to reuse for both commercial and non-commercial purposes. Here, we integrated projects, their status, affiliations, persons and authors of publications mentioned in the data set.
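As a rough illustration of how records from several sources can end up in one labeled property graph, here is a small in-memory sketch. It is an assumption-laden toy and not the cloud-based system described above: the PropertyGraph class, the label names (Document, Person, Author, Project), the relationship types (IS_AUTHOR, PARTICIPATES_IN) and all identifiers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    rel_type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy labeled property graph: labeled nodes and typed edges, both with properties."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def merge_node(self, node_id, labels, **properties):
        """Create the node, or extend labels and properties if it already exists."""
        node = self.nodes.get(node_id)
        if node is None:
            node = Node(node_id, set(labels), dict(properties))
            self.nodes[node_id] = node
        else:
            node.labels |= set(labels)
            node.properties.update(properties)
        return node

    def add_edge(self, source, target, rel_type, **properties):
        """Add a typed, directed edge between two existing nodes."""
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append(Edge(source, target, rel_type, dict(properties)))

    def neighbours(self, node_id, rel_type):
        """Return targets reachable from `node_id` via edges of the given type."""
        return [e.target for e in self.edges
                if e.source == node_id and e.rel_type == rel_type]


# Hypothetical records standing in for literature, DBLP and CORDIS entries.
graph = PropertyGraph()
graph.merge_node("doc:1", {"Document"}, source="PubMed")
graph.merge_node("doc:2", {"Document"}, source="DBLP")
graph.merge_node("person:1", {"Person"}, name="A. Author")
graph.merge_node("person:1", {"Author"})  # same person seen again in another source
graph.merge_node("project:1", {"Project"}, source="CORDIS")
graph.add_edge("person:1", "doc:1", "IS_AUTHOR")
graph.add_edge("person:1", "doc:2", "IS_AUTHOR")
graph.add_edge("person:1", "project:1", "PARTICIPATES_IN")

documents = sorted(graph.neighbours("person:1", "IS_AUTHOR"))
```

Merging on a shared node identifier is what lets a person found in one source connect documents and projects from the others; how such identifiers are matched in practice is outside the scope of this sketch.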
This is the basis for answering semantic questions, for graph queries, and for extensions based on NLP, text mining and FAIR data, and it is a step towards reproducible AI. The graph makes it possible to compare research data records from different sources and to select relevant data sets using graph-theoretical algorithms.
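One simple example of a graph-theoretical comparison is the hop distance between two records, computed with a breadth-first search. The sketch below is only an illustration under assumed data: the edge list, the identifiers and the function names build_adjacency and hop_distance are hypothetical, and the text does not state which algorithms are actually used.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency mapping from (node, node) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def hop_distance(adjacency, start, goal):
    """Return the number of edges on a shortest path, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical toy graph: two documents share a term, one document links to a project via a person.
edges = [
    ("pubmed:doc1", "term:t1"),
    ("term:t1", "dblp:doc2"),
    ("dblp:doc2", "person:p1"),
    ("person:p1", "cordis:project1"),
]
adjacency = build_adjacency(edges)

close = hop_distance(adjacency, "pubmed:doc1", "dblp:doc2")
far = hop_distance(adjacency, "pubmed:doc1", "cordis:project1")
missing = hop_distance(adjacency, "pubmed:doc1", "dblp:doc3")
```

Records with a small hop distance share nearby annotations, persons or projects, so such a measure could serve as one possible signal when selecting related data sets.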