In a recent publication in Nature Scientific Reports, researchers at the Leiden and Rotterdam medical universities used Euretos AI Platform data to train a machine learning model that predicts, with an AUC of 92.2%, whether a drug will be successful in clinical trials. They validated the model with a qualitative analysis of drug repurposing candidates for Autosomal Dominant Polycystic Kidney Disease (this aspect will be discussed in a separate news item).
The feature set for the machine learning model was derived entirely from the Euretos AI Platform and contributed to an unprecedentedly high predictive performance (AUC) of 92.2%.
This was achieved by leveraging three key characteristics of the Euretos AI Platform data: (1) the breadth of integrated data sources, (2) the semantic properties of the data and (3) the networked properties of the Euretos graph model. These aspects are detailed below.
First of all, the breadth of integrated data sources contributed to the performance of the model. The Euretos AI Platform now integrates over 250 textual sources and databases, making it the largest integrated life sciences knowledge base. This provides a very broad background of relevant information for the model. Integrating this much data is such a major effort that, in practice, it is infeasible for a single project. Other research efforts have therefore used a significantly smaller number of data sources, as the researchers state: “However, whereas Himmelstein et al. integrated 29 knowledge sources with each other specifically for their drug prioritization task, we were able to save a considerable amount of time and effort by using an existing knowledge graph.”
Secondly, the researchers were able to “leverage the semantic properties of concepts intermediate to drugs and diseases as features to classify and prioritize drug candidates”. The Euretos AI Platform is unique in being a fully semantically and ontologically integrated knowledge base. In the model, the researchers used two of the many available semantic aspects: (1) semantic types, which are concept categories such as “Enzyme” or “Sign or Symptom”, and (2) semantic groups, which are higher-level abstractions of the semantic types (e.g. “Physiology” or “Disorders”). This added biologically significant value to the data, which was leveraged to great effect and is visualised below:
In total, 130 features were based on the semantic properties of the intermediate concepts between the drugs and the diseases.
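To make the idea of semantic features more concrete, the following is a minimal sketch in Python. It is not the researchers' implementation and does not use the Euretos API: the toy graph, the concept names, the type-to-group mapping and the helper functions `intermediate_concepts` and `semantic_features` are all hypothetical, chosen only to illustrate how counts of semantic types and groups of intermediate concepts could be turned into features for one drug-disease pair.

```python
from collections import Counter

# Hypothetical toy knowledge graph (undirected adjacency sets).
# All concept names are illustrative and not taken from the Euretos platform.
EDGES = {
    "drug_A": {"enzyme_1", "symptom_1", "disease_X"},
    "enzyme_1": {"drug_A", "disease_X"},
    "symptom_1": {"drug_A", "disease_X"},
    "enzyme_2": {"disease_X"},
    "disease_X": {"drug_A", "enzyme_1", "symptom_1", "enzyme_2"},
}

# Assumed semantic type per intermediate concept, and an assumed
# mapping from each semantic type to its higher-level semantic group.
SEMANTIC_TYPE = {
    "enzyme_1": "Enzyme",
    "enzyme_2": "Enzyme",
    "symptom_1": "Sign or Symptom",
}
SEMANTIC_GROUP = {
    "Enzyme": "Chemicals & Drugs",
    "Sign or Symptom": "Disorders",
}


def intermediate_concepts(drug, disease, edges):
    """Concepts connected to both the drug and the disease (two-step paths)."""
    shared = edges.get(drug, set()) & edges.get(disease, set())
    return shared - {drug, disease}


def semantic_features(drug, disease, edges=EDGES):
    """Count semantic types and groups of the intermediate concepts."""
    mids = intermediate_concepts(drug, disease, edges)
    type_counts = Counter(SEMANTIC_TYPE[c] for c in mids)
    group_counts = Counter(SEMANTIC_GROUP[SEMANTIC_TYPE[c]] for c in mids)
    features = {f"type:{t}": n for t, n in type_counts.items()}
    features.update({f"group:{g}": n for g, n in group_counts.items()})
    return features


if __name__ == "__main__":
    print(semantic_features("drug_A", "disease_X"))
```

In this toy graph only `enzyme_1` and `symptom_1` lie between the drug and the disease, so each type and group is counted once. A real feature set would be built over many more concepts and path types before being fed to a classifier.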
A third important aspect is the use of the indirect relations that are inherently available in the Euretos graph network database, as visualised below:
As the researchers noted, the predictive power of the machine learning model comes largely from indirect relations rather than direct ones: “One might hypothesize that the existence of a direct path between a drug and a disease would be predictive of an approved drug-disease combination. To test this hypothesis, we counted the number of direct paths in the knowledge graph for the terminated and approved subsets. In the approved subset 50% of the combinations had a direct path, while in the terminated subset 45% of the combinations had a direct path. These results indicate that the presence of a direct path has very limited discriminative value if used as a single feature.”
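The direct-path check described in the quote can be sketched as follows. This is an illustration under stated assumptions, not the researchers' code: the graph, the drug-disease pairs and the function names `has_direct_path` and `direct_path_fraction` are hypothetical, and the toy data is not meant to reproduce the 50% and 45% figures reported in the paper.

```python
def has_direct_path(drug, disease, edges):
    """True if the graph contains an edge linking the drug to the disease."""
    return disease in edges.get(drug, set())


def direct_path_fraction(pairs, edges):
    """Fraction of drug-disease pairs that are directly connected."""
    if not pairs:
        return 0.0
    hits = sum(has_direct_path(drug, disease, edges) for drug, disease in pairs)
    return hits / len(pairs)


# Hypothetical toy graph and subsets, for illustration only.
edges = {
    "d1": {"x"},   # direct link to disease x
    "d2": {"c1"},  # reaches x only via an intermediate concept
    "d3": {"y"},   # direct link to disease y
    "d4": {"c2"},  # reaches y only via an intermediate concept
}
approved = [("d1", "x"), ("d2", "x")]
terminated = [("d3", "y"), ("d4", "y")]

if __name__ == "__main__":
    print("approved:", direct_path_fraction(approved, edges))
    print("terminated:", direct_path_fraction(terminated, edges))
```

When the two fractions are close, as in this toy example, knowing that a direct path exists says little about whether a combination was approved or terminated, which is the point the researchers make about using it as a single feature.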
Overall, the feature sets extracted from the Euretos AI Platform enabled very high machine learning model performance. As the researchers concluded: “In summary, our method demonstrates that the variation and frequencies of semantic types and categories of intermediate concepts between drugs and diseases can be used as highly predictive features for classifying drug-disease combinations as ‘Approved’ or ‘Terminated’. Because this task is a proxy for efficacy, our method is likely to be suitable for drug repurposing as well.”
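For readers less familiar with the AUC metric quoted above, here is a small self-contained sketch of how it can be computed for an “Approved” versus “Terminated” classifier. The scores are made up for illustration and the function name `auc` is our own; the sketch uses the standard pairwise definition of AUC (the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one), not any code from the publication.

```python
def auc(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up model scores, for illustration only.
approved_scores = [0.9, 0.8, 0.3]   # combinations labelled "Approved"
terminated_scores = [0.7, 0.2]      # combinations labelled "Terminated"

if __name__ == "__main__":
    print(round(auc(approved_scores, terminated_scores), 3))
```

In this example five of the six approved-terminated pairs are ranked correctly, giving an AUC of about 0.833. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so AUC measures ranking quality rather than the share of correct predictions.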