
What honest validation looks like for AI-driven target discovery

Written by Euretos News | Apr 15, 2026

The drug-discovery field has spent the past several years debating whether AI-driven approaches deliver clinical benefit. The debate has produced more heat than clarity, partly because “AI for drug discovery” has been used to describe everything from large-language-model literature summarisation to protein-structure prediction to integrated knowledge-graph platforms — methods with different strengths, different failure modes, and different things they should be measured against.

Inside that conversation, one specific question keeps coming back: what does it mean for a target-discovery platform to actually work? The Euretos team has been working on platforms of this kind for more than a decade. This post describes the way we think about validation — what counts, what does not, and where the field’s vocabulary still gets in the way.

Recovery is the first test, not the only test

The first thing a target-discovery platform should be able to demonstrate is that it ranks known answers correctly. If the platform’s ranked candidates for a well-studied disease do not include the canonical drug targets near the top, the integration is broken somewhere. Recovery rates against approved drugs and against well-validated mechanism targets are the entry-level test.

This is necessary, but it is not the headline of platform value. A naïve ranking that assigns each gene a score equal to its number of disease-related publications will achieve high recovery rates against the textbook targets, because those are the targets the textbook describes. Recovery against the well-known is a sanity check, not a discovery claim.
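To make the distinction concrete, here is a rough sketch of a recovery-at-k calculation alongside the publication-count baseline just described. The function names and data shapes are ours for illustration, not a description of the platform's internals.

```python
# Illustrative sketch only: recovery@k for a ranked candidate list, plus the
# naive publication-count baseline it should be compared against. Data shapes
# and names are assumptions for this example, not the platform's API.

def recovery_at_k(ranked_genes, known_targets, k):
    """Fraction of known targets (e.g. approved drug targets) found in the top k."""
    top_k = set(ranked_genes[:k])
    hits = sum(1 for target in known_targets if target in top_k)
    return hits / len(known_targets)

def publication_count_baseline(gene_to_pub_count):
    """Rank genes purely by disease-related publication count.

    High recovery here is expected, which is why recovery alone is a sanity
    check rather than a discovery claim.
    """
    return sorted(gene_to_pub_count, key=gene_to_pub_count.get, reverse=True)
```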

Programs that report recovery numbers as their headline result are reporting a sanity check. The honest framing of recovery is: “the platform’s evidence integration matches the field’s established understanding of these diseases, which is a necessary condition for trusting it on the harder questions.” Anything more than that is an overclaim.

Per-disease ranking is what customers actually use

The metric that maps most directly to the translational research workflow is per-disease ranking. For a single disease, the question is not “across the genome, do approved drug targets rank well in our score” — it is “for this disease, which gene do I pick from the candidate list.”

That is a within-disease question. It is measured by a within-disease metric: how often the candidate at rank 1 (or rank 5, or rank 50) is a target the program will prioritise after manual review by domain experts. The validation here is not against published literature; it is against the judgment of senior researchers running their own diseases through the platform.
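For concreteness, that within-disease metric can be sketched as a hit-at-k calculation. The data shapes below are assumptions for illustration, and “prioritised” here means targets a program would take forward after expert review.

```python
# Minimal sketch of a within-disease hit@k metric. Dictionary shapes are
# assumptions for the example, not the platform's data model.

def hit_at_k(ranked_genes, prioritised_targets, k):
    """True if any expert-prioritised target appears in the top k for one disease."""
    return any(gene in prioritised_targets for gene in ranked_genes[:k])

def hit_rate(per_disease_rankings, per_disease_prioritised, k):
    """Fraction of diseases whose top-k candidates contain a prioritised target."""
    hits = sum(
        hit_at_k(ranking, per_disease_prioritised[disease], k)
        for disease, ranking in per_disease_rankings.items()
    )
    return hits / len(per_disease_rankings)
```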

We measure this kind of validation through structured customer engagements rather than through public benchmarks. A pharma research customer will run their flagship indication through the platform, compare the platform’s ranking to their internal target list, and tell us where the rankings agree, where they disagree, and where the disagreements are platform errors versus cases where the platform surfaced something the internal list missed. This kind of validation cannot be reduced to a single number, and that is part of what makes it useful.

Novelty is the harder claim

The harder validation question — and the one most “AI for drug discovery” claims dance around — is whether the platform surfaces useful candidates that the field has not already named. Recovery says the platform finds the textbook. Novelty says the platform contributes something the textbook does not.

Validating novelty rigorously is genuinely hard. Two designs are credible.

The first is time-stratified hold-out. Take a set of targets that entered clinical development after a defined cutoff date. Score them with a platform whose underlying evidence base has been frozen at that cutoff. Ask: are these post-cutoff clinical targets concentrated in the platform’s higher-scoring regions, controlling for the obvious confounders of literature volume and disease prevalence? If they are, the platform was identifying real signal on those candidates before they reached the clinic. If they are not, the platform may be replicating existing knowledge without extending it.
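One way such an analysis could be set up is sketched below, assuming per-gene scores from an evidence base frozen at the cutoff, binary labels for post-cutoff clinical entry, and log publication count as a stand-in for the literature-volume confounder. All of these are assumptions for the example, not a published protocol.

```python
# Sketch of a time-stratified hold-out: does the frozen platform score add
# signal beyond literature volume alone? Inputs and names are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def holdout_enrichment(frozen_scores, log_pub_counts, entered_clinic_after_cutoff):
    """Compare AUROC of literature volume alone vs. literature volume + frozen score."""
    y = np.asarray(entered_clinic_after_cutoff)
    X_baseline = np.asarray(log_pub_counts).reshape(-1, 1)
    X_full = np.column_stack([log_pub_counts, frozen_scores])

    # A real analysis would cross-validate; this sketch fits and scores in-sample.
    auc_baseline = roc_auc_score(
        y, LogisticRegression().fit(X_baseline, y).predict_proba(X_baseline)[:, 1]
    )
    auc_full = roc_auc_score(
        y, LogisticRegression().fit(X_full, y).predict_proba(X_full)[:, 1]
    )
    return auc_baseline, auc_full  # the gap between the two is the added signal
```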

The second is prospective validation. Run a target-discovery query against a disease where a clinical trial is currently ongoing. Compare the platform’s ranking to the trial outcome when it reads out. This is harder logistically — trial timelines are years and the platform’s recommendations need to be locked at a specific date — but it is the closest analogue to a randomised test of platform value.
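One simple way to make that lock auditable, offered here purely as our illustration rather than a description of how Euretos or any customer does it, is to snapshot the ranked list with a timestamp and a content hash before the trial reads out.

```python
# Illustrative locking of a ranking ahead of a trial readout: snapshot the
# ranked list with a timestamp and a content hash so the prediction cannot be
# quietly revised afterwards. File layout is an assumption for this sketch.
import hashlib
import json
from datetime import datetime, timezone

def lock_ranking(disease_id, ranked_genes, path):
    snapshot = {
        "disease": disease_id,
        "ranking": ranked_genes,
        "locked_at": datetime.now(timezone.utc).isoformat(),
    }
    payload = json.dumps(snapshot, sort_keys=True).encode("utf-8")
    snapshot["sha256"] = hashlib.sha256(payload).hexdigest()
    with open(path, "w") as f:
        json.dump(snapshot, f, indent=2)
    return snapshot["sha256"]
```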

Both designs are work-in-progress for the Euretos platform. We have run informal versions of both with translational research customers. The published versions will appear in due course.

The validation framing customers ask for

In our experience, what customers actually want to see is not a single benchmark number. It is three things in combination:

Defensibility. When the platform surfaces a candidate at the top of a disease ranking, can the user trace each contributing piece of evidence back to its source — the GWAS that anchored the genetic association, the perturbation experiment that supported the mechanism, the cell-type atlas that grounded the expression context? Customers take platform output into advisory committees and project reviews where every claim has to defend itself. Traceability is the single feature that makes that practical.

Reproducibility. Two researchers running the same query against the same platform should get the same ranked list. Two queries on the same target a week apart should agree, unless the underlying evidence has materially changed. Computational platforms whose rankings reorder dramatically from run to run cannot be deployed in serious target work; a minimal consistency check is sketched after this list.

Honest scoping. The platform should be clear about what it does and does not claim. Recovery is recovery, not novelty. Cell-type-resolved ranking depends on the cell-type atlas it draws from. The Indication Selection landscape is a candidate map, not a recommendation. Each capability has a defined scope, and the platform’s documentation should reflect that.
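As a minimal sketch of the consistency check mentioned under reproducibility, two runs of the same query can be compared on rank correlation and top-k overlap. The function names and the choice of k are ours, not platform specifications.

```python
# Minimal reproducibility check between two runs of the same query: rank
# correlation over the shared genes plus overlap of the top-k candidates.
from scipy.stats import spearmanr

def compare_runs(ranking_a, ranking_b, k=50):
    shared = [g for g in ranking_a if g in set(ranking_b)]
    rank_a = [ranking_a.index(g) for g in shared]
    rank_b = [ranking_b.index(g) for g in shared]
    rho, _ = spearmanr(rank_a, rank_b)
    top_k_overlap = len(set(ranking_a[:k]) & set(ranking_b[:k])) / k
    return rho, top_k_overlap
```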

These three together are the validation framework we apply to our own work and the framework we suggest customers apply when evaluating any AI-driven target-discovery platform.