Why Benchmark Winners Often Make Poor Project Copilots
A model can top a leaderboard and still be slow, brittle, opaque, and nearly useless inside a live medicinal chemistry programme.
Drug discovery has long had two different instincts.
One says: start with the target, the structure, the mechanism. Understand what the molecule is doing, then improve it.
The other says: start with the biology. See what changes in cells or systems first, then work backwards to understand why.
That is the old split between rational drug design and phenotypic discovery.
What is striking now is how much of molecular AI has centred the first view.
Most models are built around the rational frame: ligand, target, structure, affinity, pose, property prediction, optimisation against a defined objective.
That makes sense. It is cleaner. The data are usually more structured. The tasks are easier to define. The feedback looks more legible.
But I do wonder whether that cleanliness has pulled the field toward the questions that are easiest to model, rather than the ones most worth answering.
Rational design is powerful. It gives clearer assays, clearer hypotheses, and a more directed optimisation loop.
But it also depends on a chain of assumptions holding up:
that the target is the right one, that the assay captures something meaningful, that binding translates, that the mechanism matters in the real biological setting, that toxicity or compensatory biology does not dominate later.
Phenotypic discovery starts from a much less tidy place.
The assay is messier. The mechanism is often unclear. The optimisation path can be much harder.
But it is also asking a more direct question:
Does this compound actually produce a useful effect in a biological system?
That seems worth taking seriously.
Especially because so many discovery failures are not failures of binding prediction in the narrow sense. They are failures of translation.
Have we over-centred the rational view in AI for drug discovery because it is the one most compatible with current modelling culture?
And if so, what gets neglected?
I ask not because phenotypic discovery is obviously better. It is not. It has real costs in assay design, attribution, and hit optimisation.
I ask because it may be closer to the parts of biology that most often break our elegant stories.
Maybe the real question is not whether discovery should be “rational” or “phenotypic”.
Maybe it is whether our tools are too biased toward problems that are neat, decomposable, and benchmarkable, and not biased enough toward problems that are noisy, entangled, and biologically decisive.
It seems the phenotypic approach deserves more attention than it currently gets in molecular AI.