Scaffold Splits Are Not the Gold Standard
Scaffold splits test one kind of claim. Real project value comes from rapid local adaptation, not zero-shot heroism on unseen chemistry.
One of the clearest signs that a benchmark has gone wrong is when it becomes impossible to tell what the model is actually meant to replace.
A lot of modern ML in small molecule drug discovery is benchmarked as though performance exists in the abstract. A model predicts affinity. A model predicts a pose. A model scores compounds. A leaderboard appears. A paper claims progress.
But in a real project, no model arrives in a vacuum. It enters an existing stack. It sits somewhere between structural biology, computational chemistry, medicinal chemistry, assay turnaround, and synthesis capacity. It is not competing against “nothing”. It is competing against whatever the team already does to make decisions.
So the first question is not whether the model is elegant.
It is: What exactly is this model replacing?
If you cannot answer that precisely, the benchmark is already drifting away from reality.
If you claim your model helps hit finding, benchmark it against the workflows used for hit finding. If you claim it helps lead optimisation, benchmark it against how lead optimisation is actually done.
The real baseline is not some weak generic score. It is a combination of project-level SAR models, fingerprint-based learners, nearest-neighbour reasoning, matched-pair intuition, known chemistry constraints, live assay history, and the medicinal chemists themselves.
That is the incumbent. That is the thing your model has to beat, augment, or accelerate — not a toy baseline chosen because it makes a new architecture look dramatic.
The current wave of cofolding and affinity prediction work is technically impressive. But a lot of it is benchmarked as though solving a structural subproblem automatically translates into solving a project decision problem.
It does not.
A cofolding model may generate a plausible protein–ligand complex. An affinity model may produce a ranking across compounds. Fine. But if the implied deployment setting is lead optimisation, the right question is not whether the model beats a docking score or a generic literature baseline.
The right question is whether it beats project-level SAR.
Can it rank compounds better than ECFP4 plus XGBoost trained on the project’s own assay data? Can it outperform nearest-neighbour methods inside a congeneric series? Can it improve the actual make/test cycle?
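To make the nearest-neighbour comparison concrete, here is a minimal sketch of the kind of within-series baseline meant above. The fingerprints are toy bit sets standing in for ECFP4 (in practice they would come from RDKit), and every value is illustrative, not real assay data:

```python
# Minimal sketch of a within-series nearest-neighbour potency baseline.
# Fingerprints are toy frozensets of "on" bits standing in for ECFP4;
# all compounds and pIC50 values below are illustrative.

def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two binary fingerprints."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def knn_predict(query, train, k=3):
    """Predict activity as the similarity-weighted mean of the k nearest analogues."""
    nearest = sorted(train, key=lambda t: tanimoto(query, t[0]), reverse=True)[:k]
    weights = [tanimoto(query, fp) for fp, _ in nearest]
    if sum(weights) == 0:
        return sum(y for _, y in nearest) / len(nearest)
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Toy congeneric series: (fingerprint, pIC50)
train = [
    (frozenset({1, 2, 3, 4}), 7.2),
    (frozenset({1, 2, 3, 5}), 7.0),
    (frozenset({1, 2, 6, 7}), 6.1),
    (frozenset({8, 9, 10}),   5.0),
]

query = frozenset({1, 2, 3, 6})
print(round(knn_predict(query, train, k=3), 2))
```

This is the incumbent in its simplest form: a few lines, trained on the project's own data, exploiting the locality of the series. Any new model claiming lead-optimisation value has to beat exactly this kind of thing on the project's own compounds.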
Very often, those comparisons are missing. And when they are missing, that is the whole story.
There is something revealing about a benchmark that compares a new model to everything except standard practice. It suggests the benchmark was designed by people thinking about modelling in isolation, not by people who have deployed models inside live projects.
Anyone who has worked on a medicinal chemistry programme knows the score. Nobody cares that your model improved Pearson correlation on a benchmark dataset if the chemists still would not use it to choose the next compounds. Nobody cares about beautiful cofolded structures if they do not beat local SAR in the regime where decisions are actually made.
A lot of benchmarks in this area feel like they were written by ML practitioners who understand model classes, loss functions, and public datasets, but have never sat in a project meeting and defended why a particular compound should be made next.
Deployment changes what “good” means.
On a real project, the model is judged on whether it changes decisions. Does it help choose compounds a team would otherwise miss? Does it reduce wasted synthesis? Does it prune dead ends earlier? Does it improve the sequence in which compounds are made and tested?
A live project introduces constraints many benchmarks quietly ignore: the chemistry is local, the series is narrow, the data are sparse and biased, the assay history matters, synthesis feasibility matters, turnaround time matters, and the team already has strong priors.
That is why simple project-local models are so hard to beat. They are trained on the exact decision context that matters.
Imagine someone launches a new model for forecasting retail demand. They benchmark it against a 1998 linear baseline, show a nice improvement, and declare it ready for deployment. But they never compare it to what every operations team actually uses: gradient boosting on internal sales history, human overrides, seasonality rules, and stock constraints.
Everyone would immediately see the trick.
Or imagine building a new chess engine and benchmarking it against people who only know the rules, while carefully avoiding Stockfish.
Drug discovery papers do versions of this all the time. They benchmark against weak or irrelevant baselines, then imply they are competing with project practice. They are competing with a straw man.
Part of the confusion comes from collapsing different tasks into one vague idea of “better drug discovery AI”.
A structure-first model may help with pose hypothesis generation, target enablement, interpreting binding mode, or guiding scaffold exploration early on.
A project SAR model may help with ranking close analogues, estimating local potency trends, learning assay-specific quirks, or prioritising what to make next.
These are different operational roles. A structure model does not become a lead optimisation model just because it outputs an affinity-like number. A benchmark should not pretend otherwise.
If your model is for structural hypothesis generation, say that. If it is for prioritising compounds inside a live series, benchmark it against the models that already do that job. Do not quietly slide from one use case to another because it sounds more commercially relevant.
You see it when the task is defined too vaguely, the train/test split ignores temporal reality, the baseline omits local SAR models, the metric ignores project utility, the chemistry looks nothing like a real design cycle, and the paper never states where the model would actually be used by a team.
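One of those failure modes, the split that ignores temporal reality, is worth pinning down. A project model should be evaluated on compounds registered after everything it trained on, because that is the only direction the project ever moves. A sketch, with hypothetical compound IDs and dates:

```python
# Sketch of a temporal train/test split for project assay data.
# Records are (registration_date, compound_id, pIC50); all values illustrative.
from datetime import date

records = [
    (date(2023, 1, 10), "CPD-001", 6.1),
    (date(2023, 3, 2),  "CPD-002", 6.8),
    (date(2023, 6, 15), "CPD-003", 7.0),
    (date(2023, 9, 1),  "CPD-004", 7.4),
    (date(2024, 1, 20), "CPD-005", 7.9),
]

def temporal_split(records, cutoff):
    """Train on everything registered before the cutoff; test on what came after.
    This mimics the real deployment question: given the SAR known at the
    cutoff, how well do we rank the compounds the team made next?"""
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test

train, test = temporal_split(records, cutoff=date(2023, 7, 1))
print([cid for _, cid, _ in train])  # compounds available at decision time
print([cid for _, cid, _ in test])   # the actual forward prediction task
```

A random shuffle instead of this cutoff leaks future SAR into the training set and inflates every metric, which is precisely the kind of optimistic abstraction the paragraph above is describing.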
That is a sign the authors optimised for academic separability, not operational truth.
A useful benchmark in this field should feel slightly uncomfortable. It should force the model to compete against what practitioners actually trust.
Before benchmarking the model, write this sentence:

This model is intended to replace or augment X in the workflow.
Then be specific. Not “drug discovery”. Not “molecule design”. Name the task, the decision point, and the incumbent workflow the model would displace.
Now the benchmark can be designed properly. The baseline becomes obvious. The metrics become meaningful. The deployment claim can actually be tested.
If the model is for lead optimisation, benchmark it against project SAR. If it is for within-series potency ranking, benchmark it against ECFP4 plus XGBoost, nearest-neighbour, random forest, and matched-pair baselines. If it is for structural reasoning, benchmark it on structural reasoning and stop implying that this automatically means it will improve compound selection.
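What that head-to-head looks like in practice is simple: score the candidate model and the local baseline on the same held-out compounds with a rank metric, since ranking is what drives compound selection. A self-contained sketch, with hypothetical predictions for both models:

```python
# Minimal sketch of the comparison the text calls for: a candidate model
# versus a simple project-local baseline on the same held-out compounds,
# judged by Spearman rank correlation. All numbers are illustrative.

def rankdata(xs):
    """1-based ranks, with ties assigned their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation on the ranks."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

measured = [6.1, 6.8, 7.0, 7.4, 7.9]   # held-out assay values
model    = [6.0, 7.1, 6.9, 7.2, 8.0]   # hypothetical new model's predictions
baseline = [6.2, 6.7, 7.1, 7.3, 7.8]   # hypothetical local kNN baseline

print(spearman(measured, model), spearman(measured, baseline))
```

In this toy example the local baseline out-ranks the new model, which is exactly the uncomfortable outcome a properly designed benchmark has to be allowed to produce.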
The easiest way to puncture an over-claimed benchmark is to ask one question:
What exactly is your drug discovery model replacing?
If the answer is vague, the benchmark is vague. If the answer shifts halfway through the discussion, the benchmark is mis-specified. If the answer avoids project SAR, medicinal chemistry practice, or live decision-making, then the benchmark is probably not measuring project value at all.
It is measuring performance on a convenient abstraction. And in small molecule drug discovery, convenient abstractions are where models go to look impressive without becoming useful.