Scaffold Splits Are Not the Gold Standard
Zero-shot evaluation is a convenience. Real drug discovery lives in local optimisation loops, and your benchmarks should treat local adaptation as a first-class objective.
In molecular ML, scaffold splits are often treated as the gold standard for evaluation.
That framing claims too much.
Scaffold splits test something real. If a model only works when the test compounds look very similar to the training set, that matters. A scaffold split can expose shallow memorisation and exaggerated claims of generality.
But the real question is: gold standard for what?
In a live small-molecule project, you are usually not operating in complete ignorance on a brand new scaffold with zero local data. You are trying to make better decisions inside an active programme.
A real medicinal chemistry programme is a local optimisation loop. You have a chemotype, a handful of assays, emerging SAR, known liabilities, finite synthesis capacity, and some ADME data already. You need to decide what to make and test next.
The question is not usually: Can this model generalise to a completely unseen scaffold family with no local information?
It is more often: Once I have a few data points, can this model learn quickly enough to improve the next round of decisions?
That is a much more relevant standard for deployed use.
Scaffold splits sound demanding. They ask the model to extrapolate across large structural jumps. That feels like serious intelligence.
In practice, they often overweight a kind of heroism that matters less than people admit.
In lead optimisation, you do not usually need the model to perform magical scaffold hopping from a standing start. You need it to become useful fast once the project starts generating local information — a few potency measurements, a handful of permeability readouts, early solubility flags, narrow congeneric trends, assay-specific quirks.
That is rapid local adaptation. And it is exactly what a pure scaffold split tends to under-measure.
People sometimes accept the local-adaptation argument for potency, then quietly assume scaffold splits still make perfect sense for ADME.
Not really.
In a real project, nobody sensible is trying to eliminate ADME assays entirely. You are not going to trust a model so completely that you stop measuring solubility, permeability, microsomal stability, or clearance proxies.
The actual value of an ADME model is different. You want a model that becomes informative after a few local measurements. You run a few assays, learn the local property landscape, update your priors, and then the model helps you decide which analogues are worth making and which liabilities are fixable.
So the relevant question is not: Can the model predict properties on a totally unseen scaffold with no local context?
It is: After 5, 10, or 20 local data points from this series, how quickly does the model become useful?
The field often acts as though the ideal model solves everything from scratch on unseen chemistry. That sounds impressive. It is also usually the wrong target.
Real teams need decision support, not omniscience.
For potency, they need help ranking close analogues. For ADME, they need help understanding how quickly local measurements can feed better next-step prioritisation. In both cases, the commercially relevant capability is often sample efficiency in the local regime — how fast the model learns once the project starts speaking.
For many ADME use cases, a more realistic benchmark would not be a single hard scaffold split. It would ask: given 5, 10, or 20 measurements from a new series, how well does the model rank the remaining analogues, and how quickly do its predictions beat a simple local baseline?
That is much closer to the real workflow. The model is not replacing assays. It is replacing some fraction of bad decisions between assays.
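A minimal sketch of that kind of learning-curve evaluation, using synthetic data and a k-nearest-neighbour stand-in for the model (every name, dataset, and interface here is illustrative, not a real benchmark or API):

```python
import random
import statistics

random.seed(0)

def make_series(n=40):
    """Hypothetical congeneric series: property is a noisy linear function
    of a 3-dimensional feature vector, standing in for real descriptors."""
    series = []
    for _ in range(n):
        x = [random.uniform(0, 1) for _ in range(3)]
        y = 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[2] + random.gauss(0, 0.1)
        series.append((x, y))
    return series

def knn_predict(train, x, k=3):
    """Toy model stand-in: mean property of the k nearest training points."""
    nearest = sorted(
        train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x))
    )
    return statistics.mean(y for _, y in nearest[:k])

def learning_curve(series, budgets=(5, 10, 20)):
    """Mean absolute error on the rest of the series after n local assays."""
    curve = {}
    for n in budgets:
        train, test = series[:n], series[n:]
        errors = [abs(knn_predict(train, x) - y) for x, y in test]
        curve[n] = statistics.mean(errors)
    return curve

curve = learning_curve(make_series())
print(curve)
```

The metric of interest is not a single score but the shape of the curve: how fast error falls as the assay budget grows from 5 to 20 local measurements.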
There are at least three different things people mean when they talk about generalisation:
Zero-shot generalisation — can the model say something useful on entirely new chemistry with no local data?
Local interpolation — can it exploit neighbourhood structure inside a live series?
Rapid adaptation — can it become useful quickly once a few project-specific data points arrive?
Scaffold splits mostly stress the first. Real project value often comes from the second and third.
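The three regimes correspond to three different ways of splitting the same data. A toy sketch, using hypothetical (scaffold, features, value) records rather than real chemistry:

```python
import random

random.seed(1)

# Illustrative dataset: four series ("scaffolds"), ten compounds each.
data = [
    {"scaffold": s, "x": random.random(), "y": random.random()}
    for s in ["A", "B", "C", "D"]
    for _ in range(10)
]

def scaffold_split(records, held_out="D"):
    """Zero-shot: the test scaffold never appears in training."""
    train = [r for r in records if r["scaffold"] != held_out]
    test = [r for r in records if r["scaffold"] == held_out]
    return train, test

def within_series_split(records, scaffold="A", n_train=7):
    """Local interpolation: train and test drawn from the same series."""
    series = [r for r in records if r["scaffold"] == scaffold]
    return series[:n_train], series[n_train:]

def adaptation_split(records, new_scaffold="D", n_shots=5):
    """Rapid adaptation: all other series plus a few shots from the new one."""
    base = [r for r in records if r["scaffold"] != new_scaffold]
    new = [r for r in records if r["scaffold"] == new_scaffold]
    return base + new[:n_shots], new[n_shots:]
```

A benchmark built from the second and third splits asks a different question than the first, and for a live programme it is usually the more relevant one.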
That is why scaffold splits are not the gold standard. They are one test for one type of claim.
For potency, local SAR matters because medicinal chemistry is dominated by close analogue decisions.
For ADME, local learning matters because property landscapes are series-dependent and teams will always generate at least some fresh assay data.
In both cases, the model’s value depends heavily on how well it uses small amounts of local information.
Not whether it can perform a heroic cross-scaffold leap under artificially data-starved conditions.
Treating scaffold splits as the gold standard for every molecular model is like judging every employee by how well they perform on day one with no onboarding.
That tells you something. But for most real jobs, the more important question is how quickly they become effective once they have a bit of context.
A model that learns fast from sparse local data is often more valuable than one that looks more “general” in a benchmark designed around total novelty.
If the model is for hit finding or scaffold hopping, scaffold splits make sense.
If the model is for lead optimisation, project-local and temporal evaluation matter more.
If the model is for ADME support inside a live programme, the key test is: how quickly does it become useful after a few local assays?
Not because novelty does not matter. Because deployment does.
The right question is not: Did it pass a scaffold split?
It is: Did it succeed on the task it is actually supposed to perform in a real project?
For a lot of potency and ADME work, that task is rapid, local, decision-relevant learning. That is the standard that actually matters.