Scaffold Splits Are Not the Gold Standard

By Blaise AI Team

In molecular ML, scaffold splits are often treated as the gold standard for evaluation.

That framing is too broad.

Scaffold splits test something real. If a model only works when the test compounds look very similar to the training set, that matters. A scaffold split can expose shallow memorisation and exaggerated claims of generality.

But the real question is: gold standard for what?

In a live small-molecule project, you are usually not operating in complete ignorance on a brand-new scaffold with zero local data. You are trying to make better decisions inside an active programme.

Most project decisions are local

A real medicinal chemistry programme is a local optimisation loop. You have a chemotype, a handful of assays, emerging SAR, known liabilities, finite synthesis capacity, and some ADME data already. You need to decide what to make and test next.

The question is not usually: Can this model generalise to a completely unseen scaffold family with no local information?

It is more often: Once I have a few data points, can this model learn quickly enough to improve the next round of decisions?

That is a much more relevant standard for deployed use.

Scaffold splits overemphasise the wrong failure mode

Scaffold splits sound demanding. They ask the model to extrapolate across large structural jumps. That feels like a serious test of intelligence.

In practice, they often overweight a kind of heroism that matters less than people admit.

In lead optimisation, you do not usually need the model to perform magical scaffold hopping from a standing start. You need it to become useful fast once the project starts generating local information — a few potency measurements, a handful of permeability readouts, early solubility flags, narrow congeneric trends, assay-specific quirks.

That is rapid local adaptation. And it is exactly what a pure scaffold split tends to under-measure.

This is just as true for ADME

People sometimes accept the local-adaptation argument for potency, then quietly assume scaffold splits still make perfect sense for ADME.

Not really.

In a real project, nobody sensible is trying to eliminate ADME assays entirely. You are not going to trust a model so completely that you stop measuring solubility, permeability, microsomal stability, or clearance proxies.

The actual value of an ADME model is different. You want a model that becomes informative after a few local measurements. You run a few assays, learn the local property landscape, update your priors, and then the model helps you decide which analogues are worth making and which liabilities are fixable.

So the relevant question is not: Can the model predict properties on a totally unseen scaffold with no local context?

It is: After 5, 10, or 20 local data points from this series, how quickly does the model become useful?

Drug discovery rarely wants zero-shot perfection

The field often acts as though the ideal model solves everything from scratch on unseen chemistry. That sounds impressive. It is also usually the wrong target.

Real teams need decision support, not omniscience.

For potency, they need help ranking close analogues. For ADME, they need help understanding how quickly local measurements can feed better next-step prioritisation. In both cases, the commercially relevant capability is often sample efficiency in the local regime — how fast the model learns once the project starts speaking.

A better ADME benchmark would reflect deployment reality

For many ADME use cases, a more realistic benchmark would not be a single hard scaffold split. It would ask:

  • With 0 local data points, how good is the prior?
  • With 5 local data points, how much does performance improve?
  • With 10, can it rank the next compounds well?
  • With 20, does it beat simple project-local baselines?

That is much closer to the real workflow. The model is not replacing assays. It is replacing some fraction of bad decisions between assays.
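To make that concrete, here is a minimal sketch of what such an evaluation loop could look like in Python, assuming featurised compounds from a single series and some existing `prior_model` to serve as the zero-shot baseline. The function name, the random-forest stand-in for the adapted model, and the MAE metric are illustrative choices, not a prescribed protocol.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


def local_adaptation_curve(prior_model, X_series, y_series,
                           budgets=(0, 5, 10, 20), seed=0):
    """Score a model at increasing local-data budgets within one series.

    `prior_model` stands in for whatever pretrained/global model already exists;
    it is only used as the zero-shot baseline. Refitting a random forest is a
    stand-in for however you would actually adapt to project data.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X_series))   # simulated assay order within the series
    eval_idx = order[max(budgets):]          # fixed evaluation pool, never used for adaptation
    curve = {}
    for n in budgets:
        if n == 0:
            # zero-shot: the prior alone, no local data
            preds = prior_model.predict(X_series[eval_idx])
        else:
            # rapid adaptation: fit a simple local model on the first n "assayed" compounds
            local = RandomForestRegressor(n_estimators=200, random_state=seed)
            local.fit(X_series[order[:n]], y_series[order[:n]])
            preds = local.predict(X_series[eval_idx])
        curve[n] = mean_absolute_error(y_series[eval_idx], preds)
    return curve   # error at each budget; what matters is how fast it drops as n grows
```

A ranking metric over the evaluation pool would address the "can it rank the next compounds well" question even more directly. Either way, the shape of the curve is the point, not the specific learner.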

The field keeps collapsing different goals into “generalisation”

There are at least three different things people mean when they talk about generalisation:

Zero-shot generalisation — can the model say something useful on entirely new chemistry with no local data?

Local interpolation — can it exploit neighbourhood structure inside a live series?

Rapid adaptation — can it become useful quickly once a few project-specific data points arrive?

Scaffold splits mostly stress the first. Real project value often comes from the second and third.
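For readers who want the distinction in code, below is a rough sketch of the two regimes, assuming RDKit is available for Bemis-Murcko scaffolds. The conventions shown (smallest scaffold groups held out for test, a simple registration-order cut for the within-project split) are simplified illustrations rather than a recommended recipe.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds.MurckoScaffold import MurckoScaffoldSmiles


def scaffold_split(smiles_list, test_fraction=0.2):
    """Zero-shot regime: hold out entire scaffold families the model has never seen."""
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        groups[MurckoScaffoldSmiles(smiles=smi)].append(i)
    ordered = sorted(groups.values(), key=len)   # smallest scaffold groups first
    target = int(test_fraction * len(smiles_list))
    test_idx, train_idx = [], []
    for group in ordered:
        (test_idx if len(test_idx) < target else train_idx).extend(group)
    return train_idx, test_idx


def within_project_split(indices_by_registration_date, test_fraction=0.2):
    """Local regime: train on earlier compounds in the series, test on the ones made next.

    This is the setting where local interpolation and rapid adaptation actually show up.
    """
    cut = int((1 - test_fraction) * len(indices_by_registration_date))
    return indices_by_registration_date[:cut], indices_by_registration_date[cut:]
```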

That is why scaffold splits are not the gold standard. They are one test for one type of claim.

Lead optimisation is local, and ADME optimisation is too

For potency, local SAR matters because medicinal chemistry is dominated by close analogue decisions.

For ADME, local learning matters because property landscapes are series-dependent and teams will always generate at least some fresh assay data.

In both cases, the model’s value depends heavily on how well it uses small amounts of local information.

Not whether it can perform a heroic cross-scaffold leap under artificially data-starved conditions.

A better analogy

Treating scaffold splits as the gold standard for every molecular model is like judging every employee by how well they perform on day one with no onboarding.

That tells you something. But for most real jobs, the more important question is how quickly they become effective once they have a bit of context.

A model that learns fast from sparse local data is often more valuable than one that looks more “general” in a benchmark designed around total novelty.

The benchmark should match the workflow

If the model is for hit finding or scaffold hopping, scaffold splits make sense.

If the model is for lead optimisation, project-local and temporal evaluation matter more.

If the model is for ADME support inside a live programme, the key test is: how quickly does it become useful after a few local assays?

Not because novelty does not matter. Because deployment does.

The right question is not: Did it pass a scaffold split?

It is: Did it succeed on the task it is actually supposed to perform in a real project?

For a lot of potency and ADME work, that task is rapid, local, decision-relevant learning. That is the standard that actually matters.
