Medicinal Chemistry Is a Bandit Problem

By Blaise AI Team

Lead optimisation is usually described as a modelling problem.

Predict potency. Predict ADME. Predict which substitution helps. Score the next compounds.

That is not wrong.

It is just incomplete.

The deeper structure of the work looks much more like a bandit problem: exploit the region that already looks promising, or spend resources exploring a direction that might look worse now but could unlock a better series later.

Medicinal chemists make this trade every week. They just do not call it that.

The real tension is not prediction versus no prediction

The real tension is exploitation versus exploration under cost.

Do you keep pushing the current vector because the local SAR looks strong? Do you branch into a riskier region because you suspect the current chemistry is saturating? Do you make the obvious analogue that probably works, or the uncertain one that could tell you whether the programme is quietly heading into a dead end?

Those are not just questions about estimated activity.

They are allocation questions.

Every compound is a pull with a price tag attached: synthesis effort, assay time, queue impact, and the chance that the result clarifies almost nothing.

The bandit in medicinal chemistry is nastier than the textbook one

In textbook settings, arms are often cheap and clean. Rewards arrive quickly. Pulls do not change the cost of future pulls very much.

Drug discovery is worse behaved.

Arms are correlated because analogue series share chemistry and biology. Rewards are delayed. Some pulls require route invention before they can even be attempted. The best arm may not be worth pulling now because its verification cost is too high. A failed experiment can still be valuable if it rules out a direction cleanly. A successful experiment can be almost useless if it teaches nothing new.

That means the true decision problem is richer than simple property ranking. It includes information value, synthetic burden, delay, and strategic positioning.

A model that ignores those terms is playing the wrong game.
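To make those extra terms concrete, here is a minimal sketch of scoring a pull rather than a compound. Everything in it is an illustrative assumption: the `Candidate` fields, the `pull_value` formula, the discount rate, and all the numbers are hypothetical, not taken from a real programme or from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    expected_gain: float      # predicted improvement in the property of interest
    information_value: float  # how much the result should change later decisions
    synthesis_cost: float     # effort to make it, e.g. chemist-days
    delay_weeks: float        # time until the assay result comes back


def pull_value(c: Candidate, info_weight: float = 1.0,
               weekly_discount: float = 0.95) -> float:
    """Hypothetical value of a pull: benefit, discounted for delay, per unit cost."""
    benefit = c.expected_gain + info_weight * c.information_value
    discounted = benefit * weekly_discount ** c.delay_weeks
    return discounted / c.synthesis_cost


# Made-up candidates, for illustration only.
candidates = [
    Candidate("close analogue", 0.30, 0.05, 2.0, 2.0),
    Candidate("cheap probe", 0.10, 0.60, 2.0, 2.0),
    Candidate("new scaffold", 0.50, 0.80, 15.0, 8.0),
]

by_prediction = max(candidates, key=lambda c: c.expected_gain)
by_pull_value = max(candidates, key=pull_value)

print(by_prediction.name)  # new scaffold
print(by_pull_value.name)  # cheap probe
```

With these invented numbers, ranking by prediction alone picks the new scaffold, while ranking by pull value picks the cheap probe. The point is not the particular formula. It is that the ordering changes once cost, delay, and information enter the score.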

Greedy play feels smart until the series collapses

This is why purely greedy compound selection disappoints so often.

The top predicted molecule may sit in the most exploited region of the current series. It may look like the safest bet. It may even deliver a small gain. But if every round keeps doing that, the project can quietly lose optionality. It learns too slowly about adjacent regions. It fails to test whether the current hypothesis is fragile. It overcommits to a local hill that turns out not to lead anywhere useful.

Good chemists instinctively resist this trap. They know some compounds are worth making not because they are most likely to win today, but because they preserve or expand what the project can learn tomorrow.

That is bandit thinking whether the field uses the vocabulary or not.
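The greedy trap can be shown with a deliberately small toy. The sketch below assumes two arms with noiseless rewards, a prior that slightly underrates the second arm, and a standard UCB1-style exploration bonus. The arm names, priors, and true values are made up; real programmes have noisy, delayed, correlated arms, so this is a caricature, not a simulation of lead optimisation.

```python
import math


def run(policy: str, true_means, prior_means, rounds: int):
    """Play a noiseless bandit and return how often each arm was pulled."""
    n_arms = len(true_means)
    counts = [1] * n_arms            # the prior counts as one pseudo-observation
    estimates = list(prior_means)
    pulls = [0] * n_arms
    for _ in range(rounds):
        if policy == "greedy":
            scores = list(estimates)
        else:  # "ucb": estimate plus a bonus that shrinks as an arm is sampled
            total = sum(counts)
            scores = [
                estimates[i] + math.sqrt(2 * math.log(total) / counts[i])
                for i in range(n_arms)
            ]
        arm = max(range(n_arms), key=lambda i: scores[i])
        reward = true_means[arm]     # noiseless, to keep the toy deterministic
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        pulls[arm] += 1
    return pulls


# Arm 0: the current series. Arm 1: an adjacent region the prior underrates.
true_means = [0.5, 0.7]
prior_means = [0.5, 0.4]

greedy_pulls = run("greedy", true_means, prior_means, rounds=50)
ucb_pulls = run("ucb", true_means, prior_means, rounds=50)
print(greedy_pulls, ucb_pulls)
```

In this toy, the greedy policy never pulls the second arm, because nothing ever challenges its prior. The UCB policy pays for a look, finds the underrated arm is better, and ends up pulling it more often than the first.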

Exploration in discovery is not wandering aimlessly

One reason people underappreciate this frame is that exploration sounds vague.

In practice it is usually disciplined. A chemist may explore because permeability is threatening to become the dominant constraint. Or because the current vector has flattened. Or because one cheap perturbation can reveal whether a promising explanation is real or a mirage. Or because an alternative route opens a family of analogues that changes the speed of future rounds.

Exploration is not randomness.

It is strategic spending on uncertainty.

The reward is not only the compound result

Another place the standard framing falls short is in how it treats reward.

A bandit lens makes it easier to see that the payoff of an experiment is not just the measured property of the compound itself. The payoff includes what the result does to future decisions.

A weak compound can still be a strong pull if it sharply updates the team’s beliefs. A pretty result can be a poor pull if it leaves the major uncertainties intact. A route that opens ten follow-up compounds can matter more than one molecule with a slightly better number.

Once you take that seriously, many current benchmarks start to look too flat. They reward the immediate compound outcome while ignoring the strategic value of the pull.
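One hedged way to put a number on "sharply updates the team's beliefs" is expected information gain about a single yes/no hypothesis. The sketch below assumes the team holds a prior probability that some explanation is true, and that each experiment has a binary outcome whose likelihood depends on that explanation. The function names and all probabilities are hypothetical.

```python
import math


def entropy(p: float) -> float:
    """Binary entropy in bits; zero when the belief is certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_information_gain(prior: float, p_hit_if_true: float,
                              p_hit_if_false: float) -> float:
    """Expected drop in uncertainty about the hypothesis after one experiment."""
    p_hit = prior * p_hit_if_true + (1 - prior) * p_hit_if_false
    outcomes = (
        (p_hit, p_hit_if_true),            # the compound hits
        (1 - p_hit, 1 - p_hit_if_true),    # the compound misses
    )
    gain = entropy(prior)
    for p_outcome, likelihood_if_true in outcomes:
        if p_outcome > 0:
            posterior = prior * likelihood_if_true / p_outcome
            posterior = min(1.0, max(0.0, posterior))  # guard float round-off
            gain -= p_outcome * entropy(posterior)
    return gain


# A "pretty" analogue: likely to hit whether or not the hypothesis is true.
safe_gain = expected_information_gain(0.5, p_hit_if_true=0.90, p_hit_if_false=0.85)
# A probe: probably weak overall, but its outcome depends on the hypothesis.
probe_gain = expected_information_gain(0.5, p_hit_if_true=0.80, p_hit_if_false=0.10)

print(safe_gain < probe_gain)  # True
```

Under these assumed likelihoods the probe carries far more information than the safe analogue, even though it is less likely to produce a good number. A benchmark that scores only the measured property would rank them the other way round.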

This should change what systems are built to do

If medicinal chemistry has this structure, then the useful system is not just a predictor. It is a policy aid.

It should help the team balance exploitation against exploration. It should understand that the cost of a pull depends on synthesis. It should value experiments by how they improve the next round, not only by expected score. It should be sensitive to current project state, not only compound descriptors. It should know that an expensive uncertain jump has to clear a higher bar than a cheap local probe.

That is a different standard from static supervised learning.

It is also much closer to the actual difficulty of the work.
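As a rough sketch of what "policy aid" could mean in code, the function below picks a batch under a synthesis budget, and lets a single project-state input shift the balance between expected gain and information. The `series_saturation` input, the dictionary keys, the scoring rule, and every number are assumptions made for illustration; a real system would need a much richer notion of project state.

```python
def select_batch(candidates, budget: float, series_saturation: float):
    """Greedily fill a budget, weighting learning more as the series flattens.

    series_saturation is a hypothetical state variable in [0, 1]:
    0 means the local SAR is still paying off, 1 means it has flattened.
    """
    explore_weight = series_saturation

    def score(c):
        value = ((1 - explore_weight) * c["expected_gain"]
                 + explore_weight * c["information_value"])
        return value / c["cost"]  # costly pulls must clear a higher bar

    chosen, spent = [], 0.0
    for c in sorted(candidates, key=score, reverse=True):
        if spent + c["cost"] <= budget:
            chosen.append(c["name"])
            spent += c["cost"]
    return chosen


# Made-up candidates; cost is in arbitrary effort units.
candidates = [
    {"name": "next analogue", "expected_gain": 0.4, "information_value": 0.1, "cost": 2.0},
    {"name": "matched-pair probe", "expected_gain": 0.1, "information_value": 0.5, "cost": 2.0},
    {"name": "scaffold hop", "expected_gain": 0.3, "information_value": 0.9, "cost": 8.0},
]

early = select_batch(candidates, budget=10.0, series_saturation=0.1)
late = select_batch(candidates, budget=10.0, series_saturation=0.8)

print(early)  # ['next analogue', 'matched-pair probe']
print(late)   # ['matched-pair probe', 'scaffold hop']
```

The same three candidates yield different batches depending on where the project stands. A static ranking on compound descriptors cannot produce that behaviour, because the project state is not among its inputs.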

The field often rewards the wrong kind of cleverness

There is a lot of glamour in building models that look broadly chemically intelligent. There is less glamour in asking whether they behave sensibly inside a stateful, expensive exploration-exploitation loop.

But the second question is the one that decides whether the system becomes useful in a live programme.

If the model always exploits local confidence and never pays to reduce the right uncertainty, it is not being conservative. It is being shortsighted.

If it explores without regard to route burden and assay delay, it is not being imaginative. It is being operationally naive.

The hard part is the balance.

That is why the bandit framing matters. It makes the trade explicit instead of smuggling it into a property score and hoping no one notices.

Lead optimisation does involve modelling.

But underneath the modelling sits a harder question: how do you spend scarce experiments to improve the project as fast as possible?

That is the real game.