People talk about patent extraction as if the goal were simply to pull more molecules out of PDFs.
That framing undersells the opportunity by a wide margin.
A strong medicinal chemistry patent does not contain one kind of training signal. It contains several, layered on top of each other, and each one maps to a different ML task. Treating the whole corpus as a single flat source of compounds throws most of that away.
The standard way to train a small-molecule foundation model is on large sets of isolated compounds. Structures without relationships. Molecules without series context.
Patent data is structured differently. A well-exemplified patent gives you related compounds built around a shared core, with both stronger and weaker analogues, and plenty of local structure–activity variation. That is a much richer training signal than a flat collection of unrelated actives. It teaches a model how chemistry moves within a series — what a single substitution does to potency, how a vector change affects selectivity, where lipophilicity starts to cost more than it buys.
That kind of local, relational signal is exactly what most large-scale pre-training datasets lack. And patents have it by default, because patents describe programmes, not molecules in isolation.
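To make the idea of relational signal concrete, here is a minimal sketch of turning a flat table of extracted compounds into within-series pairs with potency deltas. The record layout, the series and compound identifiers, and the pIC50 values are all invented for illustration; a real pipeline would work from parsed structures rather than substituent labels.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical extracted records: (series_id, compound_id, substituent, pIC50).
# All values are made up for illustration.
records = [
    ("series_A", "A1", "H", 6.1),
    ("series_A", "A2", "F", 6.9),
    ("series_A", "A3", "OMe", 5.4),
    ("series_B", "B1", "H", 7.0),
    ("series_B", "B2", "Cl", 7.6),
]


def within_series_pairs(rows):
    """Pair every two compounds that share a series and record the potency change."""
    by_series = defaultdict(list)
    for series_id, compound_id, substituent, pic50 in rows:
        by_series[series_id].append((compound_id, substituent, pic50))

    pairs = []
    for series_id, members in by_series.items():
        for (id_a, sub_a, p_a), (id_b, sub_b, p_b) in combinations(members, 2):
            pairs.append({
                "series": series_id,
                "from": id_a,
                "to": id_b,
                "change": f"{sub_a}>>{sub_b}",
                "delta_pic50": round(p_b - p_a, 2),
            })
    return pairs


pairs = within_series_pairs(records)
for pair in pairs:
    print(pair)
```

Pairs are only formed inside a series, never across series, which is the point: the training example becomes "this change, in this context, moved potency by this much" rather than an isolated structure with a label.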
If you are a biotech building on a known mechanism where a competitor has already published a patent, that document is a compressed record of exploration someone else already paid for.
It shows which vectors were tried. Which substitutions moved potency. Where selectivity probably broke down. Which motifs were abandoned. Which ideas were strong enough to keep building around.
For a fast-follow programme, this is directly actionable. The goal is not to re-derive the SAR from scratch — it is to reach a better molecule with fewer design–make–test cycles and fewer synthesised compounds. Patent data tells you where the minefield has already been partially mapped, so you do not have to step on every mine yourself.
For ML, that makes patent data useful not just for predicting activity, but for helping teams avoid wasting cycles on chemistry the field has already half-ruled out.
A patent preserves something most public datasets strip away: the shape of an optimisation campaign.
You can see which vectors attracted repeated investment. Which motifs were revisited with increasing complexity. Which regions of chemical space were explored briefly and then dropped. Where progress appears to have stalled or changed direction.
Even without the internal lab notebook, the compound table carries a trace of the project’s decision-making. The frequency and complexity of R-groups across positions are not random: they reflect which substitutions the team thought were worth pursuing, and which they gave up on.
That signal matters if you want models that help guide programmes, not just score compounds one at a time. A model that understands programme trajectories can do something a standard activity predictor cannot: it can infer where a series is heading and where it has already been.
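One rough way to read that trace from a compound table is to count how many distinct substituents were tried at each position. The sketch below assumes the table has already been parsed into a mapping from example number to R-group assignments; the position names and substituents are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical parsed compound table: example number -> {position: substituent}.
compound_table = {
    1: {"R1": "H", "R2": "Me"},
    2: {"R1": "F", "R2": "Me"},
    3: {"R1": "Cl", "R2": "Me"},
    4: {"R1": "OMe", "R2": "Me"},
    5: {"R1": "F", "R2": "Et"},
}


def position_diversity(table):
    """Summarise how much variation each R-group position received."""
    counts = defaultdict(Counter)
    for substituents in table.values():
        for position, group in substituents.items():
            counts[position][group] += 1

    return {
        position: {
            "distinct": len(counter),
            "most_common": counter.most_common(1)[0],
        }
        for position, counter in counts.items()
    }


summary = position_diversity(compound_table)
for position, stats in summary.items():
    print(position, stats)
```

In this toy table R1 sees four distinct groups while R2 is almost always methyl, which would suggest the team explored R1 and settled R2 early. Counts like these are a crude proxy for investment, not a measurement of it.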
The reaction schemes in patents are not textbook chemistry. They reflect what a real team actually chose to make while trying to move a programme forward under time and resource constraints.
That means they carry information about feasible transformations, preferred disconnections, protecting-group strategies, and which kinds of analogues were practical enough to synthesise at scale within project timelines. A route that appears across many analogues in a patent probably worked reliably. A route that appears once and vanishes probably did not.
This is a very different signal from reaction databases assembled from the literature without the same programme-level constraints. Academic reaction databases tell you what is chemically possible. Patent reaction data tells you what was practically chosen inside a real campaign. For retrosynthesis models and synthesis-aware design, that distinction matters.
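As a small illustration of the "appears across many analogues" heuristic, the sketch below computes what fraction of a patent's examples used each general procedure. The procedure labels and the example-to-procedure mapping are hypothetical, and route reuse is only a proxy for reliability, not evidence of it.

```python
from collections import Counter

# Hypothetical mapping: patent example number -> general procedure used.
example_routes = {
    1: "Procedure A",
    2: "Procedure A",
    3: "Procedure A",
    4: "Procedure B",
    5: "Procedure A",
    6: "Procedure C",
}


def route_usage(routes):
    """Return the fraction of examples made by each procedure, most used first."""
    counts = Counter(routes.values())
    total = len(routes)
    return {route: n / total for route, n in counts.most_common()}


usage = route_usage(example_routes)
for route, fraction in usage.items():
    print(f"{route}: {fraction:.0%} of examples")
```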
Patent space keeps expanding. New targets appear. New chemotypes emerge. New series keep being disclosed. New optimisation strategies surface over time.
That makes patents unusually useful as a rolling out-of-domain benchmark. Train on older patent filings. Test on later disclosures. See whether your model generalises to chemistry that genuinely did not exist when it was trained.
That is a much more serious test than a random split or even a scaffold split. It asks the question that actually matters for deployment: can your model say something useful about the chemistry discovery teams are working on now, not the chemistry that was curated into a benchmark years ago?
And unlike static benchmark datasets, this test set renews itself. You do not need to wait for someone to assemble a new evaluation. The patent system delivers new chemistry on a rolling schedule.
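The temporal split described above can be sketched in a few lines. The patent identifiers, dates, and cutoff here are invented, and using the priority date as the split key is an assumption; filing or publication date would work the same way as long as the choice is applied consistently.

```python
from datetime import date

# Hypothetical extracted filings; identifiers and dates are made up.
filings = [
    {"patent": "P-001", "priority_date": date(2014, 6, 12)},
    {"patent": "P-002", "priority_date": date(2016, 1, 30)},
    {"patent": "P-003", "priority_date": date(2018, 9, 3)},
    {"patent": "P-004", "priority_date": date(2020, 2, 17)},
    {"patent": "P-005", "priority_date": date(2021, 11, 8)},
]


def temporal_split(rows, cutoff):
    """Train on filings before the cutoff, test on filings at or after it."""
    train = [row for row in rows if row["priority_date"] < cutoff]
    test = [row for row in rows if row["priority_date"] >= cutoff]
    return train, test


train, test = temporal_split(filings, cutoff=date(2019, 1, 1))
print("train:", [row["patent"] for row in train])
print("test:", [row["patent"] for row in test])
```

Splitting at the level of whole patents, rather than individual compounds, keeps every member of a series on the same side of the cutoff. Moving the cutoff forward as new filings arrive is what makes the benchmark rolling.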
Patents are messy. They are biased toward compounds worth disclosing. They are deliberately inconvenient to parse — filed by teams with no incentive to make the data easy to extract. And they are much richer in winners than in failures. You learn a lot about what was worth protecting and much less about what was quietly abandoned.
That limits what you can learn from patents alone.
But if you can reliably reconstruct molecules, measurements, synthetic routes, programme context, and temporal ordering from patent filings, the result stops looking like document extraction.
It starts to look like one of the most realistic and continuously renewing data assets available for small-molecule ML.