Patent Data Is the Most Underused Asset in Small-Molecule AI

By Blaise AI Team

This kind of programme-level data — activity, selectivity, ADME, assay context, series structure, exemplified chemistry — is not unique to patents. Academic journals publish it. Public databases aggregate it. Some curated datasets do a good job of structuring it.

But patents contain an enormous amount of it, and most of it goes unused.

Not because the data is hidden. It is published, in the technical sense. The problem is that patents are adversarially produced. The filing company has every incentive to disclose enough to secure protection, and no incentive whatsoever to make that disclosure easy to parse, extract, or computationally reuse. Markush structures, inconsistent table formats, deliberately broad language, non-standard naming, data scattered across claims, examples, and supplementary tables — all of this makes automated extraction genuinely hard.

So the data is there. It is rich. And it is surrounded by a moat that most pipelines do not cross.

Why is this data there at all?

If a company wants meaningful patent protection around a chemical series, it cannot just gesture at a vague idea. It needs to disclose compounds and provide enough support for the claimed invention to look real, enabled, and valuable. In pharma, that usually means showing concrete examples and associated biological data.

So patents become a strange but powerful artefact: part legal document, part scientific disclosure, part compressed snapshot of a real discovery programme.

The data inside them was generated under the pressure of programme decisions, IP strategy, and competitive timelines. That makes it much closer to the distribution of molecules that actually matter in practice than anything you get from a curated benchmark.

And there is simply more of it than in any journal. A single well-exemplified patent can contain hundreds of compounds with associated biodata. The volume dwarfs what a typical academic paper discloses. The challenge is not access — patents are public documents. The challenge is extraction at scale from documents that were never designed to be machine-readable.

A natural source of out-of-domain evaluation

Here is where it gets genuinely useful for model builders.

Patent space keeps expanding. New series keep being disclosed. New targets appear. New chemotypes emerge. New assay setups and optimisation strategies show up over time.

That means patent data does something most small-molecule benchmarks do badly: it gives you a natural source of future, out-of-domain evaluation.

Train on older patent space. Test on later disclosures. See whether the model generalises to genuinely new chemistry, not just a random scaffold split that still leaks project conventions.

That is much closer to the real question: can your model say something useful about chemistry that did not exist when it was trained?

Random splits cannot answer that. Scaffold splits get partway there. Temporal splits on patent data get much closer.
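A temporal split of this kind is mechanically simple. The sketch below assumes patent-derived records carrying a structure, a measurement, and the filing date of the source patent; the field names and data are illustrative, not the output of any particular extraction pipeline.

```python
from datetime import date

# Minimal sketch of a temporal split on patent-derived records.
# Each record carries a structure, a measurement, and the filing date
# of the patent it came from (field names are illustrative).
records = [
    {"smiles": "CCO",       "pIC50": 5.1, "filed": date(2015, 3, 1)},
    {"smiles": "CCN",       "pIC50": 6.2, "filed": date(2017, 8, 9)},
    {"smiles": "c1ccccc1O", "pIC50": 7.4, "filed": date(2021, 1, 15)},
    {"smiles": "c1ccccc1N", "pIC50": 6.9, "filed": date(2022, 6, 30)},
]

def temporal_split(records, cutoff):
    """Train on patents filed before the cutoff, test on later ones."""
    train = [r for r in records if r["filed"] < cutoff]
    test = [r for r in records if r["filed"] >= cutoff]
    return train, test

train, test = temporal_split(records, date(2020, 1, 1))
# Every test compound was disclosed after everything in the training set,
# so the evaluation asks about chemistry the model has never seen.
```

Unlike a random or scaffold split, the cutoff here is a date, so leakage of later project conventions into the training set is ruled out by construction.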

The time series you did not know you had

The temporal signal is not limited to comparing patents filed in different years. There is a second, subtler time series baked into the structure of a single patent.

Look at the reaction routes. Patents typically disclose how their exemplified compounds were made. That means you can extract which synthetic transformations were used across a series and start to see which reactions actually proceeded. When a route appears repeatedly across many analogues, it probably worked reliably. When a route appears once and vanishes, it probably did not.
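That frequency argument can be sketched as a simple count over extracted routes. The route labels below are hypothetical stand-ins; in practice they would come from parsing the experimental section of the patent.

```python
from collections import Counter

# Illustrative sketch: count how often each synthetic transformation
# recurs across a patent's exemplified compounds. Reaction labels are
# hypothetical; real ones would come from route extraction.
routes = {
    "Example 1": ["amide coupling", "Boc deprotection"],
    "Example 2": ["amide coupling", "Boc deprotection"],
    "Example 3": ["amide coupling", "reductive amination"],
    "Example 4": ["Suzuki coupling"],  # appears once, then vanishes
}

step_counts = Counter(step for route in routes.values() for step in route)

# Transformations reused across many analogues probably ran reliably;
# one-off steps are weaker evidence.
reliable = {step for step, n in step_counts.items() if n > 1}
```

Here the repeated amide coupling ends up in `reliable`, while the one-off Suzuki coupling does not, mirroring the "appears once and vanishes" heuristic above.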

Now look at the fragments. The frequency and complexity of different R-groups across a patent’s compound table is not random. It reflects the order in which the team explored chemical space. Early in a programme, you see simple, broadly varied substituents — the team is scanning. Later, you see more complex, targeted modifications — the team found something worth optimising and started probing specific vectors.

That pattern of fragment count, fragment complexity, and which positions get explored encodes the programme’s optimisation trajectory. Which R-groups improved potency. Which killed solubility. Which vectors were abandoned. Which were worth elaborating.

You do not need the team’s internal timeline to recover this. The chemistry itself carries the signal.
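One crude way to surface that signal: track R-group complexity against example order, using heavy-atom count as a rough complexity proxy. Everything below is illustrative, including the assumption that example numbering tracks exploration order, which holds often but not always.

```python
# Rough sketch: R-group complexity along a patent's example order.
# Heavy-atom count is a crude complexity proxy; example numbering is a
# stand-in for exploration order; the data are invented for illustration.
def heavy_atoms(smiles: str) -> int:
    """Crude heavy-atom count: one per element letter, ignoring H."""
    count = 0
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch.isalpha() and ch.upper() != "H":
            count += 1
            # Two-letter organic-subset symbols (Cl, Br) are one atom.
            if smiles[i:i + 2] in ("Cl", "Br"):
                i += 1
        i += 1
    return count

# (example number, R-group SMILES) in order of disclosure
r_groups = [
    (1, "C"), (2, "CC"), (3, "F"),              # early: simple scans
    (40, "CC(C)Oc1ccncc1"), (41, "CN1CCOCC1"),  # late: targeted elaboration
]

early = [heavy_atoms(s) for n, s in r_groups if n < 10]
late = [heavy_atoms(s) for n, s in r_groups if n >= 10]
# Mean complexity rising from early to late examples is consistent with a
# team moving from scanning substituents to optimising a specific vector.
```

A real implementation would use a proper cheminformatics toolkit for atom counts and fragment decomposition; the point is only that the trajectory is computable from the disclosed chemistry alone.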

So patent data does not just give you a between-patent time series across the field. It gives you a within-patent time series that approximates the logic of a single programme’s evolution. That is a much richer training signal than a flat table of structures and activities, and it is hiding in plain sight in every well-exemplified patent.

The obvious weakness

Patent data is rich in positives and optimised compounds, but weak in true negatives. It tells you a lot about what a team chose to disclose, and much less about all the compounds that were made, deprioritised, or quietly abandoned.

That bias is real and it limits what you can learn from patents alone. You will always know more about the winners than the failures.

So patents are not the whole answer.

But they may be one of the best available windows into how real small-molecule programmes actually look in the wild.

What the opportunity actually looks like

The point is not “extract more molecules from PDFs”. Structures alone are table stakes.

The real opportunity is turning patents into a living dataset of molecules + measurements + context + time.

If you can reliably extract not just structures, but the attached data and provenance, patents start to look less like a static database and more like a continuously renewing stream of medicinal chemistry intelligence.
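Concretely, a record in such a stream would need to bind all four dimensions together. The schema below is a hypothetical sketch, not any real pipeline's output format; every field name and value is illustrative.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical record schema for "molecules + measurements + context +
# time". Field names are illustrative, not a real extraction format.
@dataclass
class PatentMeasurement:
    smiles: str                # extracted structure
    value: float               # measured quantity, e.g. IC50 in nM
    assay: str                 # assay context as disclosed
    patent_id: str             # provenance: which document
    example_no: Optional[int]  # within-patent position (rough time signal)
    filed: date                # between-patent time signal

m = PatentMeasurement(
    smiles="c1ccccc1O",
    value=120.0,
    assay="biochemical IC50",
    patent_id="WO2021/000001",  # placeholder identifier
    example_no=17,
    filed=date(2021, 4, 2),
)
```

Structures alone fill only the first field; it is the provenance and time fields that turn a static table into the stream described above.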

That stream has properties you cannot easily get elsewhere. It is grounded in real project decisions. It covers diverse target classes. It spans multiple organisations and therapeutic areas. And it keeps growing without anyone having to curate it.

For anyone building small-molecule AI that is supposed to work on real programmes, ignoring that stream looks increasingly hard to justify.
