The chemical reaction database

NNNS Chemistry blog

Previous: A game of humans against algorithms

Introducing the Precursor-Prompter 1.0

15 March 2026 - Research update 00007

Thus far this blog has been reviewing and commenting on efforts in automated synthesis prediction (computer-assisted synthesis prediction, CASP), but I have not taken on the challenge myself. Why would I? The main purpose of the website is the curation of an organic reaction dataset (doi) that others can use in their work, and I am happy with the number of downloads. I am nevertheless introducing Precursor-Prompter 1.0, an algorithm that can take any molecule and propose one or more precursors. The idea is not new and is inspired by 2025 work from Haorui Wang et al. (doi). In that research effort the prediction is broken up into several parts. In the first part the desired molecule is matched against the products in a reaction SMILES (r-SMILES) dataset via Tanimoto similarities. A high similarity suggests that the desired molecule can be synthesised in a similar way to the product in the reference reaction, assuming that all reactions in the dataset are about making complex molecules from less complex precursors. In the subsequent parts an LLM is prompted to come up with plausible precursors based on the target molecule, the reference reaction and a selection of reaction SMARTS.
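To make the matching step concrete, here is a minimal sketch in plain Python. It assumes that every molecule has already been turned into a fingerprint, represented here as a set of on-bit indices (in practice a cheminformatics toolkit would produce these). The function names are illustrative and are not taken from the supporting information.

```python
def split_reaction_smiles(rsmiles: str) -> tuple[list[str], list[str], list[str]]:
    """Split 'reactants>agents>products' into three lists of SMILES strings."""
    parts = rsmiles.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators in {rsmiles!r}")
    # Individual molecules within each part are separated by dots.
    reactants, agents, products = (part.split(".") if part else [] for part in parts)
    return reactants, agents, products


def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by all on-bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union
```

A score of 1.0 means the two fingerprints are identical and 0.0 means they share no bits. Note that splitting on dots also separates the ions of a salt, so this is only a rough way of listing the molecules in a reaction.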

The Precursor-Prompter is much less ambitious and very simple: the top-10 reference reactions are selected based on Tanimoto similarities, but the LLM is only required to make a single precursor guess based on the desired molecule as a SMILES string and the reference reaction as an r-SMILES string. A series of experiments has been set up with, as target molecules, a set of molecules mentioned in recent 2026 work by Sathyanarayana et al. (doi) that describes another LLM-centered synthesis prediction algorithm. Of course you can select any molecule for experimentation, but choosing this particular set prevents any cheating: after all, a dataset uploaded in 2025 cannot be purposely seeded with reactions to affect the outcome of experiments with molecules that were not yet known in 2025, at least to me. In the Wang reference work a Morgan fingerprint is used to calculate similarity, but this work is happy with a MACCS fingerprint based on previous experience. The dataset used is our own open-access CRD 1.44M dataset, compiled in December 2025.
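The selection and prompting steps could look roughly like the sketch below. It assumes the dataset is available as pairs of an r-SMILES string and the fingerprint bits of its product. The prompt wording is a made-up placeholder (the prompt actually used is the one in the supporting information), and all function names are hypothetical.

```python
import heapq


def tanimoto(bits_a, bits_b) -> float:
    """Tanimoto coefficient of two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def top_k_references(target_bits, dataset, k=10):
    """Return the k most similar reference reactions as (score, rsmiles), best first.

    dataset: iterable of (rsmiles, product_bits) pairs (assumed layout).
    """
    scored = ((tanimoto(target_bits, bits), rsmiles) for rsmiles, bits in dataset)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def build_prompt(target_smiles: str, reference_rsmiles: str) -> str:
    """Placeholder prompt: one target, one reference reaction, one guess."""
    return (
        f"Target molecule (SMILES): {target_smiles}\n"
        f"Reference reaction (reaction SMILES): {reference_rsmiles}\n"
        "Propose one set of precursors for the target, using the reference "
        "reaction as a hint. Answer with SMILES only."
    )
```

Using heapq.nlargest avoids sorting all 1.44M scores when only ten are needed. Each of the ten reference reactions would then get its own prompt and its own single precursor guess.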

With respect to this dataset, is bigger better? In this case I would argue that a bigger database is always better because it gives you access to more relevant reference reactions. The quality of the database is of course an issue, but the 1.44M dataset can easily be inspected via the website, where a dedicated page gives you 10 random reactions on each page refresh (link). In this sense the USPTO-50K dataset is too small and the USPTO-FULL database is hampered by noise. The Pistachio dataset referenced by Wang et al. is, to my knowledge, not publicly accessible and cannot be inspected.

For the sake of transparency and reproducibility the Python code used is also made available in the supporting information. A basic Tanimoto run takes more than an hour on 1.44M lines, but with some tweaks (pySpark) the runtime was reduced to mere minutes, good enough for the purpose of this demo. The default LLM used in this report is gpt-5.2 and the LLM has not received any additional training. The prompt used is also given in the supporting information. A run (in the web-based sandbox) typically takes minutes to complete.
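For readers without a Spark setup, a different and much simpler speed-up idea is sketched below. To be clear, this is not the pySpark tweak used here: it packs each fingerprint into a single Python integer so that the intersection and union become bitwise operations. All names are illustrative and no timings are claimed.

```python
def pack_bits(on_bits) -> int:
    """Pack an iterable of on-bit indices into one integer."""
    packed = 0
    for index in on_bits:
        packed |= 1 << index
    return packed


def popcount(value: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(value).count("1")


def tanimoto_packed(a: int, b: int) -> float:
    """Tanimoto coefficient of two bit-packed fingerprints."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def best_match(target: int, packed_products: list[int]) -> tuple[int, float]:
    """Index and score of the most similar product, or (-1, 0.0) if nothing overlaps."""
    best_index, best_score = -1, 0.0
    for index, fingerprint in enumerate(packed_products):
        score = tanimoto_packed(target, fingerprint)
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score
```

The product fingerprints only need to be packed once and can then be reused for every new target molecule.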

And now the results. The first target in the Sathyanarayana study is a compound called Ohauamine C (a tricyclic depsi-tripeptide) for which the authors envision a simple intramolecular amide formation step from the amine and the acid chloride in the first deconstruction. In this report the reference reaction with the highest similarity score is an aminal formation step from a diamine and an aldehyde. The gpt-5.2 prompt result suggests a precursor that is also an amide-amine that reacts with 4-methyl-2-oxopentanoic acid. When the reasoning is cranked up to gpt-5.4 the diamine-carbonyl reaction is intramolecular. In both cases the solution stays true to the reference reaction, the atom count checks out and the precursors look feasible. The second reference reaction is a deprotection. Structurally the reaction product is totally unrelated, but what should count is a functional group match. Unsurprisingly the LLM result is a likewise protected Ohauamine C, which is not very helpful. The third reaction type hint is also not helpful as the reference reaction is a complex Malaprade reaction / reduction sequence of a 1,2-diol. Luckily the LLM now completely ignores the reference reaction and comes up with an ester disconnection all by itself.
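An atom count check of this kind can be automated in a crude way. The sketch below is an assumption about how one might do it, not a description of the check performed for this report: it counts heavy atoms per element directly from the SMILES text with a regular expression and verifies that the proposed precursors together contain at least the atoms of the target. Surplus atoms are allowed because they can leave as by-products. The function names are hypothetical.

```python
import re
from collections import Counter

# A bracket atom such as [NH3+], or an atom from the organic subset outside brackets.
_ATOM = re.compile(r"\[([^\]]+)\]|(Cl|Br|[BCNOPSFI]|[bcnops])")
# Inside brackets: optional isotope number, then the element symbol.
_BRACKET_SYMBOL = re.compile(r"\d*(se|as|[bcnops]|[A-Z][a-z]?)")


def element_counts(smiles: str) -> Counter:
    """Count heavy atoms per element in a SMILES string (hydrogens are ignored)."""
    counts = Counter()
    for match in _ATOM.finditer(smiles):
        bracket, plain = match.groups()
        if bracket is not None:
            symbol_match = _BRACKET_SYMBOL.match(bracket)
            symbol = symbol_match.group(1) if symbol_match else ""
        else:
            symbol = plain
        symbol = symbol.capitalize()  # aromatic 'c' is counted as 'C'
        if symbol and symbol != "H":
            counts[symbol] += 1
    return counts


def precursors_cover_target(target: str, precursors: list[str]) -> bool:
    """True if the precursors together hold at least the target's heavy atoms."""
    available = Counter()
    for smiles in precursors:
        available += element_counts(smiles)
    needed = element_counts(target)
    return all(available[element] >= n for element, n in needed.items())
```

For example, ethyl acetate CC(=O)OCC is covered by acetic acid CC(=O)O plus ethanol OCC, with one oxygen to spare for the water that is lost. The check says nothing about whether the chemistry is feasible, it only catches precursors that are missing atoms.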

The second target in the Sathyanarayana study is an azepine, which you could synthesise by an intramolecular ring-closing aromatic substitution. However, the reference reactions for this target in the 1.44M dataset point toward a secondary amine methylation instead. I guess that has to do with a common way to synthesise azepines: by a reductive amination, after which the N-methylation is always the last step. Moving on to the macrolide erythromycin B as the third target: here the Precursor-Prompter LLM should propose a ring-opened precursor if the Sathyanarayana results are to be reproduced. In the Precursor-Prompter analysis another related molecule in the dataset, erythromycin A, interferes and the reference reactions are not at all helpful. With target molecule number 4, reserpine, the Precursor-Prompter analysis and the Sathyanarayana study are in agreement: the first disconnect is an esterification. The same can be said of the final target, discodermolide, where a terminal diene tail end is assembled in a Negishi coupling or a Stille coupling with a vinyl fragment.

In conclusion, the Tanimoto similarities do not always produce relevant reference reactions to work with; a set of rules should be devised to quickly dismiss reactions before they waste time in the LLM mill. The tested LLM (gpt-5.2) is surprisingly good at analysing the forward reaction and surprisingly good at retrosynthesis, but also surprisingly slow. At the outset of the project I anticipated that a reaction type hint should be added to the prompt, but this is not required at all. The LLM is also surprisingly good at SMILES: in only one instance was the suggested SMILES string unparsable, and hallucinations were not evident. The supporting information is available at Zenodo.
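What such a set of rules might look like is sketched below, purely as an illustration. Both rules and the 0.7 threshold are hypothetical and untested. The sketch also assumes that the similarity score and the heavy-atom counts have been precomputed for each reference reaction.

```python
from dataclasses import dataclass


@dataclass
class Reference:
    """A candidate reference reaction with precomputed fields (assumed layout)."""
    rsmiles: str
    similarity: float
    product_heavy_atoms: int
    largest_reactant_heavy_atoms: int


def too_dissimilar(ref: Reference, threshold: float = 0.7) -> bool:
    """Hypothetical rule: the product resembles the target too little."""
    return ref.similarity < threshold


def builds_no_complexity(ref: Reference) -> bool:
    """Hypothetical rule: the product is no larger than its biggest reactant,
    as is typical for a deprotection."""
    return ref.product_heavy_atoms <= ref.largest_reactant_heavy_atoms


RULES = [too_dissimilar, builds_no_complexity]


def prefilter(references, rules=RULES):
    """Keep only the references that no rule dismisses."""
    return [ref for ref in references if not any(rule(ref) for rule in rules)]
```

Rules like these are cheap to evaluate, so they can run on all ten candidates before any LLM call is made. The second rule would have dismissed the deprotection hint for Ohauamine C, but a rule this blunt may also throw away useful references, so it would need testing on more targets.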