Concurrent Materials Design, Accelerated by AI
This article references topics presented by Dr. Michael Heiber at Enthought’s 2025 R&D Innovation Summit in Tokyo. Link to video below.Over the last...
Enthought Jun 30, 2023 11:30:00 AM
The rise of ChatGPT and Bard, both built on large language models (LLMs), has shown the potential of generative models to assist with a variety of tasks. Generative models are useful well beyond language generation, however, and one particularly promising area is molecule generation for chemical product development.
Conventional discovery of new molecules often relies on expensive and time-consuming trial and error, which means that scientists may miss many potential candidates. Generative models can instead be used to explore molecular space and propose molecules that have not been synthesized in the lab or even studied in simulations. Because these molecules are novel, they may also be patentable.
An important aspect of these generative models is how the molecule is represented. Common representations include molecular graphs, SMILES strings, and SELFIES strings. SELFIES has the additional advantage that every SELFIES string corresponds to a valid molecule, so a string can be mutated or generated from a latent space and still represent a valid molecule. These molecular representations are the input to the generative model. Popular model families include variational autoencoders (VAEs), generative adversarial networks (GANs), and normalizing flows.
SELFIES, SMILES, and molecular graph representations of the caffeine molecule
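To make the difference between a string and a graph representation concrete, here is a minimal sketch using ethanol rather than caffeine to keep it short. The data layout, the `VALENCE` table, and the helper names are illustrative assumptions, not part of any cheminformatics library. The valence check hints at why arbitrary edits to a graph or a SMILES string can produce an invalid molecule, which is the failure mode SELFIES is designed to avoid.

```python
# Toy illustration: one molecule (ethanol) in two representations.
# The graph layout and helper names below are illustrative assumptions.

ethanol_smiles = "CCO"  # string representation: heavy atoms along the chain

# Graph representation: atoms are nodes, bonds are edges with a bond order.
atoms = ["C", "C", "O"]
bonds = [(0, 1, 1), (1, 2, 1)]  # (atom index, atom index, bond order)

# Typical valences for the two elements used in this example.
VALENCE = {"C": 4, "O": 2}


def implicit_hydrogens(atoms, bonds):
    """Return how many hydrogens each heavy atom needs to fill its valence."""
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return [VALENCE[element] - u for element, u in zip(atoms, used)]


def is_valid(atoms, bonds):
    """In this simple model a graph is valid if no atom exceeds its valence."""
    return all(h >= 0 for h in implicit_hydrogens(atoms, bonds))


print(implicit_hydrogens(atoms, bonds))   # hydrogens per heavy atom
print(is_valid(atoms, bonds))             # ethanol passes the valence check
print(is_valid(["O", "O"], [(0, 1, 3)]))  # a triple-bonded O-O graph does not
```

A production workflow would rely on an established toolkit for parsing and validation rather than a hand-written valence table.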
These generative models are unsupervised, which means they can be trained on unlabeled data: the properties of the training molecules do not need to be known. The models can therefore be trained on large databases of molecules even if those databases do not include the properties of interest. For example, they can be trained on large open molecule databases such as ChEMBL or QM9, which significantly reduces the cost and complexity of data collection.
However, there are some problems with this approach. For example, the generated molecules might not satisfy all property requirements, might be difficult and costly to synthesize, or may not meet other business requirements like manufacturability. This can be especially difficult when there are multiple business requirements. Fortunately, there are several different solutions to this.
One option is to use a funnel or filter approach. Many molecules are generated, and each one is checked against a requirement. Molecules that fail are removed from the candidate pool, and the remaining molecules move on to the next filter until every filter has been applied. A filter can be a simulation, an experiment, or even a machine learning (ML) model. While this approach is simple, it has advantages. For one, cheaper filters can be applied first to reduce the number of candidate molecules, which lowers the overall cost of discovering a new molecule. A cheap filter is any test that can be run quickly and at low cost for many molecules, such as an ML model or simulation. However, the approach can still be very expensive if the remaining filters are expensive, since some lab experiments and simulations are costly and time-consuming to run.
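The funnel can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the `Filter` class, the cost figures, and the pass/fail rules (simple string checks on SMILES) are invented for illustration and do not represent real property models. The point is only the ordering: filters run from cheapest to most expensive, and each one sees only the survivors of the previous one.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Filter:
    name: str
    cost: float                    # assumed cost per molecule, arbitrary units
    passes: Callable[[str], bool]  # True if the molecule is kept


def run_funnel(candidates: List[str], filters: List[Filter]) -> Tuple[List[str], float]:
    """Apply filters from cheapest to most expensive; return survivors and total cost."""
    survivors = list(candidates)
    total_cost = 0.0
    for f in sorted(filters, key=lambda f: f.cost):
        total_cost += f.cost * len(survivors)  # pay only for molecules still in play
        survivors = [m for m in survivors if f.passes(m)]
    return survivors, total_cost


# Hypothetical stand-ins for an ML model, a simulation, and a lab experiment.
filters = [
    Filter("lab experiment", cost=100.0, passes=lambda s: s.count("O") >= 1),
    Filter("ML model", cost=0.01, passes=lambda s: len(s) <= 8),
    Filter("simulation", cost=1.0, passes=lambda s: "N" in s),
]

candidates = ["CCO", "CCN", "NCCO", "OC(=O)CN", "CCCCCCCCCCN", "c1ccccc1"]
survivors, total_cost = run_funnel(candidates, filters)
print(survivors, total_cost)
```

In this toy run the expensive filter is applied to three molecules instead of six, which is where the cost saving comes from.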
In the case where there are no robust and cheap models, another approach is to guide the exploration of these novel regions of chemical space using methods such as Bayesian optimization or active learning. Bayesian optimization would be used to find the best molecule in the chemical space, whereas active learning would be used to create accurate and robust models. The active learning approach would also be useful for creating an ML model which could be used as a cheap filter in the filter approach.
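The guided-search idea can be illustrated with a small loop that alternates between fitting a surrogate and choosing the next candidate to evaluate. This is a sketch in the spirit of Bayesian optimization, not a full implementation: a kernel-weighted average with a hand-made uncertainty proxy stands in for the Gaussian process usually used, and `hidden_property`, the one-dimensional candidate grid, `length_scale`, and `kappa` are all assumptions made for the example.

```python
import math


def hidden_property(x: float) -> float:
    """Stand-in for an expensive experiment or simulation (hypothetical)."""
    return -(x - 0.7) ** 2


def surrogate(x, observed, length_scale=0.2):
    """Kernel-weighted mean of observed values plus a crude uncertainty proxy."""
    weights = [math.exp(-((x - xi) / length_scale) ** 2) for xi, _ in observed]
    mean = sum(w * yi for w, (_, yi) in zip(weights, observed)) / sum(weights)
    uncertainty = 1.0 - max(weights)  # 0 at an observed point, near 1 far from all
    return mean, uncertainty


def guided_search(candidates, initial, n_rounds=8, kappa=0.5):
    """Repeatedly evaluate the untried candidate with the best mean + kappa * uncertainty."""
    observed = [(x, hidden_property(x)) for x in initial]
    for _ in range(n_rounds):
        tried = {x for x, _ in observed}
        remaining = [x for x in candidates if x not in tried]

        def score(x):
            mean, uncertainty = surrogate(x, observed)
            return mean + kappa * uncertainty

        best_next = max(remaining, key=score)
        observed.append((best_next, hidden_property(best_next)))
    return observed


candidates = [i / 20 for i in range(21)]  # 21 points on a 1-D grid
observed = guided_search(candidates, initial=[0.0, 1.0])
best_x, best_y = max(observed, key=lambda pair: pair[1])
print(len(observed), best_x, best_y)
```

A larger `kappa` pushes the loop toward unexplored regions, which is closer to the active learning goal of improving the model everywhere; a smaller `kappa` concentrates evaluations near the current best, which is closer to pure optimization.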
If the generative model represents molecules in a latent space, as a VAE or a normalizing flow does, one could instead train a machine learning model to predict the properties of interest from the latent representation. The problem can then be treated as an inverse design problem, where promising candidates are optimized directly in the latent space. The optimized latent representation is then decoded back into a molecule for further exploration. Because the latent space is continuous, gradient-based techniques can be used to search for the best molecule.
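As a minimal sketch of gradient-based search in a latent space, the example below runs plain gradient ascent on a made-up property model. The two-dimensional latent space, the quadratic `predicted_property`, its analytic gradient, and the step size are assumptions chosen so the result is easy to check. A real workflow would use a trained property predictor with automatic differentiation and would pass the result to the generative model's decoder, which is not implemented here.

```python
def predicted_property(z):
    """Hypothetical differentiable property model on a 2-D latent space."""
    return -((z[0] - 1.0) ** 2) - (z[1] + 2.0) ** 2


def gradient(z):
    """Analytic gradient of predicted_property with respect to z."""
    return [-2.0 * (z[0] - 1.0), -2.0 * (z[1] + 2.0)]


def optimize_latent(z_start, learning_rate=0.1, n_steps=100):
    """Gradient ascent: step along the gradient to increase the predicted property."""
    z = list(z_start)
    for _ in range(n_steps):
        g = gradient(z)
        z = [zi + learning_rate * gi for zi, gi in zip(z, g)]
    return z


z_best = optimize_latent([0.0, 0.0])
print(z_best, predicted_property(z_best))
# In a real workflow z_best would be passed to the generative model's decoder
# to obtain a candidate molecule; no decoder is implemented in this sketch.
```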
A different approach is to train the generative model only on molecules that are already known to pass your requirements, such as synthesis cost and manufacturability. Instead of learning the distribution of all molecules, the model learns the distribution of molecules that pass your requirements, and the molecules it generates are then more likely to pass them as well. However, this requires labeled data, essentially converting the problem from an unsupervised task to a supervised one. To reduce the data requirements, one can start with a pre-trained generative model and then fine-tune it on the labeled molecules. The major advantage of this approach is that fewer molecules have to be generated and tested.
At its heart ChatGPT is just trying to come up with the "best next word" over and over, building up its responses one word at a time. In some sense, an LLM is just a "sophisticated autocomplete." Therefore, these models are very good at producing semi-structured text, such as computer code, configuration files, and standardized reports (and also answers to exam questions!), because semi-structured text is even more predictable than natural language. Of course, to be able to do this, the model has to be trained on appropriate examples of the desired output, but ChatGPT has demonstrated surprising adeptness at producing small but useful routines in common programming languages just from the code examples included in its general training data, without any specific additional training.
Semi-structured text is very common in R&D contexts. It might be an algorithm to perform an analysis of some data; a section of a report on the results of an experiment; or perhaps a SQL query against a knowledge base. It may never be the same each time, but there are general patterns that it follows and expectations in formatting and style. In a traditional lab, writing these documents generally falls on the researchers, and amounts to a significant change in the flow of their work. They are no longer thinking about the research problem, but instead thinking about computer code or getting data into a document or how to connect to the database.
Leaders of R&D organizations would much rather have their scientists, engineers, and researchers focus on doing science, engineering, and research. Research at UC Irvine [1] shows that it can take up to 20 minutes to get focus back on the primary task after a distraction. By leveraging LLM-based tools to generate structured text through conversational prompts, the researcher is more likely to stay focused on the high-level research task. In the same way that regular autocomplete can speed up sending a text message, keeping you focused on the message you want to send, these tools can speed up the creation of other types of text while keeping focus on the larger task.
Of course, just as with regular autocomplete, sometimes LLMs will get things wrong, and so they still need a human in the loop.
While there are several important factors to consider when using generative models for molecules, these technologies are now mature enough that valuable and industrially viable materials informatics software solutions can be built. Integrating generative methods into your workflows can help quickly identify promising candidates that meet all of your design requirements, since a generative model for molecules can be thought of as giving you access to an effectively unlimited database of promising new molecules. Innovation leaders are investing in learning how to leverage these tools in their R&D teams to gain a competitive advantage, especially when pursuing new markets with high growth potential and room for major chemical innovation.
Watch Webinar-on-Demand: Materials Informatics for Product Development: Deliver Big with Small Data