Utilizing LLMs Today in Industrial Materials and Chemical R&D

By Vaibhav Palkar, PhD and Mike Heiber, PhD


Large language models (LLMs) are exciting and potentially transformative tools that should be part of every materials and chemical R&D organization's technology toolkit. Despite the buzz around LLMs as all-encompassing problem solvers, in practical applications they work as one part of a well-engineered solution that involves several other important digital technologies.

Based on our work with customers and review of the most recent academic literature related to LLM technologies within materials and chemical R&D, we find two categories of use cases to be most mature and ready for adoption in industry: knowledge extraction and lab assistants.



LLM Technical Concepts

Large language models (LLMs) are a subset of generative AI: deep-learning-based foundation models trained on “large” sets of text data (e.g. chunks of the internet like Wikipedia and GitHub) that require a “large” number of model parameters, on the order of tens of billions to a few trillion. There is now a rich ecosystem of available LLMs, ranging from large, expensive, closed source models (GPT, Claude, Gemini) to smaller, cheaper, open source models (Llama, Mixtral, Gemma).

First, you should be familiar with three main technical concepts around LLMs:

Zero-Shot and Few-Shot Learning

LLMs have a wide range of “zero-shot” capabilities, meaning they can be used to tackle many tasks without explicit training for the specific task. They can also demonstrate emergent “few-shot” capabilities, where only a small number of training examples need to be provided for the LLM to learn new patterns.

Context Window

Obtaining a response (inference) from an LLM is done by providing it with input text (prompt). The context window is the amount of additional text a model considers at the time of inference. In a chat with an LLM, the context window is the entire chat history of that session, including both the user’s prompts and the model-generated responses.*

Function Calling

LLMs can connect to arbitrary software tools using a “function calling” API. Natural language can be used to describe a function’s purpose, inputs, and outputs so that the LLM learns when to call such functions. In the context of materials and chemical R&D, such function calling can be used, for example, to run simulations, search databases, and retrieve quantitative information such as molecular weight of compounds.
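
As a rough illustration of function calling, the sketch below describes one tool in natural language and routes a tool request to it. The tool name `get_molecular_weight`, the layout of the description dictionary, the small lookup table, and the JSON request string (a stand-in for what a model might emit) are all assumptions made for illustration, not any vendor's actual API.

```python
import json

# Hypothetical lookup table (g/mol); a real tool would query a database.
MOLECULAR_WEIGHTS = {"water": 18.015, "ethanol": 46.069, "benzene": 78.114}


def get_molecular_weight(compound: str) -> float:
    """Return the molecular weight in g/mol for a known compound name."""
    return MOLECULAR_WEIGHTS[compound.lower()]


# Natural-language description handed to the LLM so it knows when to call the tool.
# The dictionary layout is illustrative, not a specific provider's schema.
TOOL_SPEC = {
    "name": "get_molecular_weight",
    "description": "Look up the molecular weight (g/mol) of a compound by name.",
    "parameters": {"compound": "Common name of the compound, e.g. 'ethanol'"},
}

TOOLS = {"get_molecular_weight": get_molecular_weight}


def dispatch(tool_call_json: str) -> float:
    """Run the function named in a (simulated) model tool call."""
    call = json.loads(tool_call_json)
    func = TOOLS[call["name"]]
    return func(**call["arguments"])


# Stand-in for the structured request a model might produce.
simulated_call = '{"name": "get_molecular_weight", "arguments": {"compound": "Ethanol"}}'
result = dispatch(simulated_call)
```

The model only ever sees `TOOL_SPEC`; the numerical answer comes from ordinary deterministic code.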

*Anthropic’s Claude 2.1 sets the current industry state of the art for the maximum context window, standing at roughly 150,000 words (i.e. 200,000 tokens). Google’s latest teaser of Gemini 1.5 Pro promises an order of magnitude larger context window. (As of March 2024)
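
To make the context window limit concrete, here is a minimal sketch that estimates token counts from word counts using the ratio implied by the footnote above (150,000 words to 200,000 tokens) and keeps only the most recent chat messages that fit a budget. The function names and the tiny budget are hypothetical, and a whitespace word count is only a crude stand-in for a real tokenizer.

```python
WORDS_PER_TOKEN = 150_000 / 200_000  # ratio implied by the footnote (0.75)


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


def trim_history(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose estimated total fits the window."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))


history = [
    "What is the glass transition temperature of polystyrene?",
    "It is commonly reported near 100 degrees Celsius.",
    "How does molecular weight affect it?",
]
# A deliberately small budget so the oldest message gets dropped.
recent = trim_history(history, max_tokens=20)
```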

What goes into productionizing LLMs?

Despite the commendable prowess of LLMs for many different tasks, the scope of what can be directly achieved with a foundation model alone is limited, especially for tasks in niche, highly technical domains like scientific R&D. 

Getting useful results from an LLM requires developing a well-engineered solution that also leverages other important software tools. The top considerations that go into building and productionizing such a solution can be broadly classified into the categories below.

LLM Choice

The very first decision is which LLM(s) to try. Closed source LLMs, such as GPT-4, are high performing but come with higher operational costs. The smaller open source alternatives trail slightly behind in performance at much lower operational costs. Licensing and deployment costs are other important factors in deciding between open and closed source LLMs. Whichever model you start with should be integrated into the solution in a modular fashion, so that you can easily test different models and swap in higher performing and/or lower cost models as they are released.
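
One way to keep the model choice modular is to have application code depend on a small interface rather than on a specific provider. The sketch below is an assumption about how that could look: `TextModel`, `EchoModel`, the registry keys, and `summarize` are hypothetical names, and `EchoModel` merely echoes its prompt in place of a real hosted or local LLM.

```python
from typing import Protocol


class TextModel(Protocol):
    """Minimal interface the rest of the solution depends on."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in backend; a real one would wrap a hosted or local LLM."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# Registry of interchangeable backends, keyed by a configuration string.
MODEL_REGISTRY: dict[str, TextModel] = {
    "closed-large": EchoModel("closed-large"),
    "open-small": EchoModel("open-small"),
}


def summarize(text: str, model_key: str) -> str:
    """Application code selects a backend by key, never a vendor SDK directly."""
    model = MODEL_REGISTRY[model_key]
    return model.complete(f"Summarize: {text}")


a = summarize("DFT results for candidate A", "closed-large")
b = summarize("DFT results for candidate A", "open-small")
```

Swapping models then becomes a configuration change, which also makes side-by-side comparisons on the same inputs straightforward.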

LLM Performance Improvement

Once you have selected your initial model, you need to figure out how to use it to produce the intended behavior with sufficient accuracy. The first and quickest way to adjust the model output is through prompt engineering. In most R&D cases, this provides some initial improvement, but more advanced optimization is needed to achieve sufficient performance. Retrieval Augmented Generation (RAG) and fine-tuning are the most common and effective methods, but broadly, they are suited to two different kinds of optimization [2].

Prompt Engineering

In prompt engineering, one modifies the prompt text to obtain a better result from the model. Prompt engineering methods span from implementing slightly abstract general guidelines, such as “writing clear instructions”, to scientific approaches, such as Chain-of-Thought (CoT) prompting [1]. There are several available tools (e.g. LangChain) that aid with the systematic exploration of input prompts.
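
A minimal sketch of a few-shot Chain-of-Thought prompt follows: each worked example shows its reasoning before stating the answer, and the new question is appended at the end. The example questions, the instruction wording, and the function name `build_cot_prompt` are illustrative assumptions; in practice the wording is exactly what one would explore systematically.

```python
# Worked examples that show their reasoning (hypothetical, for illustration).
FEW_SHOT_EXAMPLES = [
    (
        "How many carbon atoms are in two molecules of ethanol (C2H6O)?",
        "Each ethanol molecule has 2 carbon atoms. 2 molecules x 2 atoms = 4. The answer is 4.",
    ),
    (
        "How many hydrogen atoms are in three molecules of water (H2O)?",
        "Each water molecule has 2 hydrogen atoms. 3 molecules x 2 atoms = 6. The answer is 6.",
    ),
]


def build_cot_prompt(question: str, examples=FEW_SHOT_EXAMPLES) -> str:
    """Assemble a few-shot prompt whose examples demonstrate step-by-step reasoning."""
    lines = ["Answer the question. Show your reasoning, then state the answer."]
    for example_question, example_answer in examples:
        lines.append(f"Q: {example_question}")
        lines.append(f"A: {example_answer}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # left open for the model to complete
    return "\n".join(lines)


prompt = build_cot_prompt("How many oxygen atoms are in four molecules of CO2?")
```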

Retrieval Augmented Generation (RAG)

RAG involves augmenting the prompt with results from a search algorithm to provide new contextual information to the LLM, such as domain specific knowledge. Setting up a RAG pipeline for a company involves many steps, starting from an existing knowledge base: cleaning the knowledge base, data parsing and ingestion, chunking, indexing, embedding, retrieval, and compression. Each step can and should be optimized so that the best information is included with the original prompt. Since RAG may add a lot of domain specific information to the prompt, it consumes a significant part of the context window and results in higher operational costs. 
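
The chunking, embedding, retrieval, and prompt augmentation steps can be sketched in a few lines. This is a toy version under stated assumptions: the "embedding" is a plain word count rather than a learned model, the knowledge base entries are invented, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 12) -> list[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts (a real pipeline would use a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


def augment(query: str, context: list[str]) -> str:
    """Prepend retrieved context to the original question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Use only this context:\n{joined}\n\nQuestion: {query}"


# Hypothetical knowledge base entries.
knowledge_base = [
    "Batch 17 of resin R-204 cured at 120 C showed improved adhesion.",
    "Supplier audit notes for solvent deliveries in the third quarter.",
]
chunks = [piece for doc in knowledge_base for piece in chunk(doc)]
question = "Which cure temperature improved adhesion for resin R-204?"
top = retrieve(question, chunks)
prompt = augment(question, top)
```

Every retrieved chunk lengthens the prompt, which is where the context window and cost pressure comes from.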


Fine-Tuning

Fine-tuning involves tuning the model weights (parameters) and is better suited to adapting the outputs of an LLM to a particular format. For example, the popular ChatGPT application lets users choose between two foundation models, GPT-4 and GPT-3.5, that are fine-tuned on chat conversations so that the model behaves like a chatbot. Fine-tuning requires constructing a large, clean dataset to train the model before use and hence results in a higher upfront cost. However, a fine-tuned model typically requires smaller input prompts and hence is less expensive to operate.
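
The cost tradeoff between a higher upfront cost and a lower cost per call can be framed as a simple break-even calculation. The sketch below is only the arithmetic: the function name is hypothetical and the figures are placeholders in arbitrary cost units, not measured prices.

```python
import math


def break_even_calls(upfront_cost: float, cost_per_call_rag: float, cost_per_call_finetuned: float) -> int:
    """Number of calls at which fine-tuning's total cost no longer exceeds RAG's."""
    saving = cost_per_call_rag - cost_per_call_finetuned
    if saving <= 0:
        raise ValueError("Fine-tuned calls must be cheaper for a break-even point to exist.")
    return math.ceil(upfront_cost / saving)


# Placeholder figures chosen only to illustrate the arithmetic.
calls = break_even_calls(upfront_cost=4000.0, cost_per_call_rag=0.5, cost_per_call_finetuned=0.25)
```

With these placeholder values the upfront investment is recovered after 16,000 calls; below that usage level, the longer RAG prompts remain the cheaper option overall.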

Tool Development & Orchestration

Another important part of productionization-related work is developing tools and documentation to pass to the LLM’s function calling API. Having the LLM as the user of such tools needs to be taken into consideration while developing them. These tools can also include other LLMs that are optimized for particular subtasks. Complex solutions can utilize multiple LLM and non-LLM tools to perform various subtasks in the overall workflow, from improving the context of the RAG pipeline to improving the responses to specific types of prompts. 
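
Because the LLM is the "user" of these tools, their documentation is effectively part of the interface. One possible approach, sketched below, is to generate the LLM-facing description directly from each function's signature and docstring so the two cannot drift apart. The tool functions, `describe_tool`, and the output layout are hypothetical and do not follow any particular provider's schema.

```python
import inspect


def celsius_to_kelvin(temperature_c: float) -> float:
    """Convert a temperature from degrees Celsius to kelvin."""
    return temperature_c + 273.15


def molar_concentration(moles: float, volume_l: float) -> float:
    """Compute molar concentration (mol/L) from moles of solute and solution volume in liters."""
    return moles / volume_l


def describe_tool(func) -> dict:
    """Build an LLM-facing description from a function's signature and docstring."""
    signature = inspect.signature(func)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func),
        "parameters": {
            name: param.annotation.__name__ for name, param in signature.parameters.items()
        },
    }


TOOLBOX = [celsius_to_kelvin, molar_concentration]
tool_docs = [describe_tool(tool) for tool in TOOLBOX]
```

Writing docstrings with units and expected inputs spelled out matters more here than in ordinary code, since the model has nothing else to go on.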

LLMs in Industrial Materials and Chemical R&D

How well do LLMs understand domain-specific technical terminology, concepts, and relationships? More importantly, can they understand the narrow subdomain of interest to a business—in this case, materials science and chemistry—and incorporate the intricacies of a company’s process and terminology? 

Within the general materials science and chemistry domains, there are several benchmarks developed to answer these questions, including ChemLLMBench [3] and MaScQA [4]. Evaluating model performance on domain-specific tasks is essential when developing solutions that are effective at the enterprise level.

LLMs' ability to convert natural language into actions is powerful in all industries. For materials and chemical R&D specifically, this means LLMs provide the ability to quickly go from an idea in a scientist’s head to the systematic exploration of possibilities, making them excellent candidates as assistants in lab work. In addition, few-shot learning is especially beneficial in materials and chemical research since obtaining large amounts of data for training traditional ML models involves conducting many experiments and is often a significant bottleneck.

However, being language models, LLMs only work well for a limited set of tasks and do not predict numerical values very well. Also, the tendency of LLMs to ‘hallucinate’ and demonstrate overconfidence in incorrect answers is particularly harmful in research, where there are significant consequences if scientists are misled in arbitrary directions. Several techniques help mitigate these risks. RAG helps prevent hallucinations by providing critical contextual information within the prompt, so that the model does not have to generate such information itself. It can also be used to provide citations, enabling a human expert to quickly distinguish between facts and model hallucinations. Special prompting techniques such as Chain-of-Thought [1] or Self-Consistency [5] prompt the model to elaborate on its reasoning and/or evaluate the previous thoughts leading to the response. The function calling API can also replace critical parts of the generated model response with output from deterministic functions optimized for a particular subtask.
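
In the form described in reference 5, self-consistency samples several reasoning paths for the same prompt and keeps the most frequent final answer. The sketch below shows only the voting step; `fake_sample` is a hypothetical stand-in that replays canned final answers in place of a model sampled at non-zero temperature.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(sample: Callable[[str], str], prompt: str, n_samples: int = 5) -> tuple[str, float]:
    """Sample several answers and return the most common one with its vote share."""
    answers = [sample(prompt) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


# Canned final answers standing in for five independently sampled model responses.
_canned = iter(["78.11", "78.11", "72.10", "78.11", "78.11"])


def fake_sample(prompt: str) -> str:
    """Hypothetical sampler; a real one would call the LLM and parse its final answer."""
    return next(_canned)


answer, agreement = self_consistent_answer(fake_sample, "What is the molar mass of benzene in g/mol?")
```

A low vote share is itself a useful signal: it can be used to flag a response for human review rather than passing it on as fact.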

Use Cases Ready for Adoption Today

Again, two primary use case categories for LLMs in materials science and chemical R&D rise to the top for practical adoption today.

1. Knowledge Extraction & Summarization

LLMs are excellent for knowledge extraction and summarization and are already being utilized in materials science and chemistry for many use cases, for example, auto-generating books that summarize scientific papers [6].

There are several steps within a materials and chemistry company's workflow where such knowledge summarization is valuable, including but not limited to market research, chemical synthesis planning, computational screening, and responding to customer service requests. In each of these cases, the specifics of search are very different in terms of types of relevant queries, knowledge sources to search over, and kinds of information within these sources that need to be retrieved. For example, when conducting market research for a new material or chemical, one would need to search over websites of competitors, internal documents, and existing patents for determining information such as properties of existing materials, costs, synthesis options, etc. On the other hand, for responding to customer service requests, search needs to be performed over email communications, issue trackers, and internal knowledge bases for names of specific chemicals, issue descriptions etc. To obtain the relevant information in these cases, you may also need specialized tools for extracting it from chemical names, tables, figures, and images. 

RAG is a popular and important technique in domain-specific knowledge extraction. A RAG system can be set up either over an internal repository of documents or an online source of publications such as arXiv or Google Scholar. The performance of the RAG pipeline, however, depends strongly on the performance of the search algorithm that is part of it. Therefore, before optimizing the LLM-based summarization component of the pipeline, it is important to optimize the search algorithm. LLMs can help improve search performance by reranking results from a cheaper algorithm. They can also be used to construct an initial validation data set by asking them the inverse question: given some input chunks of text, generate an appropriate search query which would have these chunks as results.
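
Once a validation set of generated queries and their source chunks exists, the search algorithm can be scored before any summarization work begins. The sketch below computes recall at k, one common choice of metric; the queries, chunk identifiers, and ranked results are invented for illustration.

```python
def recall_at_k(results: dict[str, list[str]], relevant: dict[str, str], k: int) -> float:
    """Fraction of queries whose source chunk appears in the top-k search results."""
    hits = sum(1 for query, chunk_id in relevant.items() if chunk_id in results[query][:k])
    return hits / len(relevant)


# Hypothetical validation set: each query was generated from one known chunk.
validation = {
    "cure temperature for resin R-204": "chunk-3",
    "solvent supplier audit findings": "chunk-7",
    "tensile strength after UV aging": "chunk-9",
    "viscosity of batch 12": "chunk-2",
}

# Hypothetical ranked output of the search algorithm under test.
search_results = {
    "cure temperature for resin R-204": ["chunk-3", "chunk-1", "chunk-7"],
    "solvent supplier audit findings": ["chunk-4", "chunk-7", "chunk-3"],
    "tensile strength after UV aging": ["chunk-5", "chunk-6", "chunk-8"],
    "viscosity of batch 12": ["chunk-2", "chunk-9", "chunk-1"],
}

r1 = recall_at_k(search_results, validation, k=1)
r3 = recall_at_k(search_results, validation, k=3)
```

A gap between recall at 1 and recall at 3, as in this toy data, suggests the right chunks are being found but ranked too low, which is the situation where LLM-based reranking is most likely to help.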

The bottom line: LLM-based knowledge solutions can be extremely useful in a variety of different ways but need to be tailored to specific use cases to be most effective.

2. Lab Assistants & Automations

As mentioned, due to the natural language capabilities of LLMs, a system of one or more of them can serve as a capable lab assistant that automates tasks and abstracts them away from the scientist.

The function calling API is a key feature for building such a system as it provides an interface between the LLM and arbitrary scientific code. The depth of implementation for such an assistant or automation system can range from a simple assistant performing a very specific task to full automation of a complex array of several tasks. The right level of implementation for an organization depends on the organization’s business and the existing workflows, the value generated, as well as the human and physical constraints.

ChemCrow [7] and CRESt [8] are LLM-based systems that act as a central agent making complex decisions (e.g. when to search for new knowledge) and calling on external entities to perform complex actions (e.g. executing an automated chemical reaction). Although systems like ChemCrow and CRESt demonstrate the possibility of using LLMs to autonomously perform complex planning and execution, the robustness of such fully autonomous systems has not been investigated.

Instead of naively targeting full automation, a good starting point for companies could, for example, be a “Molecule Design Assistant”. An LLM-based assistant can be developed that allows chemists to provide chemical ideas in natural language and apply them to molecule design spaces, generate additional molecule candidates, and analyze candidate molecules using various computational tools. The central idea behind such an assistant is to connect the LLM to tools that perform specific tasks such as: 

  • converting SMILES representation to atom coordinates
  • performing DFT calculations of the energy difference between a molecule’s HOMO and LUMO
  • looking up information in public databases (PubChem, etc.)
  • predicting synthesizability/cost
  • recommending retrosynthesis
  • predicting properties using an ML model

Such a solution combines the natural language capabilities of LLMs with the power of modern physics-based and data-driven computational chemistry tools to make it easier for experimental researchers to leverage powerful computational tools that help them with R&D decision-making. 

The Time is Now

LLMs have come on the scene rapidly and the technologies will continue to improve alongside other digital technologies, but leveraging them in materials science and chemical R&D isn't just a speculative venture for some “AI future.” They should be part of your current innovation strategy and can be implemented today. Companies that take advantage of the unprecedented opportunities that LLMs provide will lead the market in an increasingly technology-driven industry.

Contact us if you would like more specific advice and help developing an LLM-based solution that’s right for your company. 





  1. Chain-of-thought prompting elicits reasoning in large language models. (2022) Advances in Neural Information Processing Systems, 35, 24824-24837.
  2. RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture. (2024) arXiv preprint arXiv:2401.08406.
  3. What can large language models do in chemistry? a comprehensive benchmark on eight tasks. (2024) Advances in Neural Information Processing Systems, 36.
  4. MaScQA: investigating materials science knowledge of large language models. (2024) Digital Discovery, 3(2), 313-327.
  5. Self-consistency improves chain of thought reasoning in language models. (2022) arXiv preprint arXiv:2203.11171.
  6. ChemDataWriter: a transformer-based toolkit for auto-generating books that summarise research. (2023) Digital Discovery, 2(6), 1710-1720.
  7. ChemCrow: Augmenting large-language models with chemistry tools. (2023) arXiv preprint arXiv:2304.05376.
  8. CRESt – Copilot for Real-world Experimental Scientist. (2023) doi:10.26434/chemrxiv-2023-tnz1x-v4 | Li Group MIT (2023, July 2). CRESt - Copilot for Real-world Experimental Scientists. YouTube. https://www.youtube.com/watch?v=POPPVtGueb0