The Importance of Large Language Models in Science Even If You Don’t Work With Language


OpenAI's ChatGPT, Google's Bard, and other similar Large Language Models (LLMs) have made dramatic strides in their ability to interact with people using natural language. Users can describe what they want done and have the LLM "understand" and respond appropriately. 

While R&D leaders and their scientists are aware of ChatGPT, most are unclear about what the technology means for them because their data isn't natural language. Scientific data differs from traditional business data and requires special handling: much of R&D data isn't text at all. It's time series, images, video, spectra, molecular structures, or any number of other data sources in a myriad of formats.

Even if your primary data isn't text-based, every lab still does significant work with text: reports, code, configuration files, and so on.

The technology behind tools like ChatGPT provides a new flexibility that can lift a significant amount of the burden of these text-based workflows. More importantly, the advances in AI models that permit ChatGPT's dramatically better conversational context are revolutionizing the ability of AI models to work with deeper relationships in non-language data as well.

This means that innovative research organizations are in a unique position to benefit from these new types of tools. LLMs have the potential to take away some of the drudgery and distraction of text-based tasks like report generation and code writing, letting the domain experts in your organization focus on what they are best at—the science.

The Challenge of R&D Automation

Automation usually requires very standardized processes and is ideal for organizations that are doing the same thing over and over, whether on the factory floor, producing sales reports, or drafting business documents: any variables are well understood and constrained. R&D is, almost by definition, the opposite of standardized. R&D organizations are constantly taking on different projects, trying out new equipment, and testing new processes. Successful automation in an R&D context needs to be flexible without needing constant human intervention.

The most obvious difference between this new generation of language models and the older ones (embodied in tools like Siri and Alexa) is the ability to keep the context of the conversation over a much longer time. These large language models can "remember" what you are talking about over many back-and-forth prompts and responses. This is possible due to advances in the architecture of the AI models, which allow the models to be trained more efficiently, permitting deeper context from the same resources.

The innovations from text-based models can be applied just as readily to other data types. New designs can be built and efficiently trained to recognize relationships in other situations, such as tracking cause and effect in time series or video data, or spatial relationships in images. While these models aren't getting the same level of coverage in the popular press as the text-based ones, we are starting to see them emerge; ChemBERT in drug discovery is one example. As they mature, they are likely to bring the same sort of qualitative change to the analysis of scientific data.

LLMs and Scientific Text Workflows

At its heart, ChatGPT is just trying to come up with the "best next word" over and over, building up its responses one word at a time. In some sense, an LLM is just a "sophisticated autocomplete." Therefore, these models are very good at producing semi-structured text, such as computer code, configuration files, and standardized reports (and also answers to exam questions!), because semi-structured text is even more predictable than natural language. Of course, to be able to do this, the model has to be trained on appropriate examples of the desired output, but ChatGPT has demonstrated surprising adeptness at producing small but useful routines in common programming languages just from the code examples included in its general training data, without any specific additional training.
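To make the "sophisticated autocomplete" idea concrete, here is a minimal sketch of next-token generation using the open-source GPT-2 model via the Hugging Face transformers library, a small public stand-in for ChatGPT (whose weights aren't public). Production chatbots sample more cleverly than a plain argmax, but the one-token-at-a-time loop is the same idea.

```python
# A minimal sketch of "best next word" generation. GPT-2 stands in for
# ChatGPT here; greedy argmax decoding is shown for clarity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The experiment showed that", return_tensors="pt").input_ids
for _ in range(20):                        # build the response one token at a time
    with torch.no_grad():
        logits = model(ids).logits
    next_id = logits[0, -1].argmax()       # the single "best next" token (word piece)
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```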

Semi-structured text is very common in R&D contexts. It might be an algorithm to perform an analysis of some data; a section of a report on the results of an experiment; or perhaps a SQL query against a knowledge base. It may never be exactly the same twice, but it follows general patterns and expectations in formatting and style. In a traditional lab, writing these documents generally falls on the researchers, and amounts to a significant break in the flow of their work. They are no longer thinking about the research problem, but instead thinking about computer code, or getting data into a document, or how to connect to the database.
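As a sketch of what offloading one of these chores might look like, the snippet below asks an LLM to draft a SQL query through the OpenAI Python SDK. The measurements table and its columns are hypothetical, invented purely for this example; any generated query would still need human review before being run.

```python
# A hedged sketch of prompting an LLM for a routine SQL query.
# The "measurements" table schema is a hypothetical example.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = (
    "Write a SQL query against a table measurements(sample_id, "
    "instrument, value, recorded_at) that returns the mean value "
    "per instrument for the last 7 days."
)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # a draft, to be checked by a human
```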

Leaders of R&D organizations would much rather have their scientists, engineers, and researchers focus on doing science, engineering, and research. Research at UC Irvine[1] shows that it can take up to 20 minutes to regain focus on the primary task after a distraction. By leveraging LLM-based tools to generate structured text through conversational prompts, the researcher is more likely to stay focused on the high-level research task. In the same way that regular autocomplete can speed up sending a text message while keeping you focused on what you want to say, these tools can speed up the creation of other types of text while keeping focus on the larger task.

Of course, just as with regular autocomplete, sometimes LLMs will get things wrong, and so they still need a human in the loop.

Beyond LLMs: Transformers

To build AI models that can track this sort of flexible structure, AI researchers developed the concept of attention: parts of the model that are designed to track important information that should influence later output. Many different attention mechanisms have been developed over the past 30 years, but until recently the best results all required comparatively complex underlying neural networks, so-called recurrent neural networks, which feed their current state back in as an input for the next state. These are algorithmically expensive to train, because the computation must be built up sequentially. This also makes it harder to break up the work amongst multiple machines.
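A toy sketch (not any particular production model) makes the sequential bottleneck visible: each hidden state of a recurrent network is computed from the previous one, so the time steps cannot be processed in parallel. All sizes and weights here are illustrative.

```python
# Why recurrent networks train sequentially: the hidden state at step t
# depends on the hidden state at step t-1.
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.normal(size=(8, 4)) * 0.1   # input-to-hidden weights (toy sizes)
W_h = rng.normal(size=(8, 8)) * 0.1   # hidden-to-hidden weights: the recurrence

h = np.zeros(8)                        # the network's running "memory"
for x_t in rng.normal(size=(10, 4)):   # ten time steps of input
    h = np.tanh(W_x @ x_t + W_h @ h)   # step t cannot start until step t-1 is done
```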


The breakthrough that allowed tools like ChatGPT to be built was the idea of transformers: a type of attention mechanism that works holistically on a chunk of a document rather than sequentially, and which can be implemented using simpler, non-recurrent networks. This removed an algorithmic bottleneck in how these models are trained, allowing the work to be parallelized more easily. More training data could be fed into the model while using fewer resources, and deeper context models could be built. This permitted both the broader knowledge displayed by current LLMs (coming from more input data) and the greatly improved ability to carry on conversations with humans (coming from the deeper context).
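For the curious, here is a minimal NumPy sketch of the scaled dot-product self-attention at the core of transformers. Every token attends to every other token in the chunk in a single matrix product, with no step-by-step dependence between positions; sizes are illustrative.

```python
# Scaled dot-product attention: holistic over the whole chunk, fully parallel.
import numpy as np

def attention(Q, K, V):
    # All-pairs similarity between tokens, scaled for numerical stability.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax over each row turns similarities into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output is a relevance-weighted blend of the values

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 16))        # 6 tokens, 16-dim embeddings (toy sizes)
out = attention(tokens, tokens, tokens)  # self-attention in one parallel pass
```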

But again, a lot of R&D data isn't text. You can't, at the moment, give ChatGPT an image or a series of spectra as an input for it to work with without somehow first turning the data into text that ChatGPT understands.

At the level of the AI model, everything is just vectors of numbers, so these transformer-based algorithmic improvements can be applied to other data sources. In computer vision and image processing, for example, better context means better tracking of things over time, better distinction of spatial relationships like "close to" or "to the left", or the ability to distinguish things based on environmental cues. In drug discovery, we are now seeing rapid improvements in the ability to predict biochemical properties of molecules from molecular structure data—the improved ability to track context allows the AI to link active subgroups that work together to produce a particular chemical property even though they may be far apart in the molecular structure description.
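As a hedged sketch of that idea, the snippet below turns a toy image into a sequence of vector "tokens" by flattening patches, roughly the approach popularized by vision transformers, and then runs the same self-attention arithmetic used for text. All sizes and weights are illustrative stand-ins for parameters a real model would learn.

```python
# Non-text data as "just vectors of numbers": image patches become tokens.
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))       # toy grayscale "image"

# Split into a 4x4 grid of 8x8 patches, flatten each patch to a 64-vector.
patches = image.reshape(4, 8, 4, 8).transpose(0, 2, 1, 3).reshape(16, 64)

W = rng.normal(size=(64, 16))           # a learned projection in a real model
tokens = patches @ W                    # 16 image "tokens", same shape of data as text tokens

# From here, the exact same self-attention machinery as for text applies.
scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
context = weights @ tokens              # each patch now "sees" every other patch
```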

Over the next few years, we are likely to see transformative improvements in the analysis of many different types of scientific and engineering data thanks to the advances that led to ChatGPT and its friends.


To learn more, join Enthought AI experts for the webinar What Every R&D Leader Needs to Know about ChatGPT and LLMs for a deeper dive into how these advanced technologies are changing scientific research.
