Discovering Drugs with the Help of Machine Learning

Person in white coat pointing to AI button (Suriphon Singha, Getty Images)

Learn how machine learning is being used in protein drug discovery.

Every cell in the human body contains proteins. These molecules are some of the building blocks of life. The human body is made up of billions of them and they are essential for helping us digest food, move and even think. Proteins help build and repair the body's tissues. They also keep the body working as it should. There are thousands of different proteins in the human body. Each one has one or more specific functions.

Proteins as Drugs

More and more proteins are now used in the development of drugs that treat illness and disease. Finding a new protein for a drug is often like finding a needle in a haystack. Researchers look at thousands of proteins found in nature until they find one that comes close to doing what they want it to do. Then comes the long and difficult process of getting the protein to do what they want it to in the body without any negative effects. All this is hard and takes lots of time and resources. More than half the time the process fails to deliver a protein that works.

The Protein Folding Problem

Proteins are made from long chains of molecular building blocks called amino acids. There are 20 different types of amino acids that make up most proteins. Amino acids can link up in many different ways to form proteins with a variety of shapes and sizes. The order of the amino acids determines the 3-dimensional structure of a protein. The structure determines the protein's function in the body.
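To get a feel for how many different chains are possible, here is a small Python sketch. It assumes that any of the 20 common amino acids can appear at each position in a chain. The function name count_sequences is made up for this illustration.

```python
# Number of common amino acid types mentioned in the text.
AMINO_ACID_TYPES = 20


def count_sequences(chain_length: int) -> int:
    """Count the distinct chains of a given length, assuming any of the
    20 amino acids can appear at each position."""
    if chain_length < 0:
        raise ValueError("chain_length must be non-negative")
    return AMINO_ACID_TYPES ** chain_length


if __name__ == "__main__":
    # Print the number of possible chains for a few example lengths.
    for length in (2, 5, 100):
        print(length, count_sequences(length))
```

Even a chain of only 5 amino acids can be built in more than three million ways, and the count grows by a factor of 20 with every amino acid added.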

Shown is a colour illustration of proteins in dozens of shapes, sizes and colours, on a white background.
3D images of a variety of naturally occurring proteins both within and outside of a cell (Source: Screen grab from the Protein Data Bank https://cdn.rcsb.org/pdb101/molecular-machinery/).
Image - Text Version

Shown is a colour illustration of proteins in dozens of shapes, sizes and colours, on a white background. All the proteins have a grainy texture, but that is the only thing they have in common. Along the left are long, thin structures that look like threads of blue and purple twisted together. Next to it are what look like scattered orange breadcrumbs. In between is a tangle of thick red strands. Below is a pink sphere with a pattern that looks like blue flowers. To the right is something that looks like a hollow grey tube with different coloured strands wrapped around it in clumps. To the right are blue and purple clumps shaped like triangles, squares, rings and snowflakes. In the centre is a thick stripe that looks as if it is checkered in teal and purple. Next to it are two thick, lumpy strands of teal and purple twisted together. To the right is a thick, lumpy red strand, tangled at one end. Along the right edge is what looks like a long brown vine with green lumps along its length. In between are small to medium clumps in different shapes. They range in colour from purple and red to purple and teal.

Determining the order of amino acids coded by human genes used to be a lot of work. But thanks to the Human Genome Project, researchers are now able to do it very quickly. What they cannot do quickly is figure out how the chains of amino acids fold up and carry out a protein's function. This is because there are so many ways a protein can fold.
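The sketch below is a rough, back-of-the-envelope way to see why checking every possible fold one at a time is not practical. All of its numbers are assumptions chosen for illustration, not measured values: it supposes each amino acid in a chain can sit in one of 3 positions, and that a hypothetical computer can check a trillion folds per second. The function names are also made up for this example.

```python
# Illustrative assumptions, not measured values.
SHAPES_PER_AMINO_ACID = 3      # assumed positions available to each amino acid
CHECKS_PER_SECOND = 10 ** 12   # hypothetical computer: a trillion folds per second
SECONDS_PER_YEAR = 60 * 60 * 24 * 365


def count_folds(chain_length: int) -> int:
    """Number of possible folds if every amino acid picks its position freely."""
    return SHAPES_PER_AMINO_ACID ** chain_length


def years_to_check_all(chain_length: int) -> float:
    """Years the hypothetical computer would need to check every fold."""
    return count_folds(chain_length) / CHECKS_PER_SECOND / SECONDS_PER_YEAR


if __name__ == "__main__":
    # Show how quickly the search time grows with chain length.
    for length in (10, 50, 100):
        print(length, count_folds(length), years_to_check_all(length))
```

Under these assumptions, a chain of 100 amino acids would take far longer to check than the age of the universe. This is why researchers look for smarter ways to predict folding instead of testing every option.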

Shown is a colour illustration of proteins during the four steps of folding.
Steps in protein folding (Source: Adapted from an image by AMGEN. Used with permission).
Image - Text Version

Shown is a colour illustration of proteins during the four steps of folding. The title, “Protein Folding” is in bold letters across the top centre. Below are four small illustrations with descriptions. The first illustration shows a thin black strand spaced with brightly coloured dots, like beads in a necklace. The beads are labelled “Amino Acid,” and the strands between are labelled “Peptide Bond.” The description reads, “Primary Structure: The linear sequence of amino acids forms a chain.” In the second illustration, the coloured beads are now spaced out onto two new surfaces. The first is a bright blue ribbon, curled into a spiral. This is labelled “Alpha Helix.” The second is a long, pale blue sheet with an arrow pointing up at the top. This looks like it has been folded, accordion-style. This is labelled “Beta-pleated sheet.” The description below reads, “Secondary Structure: Short segments of the chain form into 3D structures that include alpha helices and beta sheets.” In the third illustration, one bright blue spiral and two pale blue sheets from the previous image are piled on top of each other inside what looks like a fluffy white cloud. Thin black strands join the ends of each sheet to each other, as well as the spiral. The description reads “Tertiary Structure: The whole chain forms its 3D shape when segments fold up next to each other.” In the fourth illustration, two piles of sheets and spirals are next to each other in a larger white cloud. The description reads “Quaternary Structure: Often more than one chain come together to form a final protein structure.”

To predict the ways a protein could fold takes a huge amount of computing power. This is where Artificial Intelligence (AI) and Machine Learning (ML) can help.

Machine Learning and Protein Folding

On July 22, 2021, DeepMind, a part of Google, published research on proteins and ML. ML was used to predict the structures of about 100 000 proteins. The researchers used a system called AlphaFold. AlphaFold uses protein data to learn how to predict protein structures. Even though the model's predictions aren't perfect, they are getting better every day. RoseTTAFold is a similar tool. It was developed by the Institute for Protein Design (IPD) at the University of Washington.

It is important to note that it takes a very large amount of data about proteins to help design new protein drugs. This data is mostly collected from lab tests and clinical studies on patients.

Proteins and Generative Biology

Finding and understanding natural proteins takes a very long time. But what if scientists could figure out a way to design protein drugs faster and with greater success? Or better still, what if they could skip the process of finding a protein in nature and just design one from scratch? This is where generative biology comes in.

Generative biology is about using computers to learn from data to generate new data.

Shown is a colour illustration of the steps involved in two different approaches to protein drug discovery.
Traditional protein drug discovery versus generative biology (Source: Adapted from an image by AMGEN. Used with permission).
Image - Text Version

Shown is a colour illustration of the steps involved in two different approaches to protein drug discovery. The title, “Protein Drug Discovery” is in bold letters across the top. Below, the illustration is divided into two sections. The section on the left is subtitled “The Traditional Approach.” Starting at the top, the first illustration is a haystack labelled “Molecules to test.” A green arrow points from here down to three test tubes filled with red liquid. This is labelled “Wet lab experiments.” Another green arrow points from here down to the final illustration, a single sewing needle. This is labelled “Final drug design.” The section on the right is subtitled “The Generative Approach: Tell a computational model what you want and let it propose designs to test.” Around this is an oval-shaped diagram, with illustrations joined by blue arrows. At the top of the oval is a clipboard and pen. This is labelled “Specifications (wish list).” Arrows lead from here to an illustration that looks like two spiderwebs connected by gears. This is labelled “Computational model.” The next step has an illustration of three sewing needles. This is labelled “Candidate drug designs.” Next are three test tubes of red liquid labelled “Wet lab experiments.” The final illustration is a single sewing needle. This is labelled “Final drug design.” Arrows lead from here back up to the first illustration.

For example, researchers can use data about proteins to train computer models. The more data that is put into these models, the better, faster and more successful these models will be. In the future, the computer models could learn how to make any protein people might want.
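The idea of learning from data to generate new data can be shown with a toy Python sketch. It is not how real generative biology tools work; they use far larger models and real protein data. Here, a very simple model (a first-order Markov chain) learns which letter tends to follow which in a few made-up chains written in one-letter amino acid codes, then writes new chains of its own. The training chains and the function names train and generate are invented for this example.

```python
import random
from collections import defaultdict

# Made-up example chains in one-letter amino acid codes.
# They are NOT real proteins; they only give the model something to learn from.
TRAINING_CHAINS = ["MKTAYIAK", "MKTLYIAR", "MATAYLAK", "MKSAYIGK"]

START, END = "^", "$"  # markers for the beginning and end of a chain


def train(chains):
    """Learn which symbol tends to follow which (a first-order Markov chain)."""
    followers = defaultdict(list)
    for chain in chains:
        symbols = [START, *chain, END]
        for current, nxt in zip(symbols, symbols[1:]):
            followers[current].append(nxt)
    return followers


def generate(followers, rng, max_length=20):
    """Build a new chain by repeatedly sampling a next symbol seen in training."""
    chain = []
    current = START
    while len(chain) < max_length:
        current = rng.choice(followers[current])
        if current == END:
            break
        chain.append(current)
    return "".join(chain)


if __name__ == "__main__":
    model = train(TRAINING_CHAINS)
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for _ in range(3):
        print(generate(model, rng))
```

The new chains reuse patterns from the training data but do not have to match any one training chain exactly. Giving the model more example chains gives it more patterns to draw on.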

Did you know?

The term generative biology comes from the type of computer models that researchers use. These are called generative models.

Making Connections

Predicting protein structure is not all that RoseTTAFold and AlphaFold can do. They can now also model how proteins connect (bind) to each other. Being able to see how proteins bind to each other is a key aspect of drug development.

ML could be used to create specific proteins that would bind to a specific target. This would be much faster than making them from scratch in the lab.

Shown is a colour computer rendering of protein molecules at the membrane of a nerve cell.
Small proteins binding with a large receptor protein in the membrane of a nerve cell (Source: JUAN GAERTNER/SCIENCE PHOTO LIBRARY via Getty Images).
Image - Text Version

Shown is a colour computer rendering of protein molecules at the membrane of a nerve cell. Stretched horizontally across the illustration is a thick layer of tightly packed gold strands with small purple lumps along the top and bottom surfaces. This represents the phospholipid layer of a cell. Above this, small clumps of gold lumps float against a pale blue background. These represent small proteins. Below, two tall piles of purple lumps sit on top of the horizontal surface. These represent larger proteins embedded in the phospholipid layer. Below, more purple clumps float on a dark blue background, close to the bottom of the horizontal surface. These represent free-floating proteins. In the centre of the image, a few gold lumps and a few purple lumps have joined together across the horizontal layer. These represent proteins binding together in the membrane.

The hope is that ML can help figure out which proteins are useful in fewer steps and with fewer surprises. By decreasing research time, drug companies can get treatments to people faster than before.

Let’s Talk Science appreciates the contributions of Natasha Bond from Amgen in the development of this backgrounder.

What Are Proteins and What Is Their Function in the Body? (2019)
This page from Food Facts for Healthy Choices has information about proteins and their function in the body.

Protein Structure and Folding (2018)
Explore protein folding that occurs within levels of protein structure with the Amoeba Sisters!

Khan Academy: Biology Lesson 5 Proteins
This series of resources includes information on amino acids and protein structure.
