AIhub.org
 

Viruses are doing mysterious things everywhere – AI can help researchers understand what they’re up to in the oceans and in your gut


24 June 2024




By Libusha Kelly, Albert Einstein College of Medicine

Viruses are a mysterious and poorly understood force in microbial ecosystems. Researchers know they can infect, kill and manipulate human and bacterial cells in nearly every environment, from the oceans to your gut. But scientists don’t yet have a full picture of how viruses affect their surrounding environments, in large part because of viruses’ extraordinary diversity and ability to evolve rapidly.

Communities of microbes are difficult to study in a laboratory setting. Many microbes are challenging to cultivate, and their natural environment has many more features influencing their success or failure than scientists can replicate in a lab.

So systems biologists like me often sequence all the DNA present in a sample – for example, a fecal sample from a patient – separate out the viral DNA sequences, then annotate the sections of the viral genome that code for proteins. These notes on the location, structure and other features of genes help researchers understand the functions viruses might carry out in the environment and help identify different kinds of viruses. Researchers annotate viruses by matching viral sequences in a sample to previously annotated sequences available in public databases of viral genetic sequences.
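The database-matching step described above can be sketched in a few lines. This is only a toy illustration, not the pipeline used in the study: it compares sequences by shared three-letter fragments (k-mers), whereas real annotation workflows typically rely on dedicated alignment or profile-search tools. The sequences, labels, function names and the 0.3 similarity cutoff are all invented for the example.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-letter fragments in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def annotate(query, reference, k=3, min_similarity=0.3):
    """Copy the label of the most similar reference sequence.

    Returns ("unknown", score) when no reference sequence is similar enough,
    which is the situation the article describes for most new viral genes.
    """
    query_kmers = kmers(query, k)
    best_label, best_score = "unknown", 0.0
    for seq, label in reference.items():
        score = jaccard(query_kmers, kmers(seq, k))
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_similarity:
        return "unknown", best_score
    return best_label, best_score


# Hypothetical, made-up sequences and labels; not real viral proteins.
REFERENCE = {
    "MKTAYIAKQRQISFVKSHFSRQ": "capsid",
    "MSLLTEVETPIRNEWGCRCNDS": "integrase",
}

# One letter differs from the first reference entry, so its label is reused.
print(annotate("MKTAYIAKQRQISFVKAHFSRQ", REFERENCE))
# Shares no fragments with any reference entry, so it stays unannotated.
print(annotate("GGGGGGGGGGGGGGGGGGGGGG", REFERENCE))
```

The second query shows the weakness of this kind of matching: a sequence that looks like nothing in the database simply goes unannotated.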

However, scientists are identifying viral sequences in DNA collected from the environment at a rate that far outpaces our ability to annotate those genes. This means researchers are publishing findings about viruses in microbial ecosystems using unacceptably small fractions of available data.

To improve researchers’ ability to study viruses around the globe, my team and I have developed a novel approach to annotate viral sequences using artificial intelligence. Using protein language models, which are akin to large language models like ChatGPT but specific to proteins, we were able to classify previously unseen viral sequences. This opens the door for researchers not only to learn more about viruses, but also to address biological questions that are difficult to answer with current techniques.

Annotating viruses with AI

Large language models use relationships between words in large datasets of text to provide potential answers to questions they are not explicitly “taught” the answer to. When you ask a chatbot “What is the capital of France?” for example, the model is not looking up the answer in a table of capital cities. Rather, it is using its training on huge datasets of documents and information to infer the answer: “The capital of France is Paris.”

Similarly, protein language models are AI algorithms that are trained to recognize relationships between billions of protein sequences from environments around the world. Through this training, they may be able to infer something about the essence of viral proteins and their functions.

We wondered whether protein language models could answer this question: “Given all annotated viral genetic sequences, what is this new sequence’s function?”

In our proof of concept, we started from pre-trained protein language models, trained neural networks on previously annotated viral protein sequences, and then used those networks to predict the annotation of new viral protein sequences. Our approach allows us to probe what the model is “seeing” in a particular viral sequence that leads to a particular annotation. This helps identify candidate proteins of interest based either on their specific functions or on how their genome is arranged, winnowing down the search space of vast datasets.
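The general idea of training a classifier on top of fixed sequence representations can be sketched as follows. This is a minimal sketch under heavy assumptions, not the method from the study: the `embed` function is a hypothetical stand-in that counts amino acid frequencies instead of calling a real protein language model, the classifier is a single softmax layer rather than a deeper neural network, and the training sequences and labels are invented.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(seq):
    """Hypothetical stand-in for a protein language model embedding:
    the fraction of each of the 20 standard amino acids in the sequence."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def softmax(scores):
    """Turn raw scores into probabilities, shifted for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, bias, x):
    """Linear score for each class given one embedding vector."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def train(examples, labels, epochs=200, lr=0.5):
    """Fit one linear layer plus softmax by stochastic gradient descent."""
    classes = sorted(set(labels))
    dim = len(examples[0])
    weights = [[0.0] * dim for _ in classes]
    bias = [0.0] * len(classes)
    for _ in range(epochs):
        for x, label in zip(examples, labels):
            probs = softmax(class_scores(weights, bias, x))
            for c, name in enumerate(classes):
                # Gradient of cross-entropy loss with respect to the score.
                grad = probs[c] - (1.0 if name == label else 0.0)
                bias[c] -= lr * grad
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
    return classes, weights, bias


def predict(model, seq):
    """Return the most probable label and its probability for a sequence."""
    classes, weights, bias = model
    probs = softmax(class_scores(weights, bias, embed(seq)))
    best = max(range(len(classes)), key=probs.__getitem__)
    return classes[best], probs[best]


# Hypothetical, made-up training data; not real viral proteins.
TRAIN = [
    ("AKAKAAKKAL", "capsid"),
    ("KAAKAKLAKA", "capsid"),
    ("DWDDWGWDDW", "integrase"),
    ("WDWDGDDWWD", "integrase"),
]

model = train([embed(s) for s, _ in TRAIN], [label for _, label in TRAIN])
print(predict(model, "AKKAAKALKA"))  # an unseen sequence resembling the capsid examples
```

In a realistic setting, `embed` would be replaced by the output of a pre-trained protein language model, which is what lets the classifier pick up on similarities that simple letter counts or direct sequence matching would miss.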

[Image: microscopy image of spherical bacteria colored bright green.] Prochlorococcus is one of the many species of marine bacteria with proteins that researchers haven’t seen before. Anne Thompson/Chisholm Lab, MIT via Flickr

By identifying more distantly related viral gene functions, protein language models can complement current methods to provide new insights into microbiology. For example, my team and I were able to use our model to discover a previously unrecognized integrase – a type of protein that can move genetic information in and out of cells – in the globally abundant marine picocyanobacteria Prochlorococcus and Synechococcus. Notably, this integrase may be able to move genes in and out of these populations of bacteria in the oceans and enable these microbes to better adapt to changing environments.

Our language model also identified a novel viral capsid protein that is widespread in the global oceans. We produced the first picture of how its genes are arranged, showing that it can contain different sets of genes, which we believe indicates that this virus serves different functions in its environment.

These preliminary findings represent only two of thousands of annotations our approach has provided.

Analyzing the unknown

Most of the hundreds of thousands of newly discovered viruses remain unclassified. Many viral genetic sequences match protein families with no known function or have never been seen before. Our work shows that similar protein language models could help study the threat and promise of our planet’s many uncharacterized viruses.

While our study focused on viruses in the global oceans, improved annotation of viral proteins is critical for better understanding the role viruses play in health and disease in the human body. We and other researchers have hypothesized that viral activity in the human gut microbiome might be altered when you’re sick. This means that viruses may help identify stress in microbial communities.

However, our approach is limited because it requires high-quality annotations to train on. To make protein language models more powerful, researchers are developing newer versions that incorporate other “tasks” as part of their training, particularly predicting protein structures to detect similar proteins.

Making all AI tools available via FAIR Data Principles – data that is findable, accessible, interoperable and reusable – can help researchers at large realize the potential of these new ways of annotating protein sequences, leading to discoveries that benefit human health.

Libusha Kelly, Associate Professor of Systems and Computational Biology, Microbiology and Immunology, Albert Einstein College of Medicine

This article is republished from The Conversation under a Creative Commons license. Read the original article.




The Conversation is an independent source of news and views, sourced from the academic and research community and delivered direct to the public.




©2024 - Association for the Understanding of Artificial Intelligence


 











