A Machine Learning Framework for In Silico Screening of AAV Capsid Libraries Using Protein Language Models

Steindorff, Jaro

Steindorff, Jaro (2025) BINP52 20242
Degree Projects in Bioinformatics

Abstract: Adeno-associated viruses (AAVs) are among the most promising vectors for safe and effective gene delivery in therapeutic applications. However, engineering functional AAV capsids remains a major challenge due to the vast sequence space, unpredictable biological performance, and limited translational success of many variants. Machine learning offers a powerful strategy to address these limitations by learning complex sequence-function relationships from high-throughput experimental data. In particular, recent advances in protein language models provide a biologically informed framework for encoding and interpreting protein sequences. This thesis investigates the integration of machine learning and protein language models to enable in silico screening of engineered AAV capsids. Central to this work is the comprehensive analysis of sequencing data from rationally designed and semi-random AAV libraries, supported by detailed data quality assessments to ensure a high-confidence dataset suitable for machine learning.
After curating this dataset, we explored a range of sequence representations and model architectures to evaluate which machine-learning approaches were best suited to predicting AAV capsid functionality. Fine-tuning a protein language model on peptide insertion sequences, combined with a broader contextual window from the AAV capsid protein, yielded a binary predictive model capable of accurately classifying AAV variants based on packaging efficiency and infectivity using only primary amino acid sequences. This enabled in silico screening of novel capsid libraries, reducing the experimental burden and increasing the likelihood of identifying functional variants. The framework developed in this work offers a scalable, data-driven approach to AAV capsid engineering and demonstrates the utility of protein language models for classifying variant functionality directly from primary amino acid sequences.
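
As a rough illustration of the "peptide insertion plus contextual window" input described above, the following minimal Python sketch builds a sequence window around an inserted peptide. It is not the thesis implementation: the abstract does not state the insertion position, the window size, or the language model used, so the function name `build_context_window`, its parameters, and the toy sequences are all hypothetical placeholders.

```python
def build_context_window(capsid: str, insert_pos: int, peptide: str, flank: int) -> str:
    """Insert `peptide` into `capsid` before index `insert_pos` and return the
    peptide together with up to `flank` capsid residues on each side.

    All parameters are illustrative assumptions; the abstract does not
    specify the actual insertion site or window length.
    """
    if not 0 <= insert_pos <= len(capsid):
        raise ValueError("insert_pos must lie within the capsid sequence")
    if flank < 0:
        raise ValueError("flank must be non-negative")
    # Residues immediately upstream of the insertion site (clipped at the start).
    left = capsid[max(0, insert_pos - flank):insert_pos]
    # Residues immediately downstream of the insertion site (clipped at the end).
    right = capsid[insert_pos:insert_pos + flank]
    return left + peptide + right


# Toy placeholder sequences, not real AAV capsid or library data.
TOY_CAPSID = "ACDEFGHIKLMNPQRSTVWY"
TOY_PEPTIDE = "GGSGG"

window = build_context_window(TOY_CAPSID, insert_pos=10, peptide=TOY_PEPTIDE, flank=4)
print(window)
```

A window built this way could then be tokenized and passed to whichever protein language model is being fine-tuned for the binary functional/non-functional label. Choosing the flank size trades off added structural context against a longer input.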
Popular Abstract: Using Machine Learning to Discover Better Viral Carriers for Gene Therapy

Gene therapy is seen as a promising way to treat genetic diseases, with adeno-associated viruses (AAVs) often used as carriers to deliver therapeutic genes into cells. Designing AAVs by modifying the protein that forms their outer shell, called the capsid, can be challenging, as even small changes can drastically affect their performance. To help address this problem, machine learning was combined with protein language models to study how these small changes in the protein sequence relate to AAV capsid function. Protein language models are tools that analyze protein sequences in a way similar to how language models like ChatGPT interpret human language.

AAVs are small viruses widely used in gene therapy for their safe and effective delivery of genetic material across various tissues. However, naturally occurring (or "wild-type") AAVs are not always ideal: they can be blocked by the immune system, may not reach the right tissues, and sometimes fail to deliver their therapeutic payload. To address these limitations, researchers are now designing modified versions of AAVs, by altering the capsid, to create variants with better targeting, stronger delivery, and greater resistance to immune responses. However, only a small fraction of the possible designs work well, and testing millions of variants in the lab is time-consuming and costly. This is where machine learning becomes especially valuable. By analyzing large experimental datasets, machine learning can learn which sequences tend to succeed or fail. In particular, protein language models provide a way to translate protein sequences into a format that machine learning models can learn from. Together, these models help capture the hidden "rules" of protein function, making it possible to predict AAV variants that are more likely to work before they are ever built in the lab.

This project has two main goals. The first is to develop a data processing pipeline that transforms raw experimental data, generated from AAV capsid variants and the tissues they target, into a structured format that both highlights variant performance and is ready for machine learning applications. The second is to explore a range of machine learning models and sequence representation techniques to determine which approaches best capture the relationship between AAV protein sequences and their effectiveness. We aimed to predict AAV capsid performance directly from the protein sequence, enabling computer-based screening that identifies promising variants before lab testing. This approach could help streamline early development by reducing the need for extensive experimental work.

Our results suggest that protein language models can offer useful improvements for predicting AAV capsid functionality based on sequence data, performing slightly better than traditional methods in most cases. Fine-tuning these models on specific AAV data led to gains in predictive accuracy. While not a complete solution, this approach shows potential for helping prioritize variants for further testing and for using machine learning to reduce some of the experimental workload in AAV capsid development.

Master’s Degree Project in Bioinformatics 60 credits 2025
Department of Biology, Lund University

Advisor: Dr. Patrick Aldrin-Kirk
rAAVen Therapeutics

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9212655

author: Steindorff, Jaro
supervisor: Patrick Aldrin-Kirk (LU)
organization: Degree Projects in Bioinformatics
course: BINP52 20242
year: 2025
type: H2 - Master's Degree (Two Years)
subject: Biology and Life Sciences
language: English
id: 9212655
date added to LUP: 2025-09-18 08:56:01
date last changed: 2025-09-18 08:56:01

@misc{9212655,
  abstract     = {{Adeno-associated viruses (AAVs) are among the most promising vectors for safe and effective gene delivery in therapeutic applications. However, engineering functional AAV capsids remains a major challenge due to the vast sequence space, unpredictable biological performance, and limited translational success of many variants. Machine learning offers a powerful strategy to address these limitations by learning complex sequence-function relationships from high-throughput experimental data. In particular, recent advances in protein language models provide a biologically informed framework for encoding and interpreting protein sequences. This thesis investigates the integration of machine learning and protein language models to enable in silico screening of engineered AAV capsids. Central to this work is the comprehensive analysis of sequencing data from rationally designed and semi-random AAV libraries, supported by detailed data quality assessments to ensure a high-confidence dataset suitable for machine learning. After curating this dataset, we explored a range of sequence representations and model architectures to evaluate which machine-learning approaches were best suited to predicting AAV capsid functionality. Fine-tuning a protein language model on peptide insertion sequences, combined with a broader contextual window from the AAV capsid protein, yielded a binary predictive model capable of accurately classifying AAV variants based on packaging efficiency and infectivity using only primary amino acid sequences. This enabled in silico screening of novel capsid libraries, reducing the experimental burden and increasing the likelihood of identifying functional variants. The framework developed in this work offers a scalable, data-driven approach to AAV capsid engineering and demonstrates the utility of protein language models for classifying variant functionality directly from primary amino acid sequences.}},
  author       = {{Steindorff, Jaro}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{A Machine Learning Framework for In Silico Screening of AAV Capsid Libraries Using Protein Language Models}},
  year         = {{2025}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
