
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Extracting protein flexibility features from a pre-trained protein language model

Karlin, Vera LU (2024) KEMR30 20232
Department of Chemistry
Abstract
The deep learning revolution has contributed to a great leap in protein structure prediction, but predicting the multiple conformational states that proteins can adopt remains an open problem. This thesis approaches the problem through MSA Transformer, a protein language model pre-trained on multiple sequence alignments that has demonstrated the ability to learn protein properties such as contacts and secondary structure. The aim was to investigate whether MSA Transformer learned to represent protein flexibility features during pre-training and whether that information can be extracted. This mainly took the form of training neural networks on the outputs of the transformer model to predict the local RMSD flexibility metric of flexible proteins in the PDBFlex database. None of the attempted networks succeeded in classifying flexibility, most likely because of problems with the chosen architectures, evaluation metric, and dataset. Much more promising was the discovery of patterns in the attention maps of MSA Transformer that correlated with the local RMSD metric. These findings were extended by the identification of certain attention heads that seemingly correlate with specific types of flexibility, but further investigation is required to draw any conclusions. Prior studies have shown that multiple sequence alignments can be clustered and modified in order to sample multiple conformations with AlphaFold2. While not attempted in this project, the groundwork has been laid for expanding on these approaches by using MSA Transformer features to select residues and sequences of specific significance.
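
As a rough illustration of the attention-map analysis described above, the sketch below loads the pre-trained MSA Transformer through the fair-esm package, extracts its row-attention maps, and checks how a simple per-residue attention summary correlates with a local RMSD profile. The toy MSA, the choice of summary statistic, and the random local RMSD values are illustrative placeholders, not the thesis's actual data or method.

# A minimal sketch: extract MSA Transformer row attentions and correlate
# a per-residue attention summary with a (placeholder) local RMSD profile.
import torch
import esm
import numpy as np
from scipy.stats import spearmanr

# Load the pre-trained MSA Transformer (12 layers, 12 heads per layer).
model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Toy MSA; in practice this would be read from an alignment file for a
# PDBFlex entry. All sequences in one MSA must have the same length.
msa = [
    ("query", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
    ("hom1",  "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
    ("hom2",  "MKTAYIAKQRQISFVKSHFARQLEERLGLIEVQ"),
]
_, _, tokens = batch_converter([msa])

with torch.no_grad():
    out = model(tokens, need_head_weights=True)

# Row attentions: (batch, layers, heads, L+1, L+1); position 0 is the
# prepended beginning-of-sequence token, which we drop.
attn = out["row_attentions"][0, :, :, 1:, 1:]  # (12, 12, L, L)

# Hypothetical per-residue summary: how much attention each residue
# receives, averaged over layers, heads, and query positions.
received = attn.mean(dim=(0, 1, 2)).numpy()  # (L,)

# Placeholder flexibility profile (one local RMSD value per residue);
# in the thesis this would come from PDBFlex.
local_rmsd = np.random.rand(received.shape[0])

rho, p = spearmanr(received, local_rmsd)
print(f"Spearman rho = {rho:.3f} (p = {p:.3g})")
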
Popular Abstract (translated from Swedish)
AI takes us one step closer to solving the riddle of flexible proteins

Proteins are the building blocks of the microscopic machinery that carries out virtually all biological processes in nature. When we want to develop drugs to fix problems that can arise in this machinery, or to fight invading pathogens, an understanding of the 3-dimensional structure of proteins is invaluable. Visualizing protein structures can be a long and troublesome process, so attempts have been made to predict structures from their amino acid sequences using artificial intelligence. In recent years, the development of AI models for exactly this purpose has flourished, but they still cannot predict the flexibility that allows many proteins to switch structure in order to carry out their functions.

The project approached protein flexibility using the language model MSA Transformer, a neural network with an architecture similar to that of the popular language model ChatGPT, but trained on protein sequences instead of text. Like text-based language models, MSA Transformer has learned properties it was not directly trained for, such as 3-dimensional structure. The goal of the project was to investigate whether the model has learned to detect flexibility in proteins and to see whether that ability can be put to use. This primarily meant training neural networks on outputs from MSA Transformer with the goal of predicting protein flexibility. A range of different architectures was used, but none of them produced successful results, as sketched in outline below.
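
To make the setup concrete, here is a minimal sketch of this kind of training, assuming per-residue MSA Transformer embeddings have already been computed. The FlexibilityHead module, the two-layer MLP design, and the random tensors are hypothetical illustrations, not the architectures actually tried in the thesis.

# A minimal sketch: train a small per-residue regression head on
# pre-computed MSA Transformer embeddings to predict a flexibility
# value (e.g. local RMSD) for each residue.
import torch
import torch.nn as nn

class FlexibilityHead(nn.Module):
    """Maps a 768-dim per-residue embedding to one flexibility value."""
    def __init__(self, embed_dim: int = 768, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # (batch, L)

# Placeholder data: per-residue embeddings of the query row of each MSA
# (batch, L, 768) and per-residue flexibility targets (batch, L).
embeddings = torch.randn(8, 100, 768)
targets = torch.rand(8, 100)

head = FlexibilityHead()
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(10):
    opt.zero_grad()
    loss = loss_fn(head(embeddings), targets)
    loss.backward()
    opt.step()
print(f"final training loss: {loss.item():.4f}")
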

Preparations were also made for using MSA Transformer together with the structure model AlphaFold 2 to predict the possible structures that a protein can adopt. When the activations from MSA Transformer were examined, a surprising correlation with protein flexibility was discovered. Further investigation revealed more relationships involving specific activations that could be used to predict flexibility and possible structures in future projects. Activations in neural networks are, however, notorious for being hard to interpret, so more work is needed before concrete conclusions can be drawn, but the findings are potentially of great significance.
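
The MSA-clustering idea referenced in the abstract can be illustrated in its simplest form: aligned sequences are encoded numerically and grouped into subfamilies, and each resulting sub-alignment could then be folded separately with AlphaFold 2 to sample different conformations. In the sketch below, the sequences are placeholders, and one-hot encoding plus DBSCAN is just one simple choice of featurization and clustering algorithm, not necessarily the method of the prior studies cited above.

# A minimal sketch: cluster the sequences of an MSA into subfamilies
# that could each be passed to AlphaFold 2 as a separate alignment.
import numpy as np
from sklearn.cluster import DBSCAN

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

def one_hot(seq: str) -> np.ndarray:
    """Flattened one-hot encoding of an aligned sequence."""
    idx = {aa: i for i, aa in enumerate(ALPHABET)}
    mat = np.zeros((len(seq), len(ALPHABET)))
    for pos, aa in enumerate(seq):
        mat[pos, idx.get(aa, len(ALPHABET) - 1)] = 1.0
    return mat.ravel()

# Placeholder aligned sequences (all the same length, gaps as '-').
msa_seqs = [
    "MKTAYIAKQR-ISFVKSHFSRQ",
    "MKTAYIAKQR-ISFVKSHFARQ",
    "MQTAYLAKQRAISFVKSHFSRQ",
    "MQTAYLAKQRAISFVKSHFSKQ",
]
X = np.stack([one_hot(s) for s in msa_seqs])

# Cluster sequences; each non-noise cluster becomes a sub-MSA.
labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(X)
for cluster in sorted(set(labels)):
    members = [s for s, l in zip(msa_seqs, labels) if l == cluster]
    print(f"cluster {cluster}: {len(members)} sequences")
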
author: Karlin, Vera LU
course: KEMR30 20232
year: 2024
type: H2 - Master's Degree (Two Years)
keywords: biochemistry, protein language model, MSA Transformer
language: English
id: 9146229
date added to LUP: 2024-05-13 08:40:02
date last changed: 2024-05-13 08:40:02
@misc{9146229,
  abstract     = {{The deep learning revolution has contributed to a great leap in protein structure prediction, but predicting the multiple conformational states that proteins can adopt remains an open problem. This thesis approaches the problem through MSA Transformer, a protein language model pre-trained on multiple sequence alignments that has demonstrated the ability to learn protein properties such as contacts and secondary structure. The aim was to investigate whether MSA Transformer learned to represent protein flexibility features during pre-training and whether that information can be extracted. This mainly took the form of training neural networks on the outputs of the transformer model to predict the local RMSD flexibility metric of flexible proteins in the PDBFlex database. None of the attempted networks succeeded in classifying flexibility, most likely because of problems with the chosen architectures, evaluation metric, and dataset. Much more promising was the discovery of patterns in the attention maps of MSA Transformer that correlated with the local RMSD metric. These findings were extended by the identification of certain attention heads that seemingly correlate with specific types of flexibility, but further investigation is required to draw any conclusions. Prior studies have shown that multiple sequence alignments can be clustered and modified in order to sample multiple conformations with AlphaFold2. While not attempted in this project, the groundwork has been laid for expanding on these approaches by using MSA Transformer features to select residues and sequences of specific significance.}},
  author       = {{Karlin, Vera}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Extracting protein flexibility features from a pre-trained protein language model}},
  year         = {{2024}},
}