Sequence Correlations in HP Model Proteins

Theodoridis, Orestes LU (2018) FYTK02 20181
Computational Biology and Biological Physics
Department of Astronomy and Theoretical Physics
Abstract
Amino acids that are in close contact in a protein structure tend to co-evolve,
which gives rise to sequence correlations. Direct coupling analysis
(DCA) is a method for predicting such contacts directly from sequence correlations,
without assuming any prior knowledge of structures. To this end, sequence
correlations are modeled using an Ising-like ansatz, whose couplings are determined
through an inverse statistical-mechanical calculation. In this work, the problem
of predicting contacts from sequence correlations is investigated in a minimal
lattice-based protein model with only two amino acid types, hydrophobic (H) and
polar (P). A structure for chain length 30 is considered, which is known from
previous work to represent the minimum-energy state of 813 distinct HP sequences.
Raw sequence correlations (covariances and Pearson correlations) are analyzed,
and a DCA procedure is implemented. It turns out that the five largest couplings
from the DCA calculation correspond to nearest-neighbor contacts in the known
structure. Unfortunately, these five couplings are not well separated from the other ones. On the other hand, knowledge of this limited set of contacts
is essentially sufficient to infer the entire structure of these HP sequences.
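
The pipeline summarized in the abstract (raw correlations, then couplings from an inverse calculation, then a ranking of candidate contacts) can be illustrated with a minimal sketch. The abstract does not say which inversion scheme was used, so the mean-field approximation below, J = -C^{-1}, is an assumption chosen for brevity and not necessarily the procedure of the thesis. It corresponds to an Ising-like model P(s) ∝ exp(Σ_i h_i s_i + Σ_{i<j} J_ij s_i s_j). The encoding H = +1, P = -1, the `ridge` regularizer, the `min_sep` cutoff and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def encode(seqs):
    """Map HP strings to an (M, N) array of spins: H -> +1, P -> -1 (assumed encoding)."""
    return np.array([[1.0 if c == "H" else -1.0 for c in s] for s in seqs])

def correlations(spins):
    """Return the covariance and Pearson correlation matrices over sequence positions."""
    mean = spins.mean(axis=0)
    cov = spins.T @ spins / spins.shape[0] - np.outer(mean, mean)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    denom = np.outer(std, std)
    # Fully conserved positions have zero variance; report 0 there instead of dividing.
    pearson = np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)
    return cov, pearson

def mean_field_couplings(cov, ridge=0.1):
    """Mean-field inverse Ising estimate J = -(C + ridge*I)^-1 with the diagonal zeroed.

    The ridge term is an assumed regularizer that keeps the matrix invertible.
    """
    J = -np.linalg.inv(cov + ridge * np.eye(cov.shape[0]))
    np.fill_diagonal(J, 0.0)
    return J

def rank_pairs(J, min_sep=3):
    """Pairs (i, j) with j - i >= min_sep, sorted by decreasing |J_ij|."""
    n = J.shape[0]
    pairs = [(i, j) for i in range(n) for j in range(i + min_sep, n)]
    return sorted(pairs, key=lambda p: -abs(J[p]))

# Toy data (not the 813 sequences of the thesis): six positions of random
# spins, with position 3 forced to copy position 0 so the two co-vary.
rng = np.random.default_rng(0)
toy = rng.choice([-1.0, 1.0], size=(500, 6))
toy[:, 3] = toy[:, 0]

cov, pearson = correlations(toy)
J = mean_field_couplings(cov)
top = rank_pairs(J)[0]
print("strongest predicted pair:", top)
```

On a square lattice, two residues can only be in contact if they are an odd number of positions apart along the chain, so the first non-bonded contacts occur at separation 3. This motivates the default `min_sep=3`. Real sequences would be passed through `encode` before calling `correlations`.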
Popular Abstract
Among the vast number of challenges facing science today, predicting protein structure remains a notoriously difficult problem. Since the discovery of DNA and the microcosmos of living organisms, a long-standing wish of scientists has been to decode the stuff that makes up life. Today the Human Genome Project is complete, and researchers from different disciplines have turned their attention to proteins. Why proteins? Besides being nutrients in food, proteins are large molecules that reside in cells. Each cell can have up to 100 000 different types of proteins, each of which plays a unique and important role. The proteins are much like the workers and engineers of the society that constitutes a cell. Furthermore, proteins play an important role in a variety of diseases, such as diabetes and Alzheimer's disease.

The cell in itself is highly complex, but the way a protein functions can sometimes be explained by one simple property: the fold. Proteins are chains of (on average) 300 building blocks called amino acids. Depending on the sequence of amino acids, many proteins have a preferred fold (structure) that they tend to curl up into. Adopting this fold is necessary for the protein to be able to carry out its function. Each structure has a family of sequences that prefer that specific fold. One astonishing feature of proteins is that their preferred shape is reached in a very short time, as if the protein already knew how to fold. This property has led scientists to believe that the specific structure of a given protein must already be encoded in its amino acid sequence.

Today the Worldwide Protein Data Bank (PDB) continuously releases new structures, together with the protein sequences belonging to them, and experimental methods are used to acquire knowledge of both the structure and the amino acid sequence. These efforts have led to a point where much more is known about protein sequences than about their structures. Because of this asymmetry, a question scientists ask is: given the amino acid sequence of a protein, is there a way to predict its fold?

Numerous attempts to solve this problem have been made. One group of methods that has gained popularity in recent years, with promising results, uses experimental sequence data to find patterns among the different proteins that prefer a specific fold. These methods apply pattern recognition algorithms and are termed Direct Coupling Analysis (DCA), since their aim is to extract the information that predicts direct contacts between amino acids.

DCA may be applied to real protein sequences, but can also be investigated on simpler protein models, as is done in my thesis. By using DCA to analyze simple protein models we can learn how accurate and consistent the method is. Reducing the problem to two dimensions instead of three and using only two types of amino acids (instead of the 20 in real proteins) greatly reduces the computational cost and the number of parameters involved. This means that the method can be investigated in a controlled manner.

If protein structure can be predicted theoretically, this will play a prominent role in the future of medicine and biology. Besides predicting protein structure, the results from DCA can be used to gain insight into why failed folds occur.
While it is true that a given protein prefers to fold into a specific structure, the process is prone to errors, meaning that proteins sometimes fail to find their preferred fold. It has been found that "misfolds" are important in some diseases, e.g. Alzheimer's disease.
author: Theodoridis, Orestes LU
course: FYTK02 20181
year: 2018
type: M2 - Bachelor Degree
language: English
id: 8944652
date added to LUP: 2018-06-08 08:49:36
date last changed: 2018-06-12 11:12:25
@misc{8944652,
  abstract     = {Amino acids that are in close contact in a protein structure tend to co-evolve,
which gives rise to sequence correlations. Direct coupling analysis
(DCA) is a method for predicting such contacts directly from sequence correlations,
without assuming any prior knowledge of structures. To this end, sequence
correlations are modeled using an Ising-like ansatz, whose couplings are determined
through an inverse statistical-mechanical calculation. In this work, the problem
of predicting contacts from sequence correlations is investigated in a minimal
lattice-based protein model with only two amino acid types, hydrophobic (H) and
polar (P). A structure for chain length 30 is considered, which is known from
previous work to represent the minimum-energy state of 813 distinct HP sequences.
Raw sequence correlations (covariances and Pearson correlations) are analyzed,
and a DCA procedure is implemented. It turns out that the five largest couplings
from the DCA calculation correspond to nearest-neighbor contacts in the known
structure. Unfortunately, these five couplings are not well separated from the other ones. On the other hand, knowledge of this limited set of contacts
is essentially sufficient to infer the entire structure of these HP sequences.},
  author       = {Theodoridis, Orestes},
  language     = {eng},
  note         = {Student Paper},
  title        = {Sequence Correlations in HP Model Proteins},
  year         = {2018},
}