'Genome order index' should not be used for defining compositional constraints in nucleotide sequences--a case study of the Z-curve

Elhaik, Eran; Graur, Dan; Josić, Kresimir

Elhaik, Eran; Graur, Dan and Josić, Kresimir (2010) In Biology Direct 5.

Abstract

BACKGROUND: The Z-curve is a three-dimensional representation of DNA sequences that was proposed over a decade ago and has been extensively applied to sequence segmentation, horizontal gene transfer detection, and sequence analysis. Based on the Z-curve, a "genome order index" was proposed, defined as $S = a^2 + c^2 + t^2 + g^2$, where a, c, t, and g are the frequencies of the nucleotides A, C, T, and G, respectively. This index was found to be smaller than 1/3 for almost all tested genomes, which was taken as support for the existence of a constraint on genome composition. A geometric explanation for this constraint has been suggested: each genome was represented by a point P whose distances from the four faces of a regular tetrahedron were given by the frequencies a, c, t, and g. It was claimed that an inscribed sphere of radius $r = 1/\sqrt{3}$ contains almost all points corresponding to various genomes, implying that $S < r^2$. The distribution of the points P obtained by S was studied using the Z-curve.
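As an illustration of the definition above, the index S can be computed directly from the nucleotide frequencies of a sequence. The following is a minimal sketch, not code from the paper; the function name `genome_order_index` and the choice to ignore symbols other than A, C, G, and T are assumptions made here for the example.

```python
from collections import Counter


def genome_order_index(sequence: str) -> float:
    """Return S = a^2 + c^2 + g^2 + t^2 for a DNA string.

    Assumption for this sketch: symbols other than A, C, G, T
    (for example N) are ignored when computing frequencies.
    """
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    # Sum of squared relative frequencies of the bases that occur.
    return sum((n / total) ** 2 for n in counts.values())
```

For a sequence with equal base frequencies, S takes its minimum value of 1/4; for a sequence made of a single base, S equals 1.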

RESULTS: In this work, we studied the basic properties of the Z-curve using the "genome order index" as a case study. We show that (1) the calculation of the radius of the inscribed sphere of a regular tetrahedron is incorrect, (2) the S index is narrowly distributed, (3) based on the second parity rule, the S index can be derived directly from the Shannon entropy and is therefore redundant, and (4) the Z-curve suffers from overdimensionality, and the dimension that stands for GC content alone suffices to represent any given genome.
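One way to see why points (3) and (4) are plausible is to assume that the second parity rule holds exactly, so that a = t and c = g within a strand. Under that assumption, the four frequencies are fixed by the GC content alone, and both S and the Shannon entropy H become functions of that single number. The sketch below is an illustration under this stated assumption, not the authors' derivation; the function names are hypothetical.

```python
import math


def s_from_gc(gc: float) -> float:
    """S under an exact second parity rule: c = g = gc/2, a = t = (1 - gc)/2."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be a fraction between 0 and 1")
    half_gc, half_at = gc / 2, (1 - gc) / 2
    return 2 * half_gc**2 + 2 * half_at**2


def shannon_h_from_gc(gc: float) -> float:
    """Shannon entropy (in bits) of the same four frequencies."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be a fraction between 0 and 1")
    freqs = [gc / 2, gc / 2, (1 - gc) / 2, (1 - gc) / 2]
    # Terms with zero frequency contribute nothing to the entropy.
    return -sum(p * math.log2(p) for p in freqs if p > 0)
```

With these assumptions, S simplifies to (gc^2 + (1 - gc)^2) / 2, so it ranges from 1/4 at 50% GC to 1/2 at 0% or 100% GC, while H ranges from 2 bits down to 1 bit over the same span.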

CONCLUSION: The "genome order index" S does not represent a constraint on nucleotide composition. Moreover, S can be easily computed from the Gini-Simpson index and derived directly from entropy, and is therefore redundant. Overall, the Z-curve and S are overcomplicated measures of GC content and the Shannon H index, respectively.
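The link to the Gini-Simpson index follows from its standard definition as one minus the sum of squared frequencies, which makes S its complement. A minimal sketch, with hypothetical function names and made-up example frequencies in the usage note:

```python
from typing import Sequence


def gini_simpson(freqs: Sequence[float]) -> float:
    """Gini-Simpson index: 1 minus the sum of squared frequencies."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


def s_from_gini_simpson(gs: float) -> float:
    """Recover S = a^2 + c^2 + g^2 + t^2 from the Gini-Simpson index."""
    return 1.0 - gs
```

For example, with illustrative frequencies a = t = 0.3 and c = g = 0.2, `gini_simpson([0.3, 0.2, 0.2, 0.3])` is 0.74 up to floating-point rounding, and `s_from_gini_simpson` maps that back to S = 0.26.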

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/5409bf53-0354-47d6-b0ac-a8780ccf2413

author

Elhaik, Eran; Graur, Dan and Josić, Kresimir

publishing date

2010-02-17

type

Contribution to journal

publication status

published

keywords

Base Composition/genetics, Base Sequence/genetics, Computer Simulation, Genome, Bacterial/genetics, Models, Genetic

in

Biology Direct

volume

5

article number

10

pages

7 pages

publisher

BioMed Central (BMC)

external identifiers

pmid:20158921
scopus:77949456442

ISSN

1745-6150

DOI

10.1186/1745-6150-5-10

language

English

LU publication?

no

id

5409bf53-0354-47d6-b0ac-a8780ccf2413

date added to LUP

2019-11-10 16:49:17

date last changed

2026-01-10 10:32:06

@article{5409bf53-0354-47d6-b0ac-a8780ccf2413,
  abstract     = {{<p>BACKGROUND: The Z-curve is a three-dimensional representation of DNA sequences that was proposed over a decade ago and has been extensively applied to sequence segmentation, horizontal gene transfer detection, and sequence analysis. Based on the Z-curve, a "genome order index" was proposed, defined as $S = a^2 + c^2 + t^2 + g^2$, where a, c, t, and g are the frequencies of the nucleotides A, C, T, and G, respectively. This index was found to be smaller than 1/3 for almost all tested genomes, which was taken as support for the existence of a constraint on genome composition. A geometric explanation for this constraint has been suggested: each genome was represented by a point P whose distances from the four faces of a regular tetrahedron were given by the frequencies a, c, t, and g. It was claimed that an inscribed sphere of radius $r = 1/\sqrt{3}$ contains almost all points corresponding to various genomes, implying that $S &lt; r^2$. The distribution of the points P obtained by S was studied using the Z-curve.</p><p>RESULTS: In this work, we studied the basic properties of the Z-curve using the "genome order index" as a case study. We show that (1) the calculation of the radius of the inscribed sphere of a regular tetrahedron is incorrect, (2) the S index is narrowly distributed, (3) based on the second parity rule, the S index can be derived directly from the Shannon entropy and is therefore redundant, and (4) the Z-curve suffers from overdimensionality, and the dimension that stands for GC content alone suffices to represent any given genome.</p><p>CONCLUSION: The "genome order index" S does not represent a constraint on nucleotide composition. Moreover, S can be easily computed from the Gini-Simpson index and derived directly from entropy, and is therefore redundant. Overall, the Z-curve and S are overcomplicated measures of GC content and the Shannon H index, respectively.</p>}},
  author       = {{Elhaik, Eran and Graur, Dan and Josić, Kresimir}},
  issn         = {{1745-6150}},
  keywords     = {{Base Composition/genetics; Base Sequence/genetics; Computer Simulation; Genome, Bacterial/genetics; Models, Genetic}},
  language     = {{eng}},
  month        = {{02}},
  publisher    = {{BioMed Central (BMC)}},
  journal      = {{Biology Direct}},
  title        = {{'Genome order index' should not be used for defining compositional constraints in nucleotide sequences--a case study of the Z-curve}},
  url          = {{http://dx.doi.org/10.1186/1745-6150-5-10}},
  doi          = {{10.1186/1745-6150-5-10}},
  volume       = {{5}},
  year         = {{2010}},
}

Lund University Publications
