
Lund University Publications


Identifying compositionally homogeneous and nonhomogeneous domains within the human genome using a novel segmentation algorithm

Elhaik, Eran; Graur, Dan; Josić, Kresimir and Landan, Giddy (2010) In Nucleic Acids Research 38(15).
Abstract

It has been suggested that the mammalian genome is composed mainly of long compositionally homogeneous domains. Such domains are frequently identified using recursive segmentation algorithms based on the Jensen-Shannon divergence. However, a common difficulty with such methods is deciding when to halt the recursive partitioning and what criteria to use in deciding whether a detected boundary between two segments is real or not. We demonstrate that commonly used halting criteria are intrinsically biased, and propose IsoPlotter, a parameter-free segmentation algorithm that overcomes such biases by using a simple dynamic halting criterion and tests the homogeneity of the inferred domains. IsoPlotter was compared with an alternative segmentation algorithm, D(JS), using two sets of simulated genomic sequences. Our results show that IsoPlotter was able to infer both long and short compositionally homogeneous domains with low GC content dispersion, whereas D(JS) failed to identify short compositionally homogeneous domains and sequences with low compositional dispersion. By segmenting the human genome with IsoPlotter, we found that one-third of the genome is composed of compositionally nonhomogeneous domains and the remaining is a mixture of many short compositionally homogeneous domains and relatively few long ones.
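The recursive segmentation idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not IsoPlotter and not the D(JS) implementation: the four-letter alphabet, the function names (`shannon_entropy`, `js_divergence`, `best_split`, `segment`), the `min_len` parameter, and above all the fixed `threshold` used to stop the recursion are assumptions made for illustration only. The abstract does not specify IsoPlotter's dynamic halting criterion, so a fixed threshold stands in as a placeholder.

```python
import math
from collections import Counter


def shannon_entropy(seq, alphabet="ACGT"):
    """Shannon entropy (in bits) of the nucleotide frequencies in seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    h = 0.0
    for base in alphabet:
        p = counts.get(base, 0) / n
        if p > 0:
            h -= p * math.log2(p)
    return h


def js_divergence(seq, i):
    """Length-weighted Jensen-Shannon divergence between seq[:i] and seq[i:]."""
    n = len(seq)
    left, right = seq[:i], seq[i:]
    return (shannon_entropy(seq)
            - (len(left) / n) * shannon_entropy(left)
            - (len(right) / n) * shannon_entropy(right))


def best_split(seq, min_len=1):
    """Return (position, divergence) of the split that maximizes the divergence.

    Naive O(n^2) scan; real tools update the counts incrementally.
    """
    best_i, best_d = None, -1.0
    for i in range(min_len, len(seq) - min_len + 1):
        d = js_divergence(seq, i)
        if d > best_d:
            best_i, best_d = i, d
    return best_i, best_d


def segment(seq, threshold, offset=0, min_len=1):
    """Recursively split seq; returns a list of (start, end) domains.

    `threshold` is a hypothetical fixed halting criterion (placeholder only).
    """
    if len(seq) < 2 * min_len:
        return [(offset, offset + len(seq))]
    i, d = best_split(seq, min_len)
    if i is None or d < threshold:
        return [(offset, offset + len(seq))]
    return (segment(seq[:i], threshold, offset, min_len)
            + segment(seq[i:], threshold, offset + i, min_len))
```

For a toy sequence such as `"A" * 50 + "G" * 50`, `best_split` places the boundary at position 50, where the two halves have entirely different compositions. How to decide that a boundary found this way is real, rather than an artifact of the stopping rule, is the question the paper addresses.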

author
Elhaik, Eran; Graur, Dan; Josić, Kresimir and Landan, Giddy
publishing date
2010
type
Contribution to journal
publication status
published
keywords
Algorithms, Base Composition, Computer Simulation, Genome, Human, Genomics/methods, Humans, Isochores, Models, Genetic
in
Nucleic Acids Research
volume
38
issue
15
article number
e158
pages
9 pages
publisher
Oxford University Press
external identifiers
  • pmid:20571085
  • scopus:77956108176
ISSN
1362-4962
DOI
10.1093/nar/gkq532
language
English
LU publication?
no
id
41588645-5376-4ead-a3a8-afae4d1cfebe
date added to LUP
2019-11-10 16:48:16
date last changed
2024-01-01 23:32:46
@article{41588645-5376-4ead-a3a8-afae4d1cfebe,
  abstract     = {{It has been suggested that the mammalian genome is composed mainly of long compositionally homogeneous domains. Such domains are frequently identified using recursive segmentation algorithms based on the Jensen-Shannon divergence. However, a common difficulty with such methods is deciding when to halt the recursive partitioning and what criteria to use in deciding whether a detected boundary between two segments is real or not. We demonstrate that commonly used halting criteria are intrinsically biased, and propose IsoPlotter, a parameter-free segmentation algorithm that overcomes such biases by using a simple dynamic halting criterion and tests the homogeneity of the inferred domains. IsoPlotter was compared with an alternative segmentation algorithm, D(JS), using two sets of simulated genomic sequences. Our results show that IsoPlotter was able to infer both long and short compositionally homogeneous domains with low GC content dispersion, whereas D(JS) failed to identify short compositionally homogeneous domains and sequences with low compositional dispersion. By segmenting the human genome with IsoPlotter, we found that one-third of the genome is composed of compositionally nonhomogeneous domains and the remaining is a mixture of many short compositionally homogeneous domains and relatively few long ones.}},
  author       = {{Elhaik, Eran and Graur, Dan and Josić, Kresimir and Landan, Giddy}},
  issn         = {{1362-4962}},
  keywords     = {{Algorithms; Base Composition; Computer Simulation; Genome, Human; Genomics/methods; Humans; Isochores; Models, Genetic}},
  language     = {{eng}},
  number       = {{15}},
  publisher    = {{Oxford University Press}},
  series       = {{Nucleic Acids Research}},
  title        = {{Identifying compositionally homogeneous and nonhomogeneous domains within the human genome using a novel segmentation algorithm}},
  url          = {{http://dx.doi.org/10.1093/nar/gkq532}},
  doi          = {{10.1093/nar/gkq532}},
  volume       = {{38}},
  year         = {{2010}},
}