Lineage-associated underrepresented permutations (LAUPs) of mammalian genomic sequences based on a Jellyfish-based LAUPs analysis application (JBLA)

Bioinformatics. 2018 Nov 1;34(21):3624-3630. doi: 10.1093/bioinformatics/bty392.

Abstract

Motivation: This study addresses several important questions related to naturally underrepresented sequences: (i) are there permutations of real genomic DNA sequences in a defined length (k-mer) and a given lineage that do not actually exist or underrepresented? (ii) If there are such sequences, what are their characteristics in terms of k-mer length and base composition? (iii) Are they related to CpG or TpA underrepresentation known for human sequences? We propose that the answers to these questions are of great significance for the study of sequence-associated regulatory mechanisms, such cytosine methylation and chromosomal structures in physiological or pathological conditions such as cancer.

Results: We empirically defined sequences that were not included in any well-known public databases as lineage-associated underrepresented permutations (LAUPs). Then, we developed a Jellyfish-based LAUPs analysis application (JBLA) to investigate LAUPs for 24 representative species. The present discoveries include: (i) lengths for the shortest LAUPs, ranging from 10 to 14, which collectively constitute a low proportion of the genome. (ii) Common LAUPs showing higher CG content over the analysed mammalian genome and possessing distinct CG*CG motifs. (iii) Neither CpG-containing LAUPs nor CpG island sequences are randomly structured and distributed over the genomes; some LAUPs and most CpG-containing sequences exhibit an opposite trend within the same k and n variants. In addition, we demonstrate that the JBLA algorithm is more efficient than the original Jellyfish for computing LAUPs.

Availability and implementation: We developed a Jellyfish-based LAUP analysis (JBLA) application by integrating Jellyfish (Marçais and Kingsford, 2011), MEME (Bailey, et al., 2009) and the NCBI genome database (Pruitt, et al., 2007) applications, which are listed as Supplementary Material.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • CpG Islands
  • Genome*
  • Genomics*
  • Humans
  • Mammals
  • Sequence Analysis, DNA