Accurate Prediction of the Statistics of Repetitions in Random Sequences: A Case Study in Archaea Genomes

Front Bioeng Biotechnol. 2016 Jun 8:4:35. doi: 10.3389/fbioe.2016.00035. eCollection 2016.

Abstract

Repetitive patterns in genomic sequences have great biological significance as well as algorithmic implications. Analytic combinatorics allows one to derive formulas for the expected length of repetitions in a random sequence. The resulting asymptotic results, which generalize previous work on a binary alphabet, are easy to compute. Simulations on random sequences confirm their accuracy. As an application, the sample case of Archaea genomes illustrates how biological sequences may differ from random sequences.
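The kind of simulation mentioned above can be illustrated with a minimal Python sketch. It is not the authors' code and does not reproduce their formulas: it assumes a uniform, independent model over the alphabet ACGT, measures only the longest repeated substring (overlapping occurrences allowed), and compares the result with the classical heuristic 2 log n / log |A|, used here purely as a rough reference point. The function names (`random_sequence`, `longest_repeat_length`, `mean_longest_repeat`) are hypothetical.

```python
import math
import random


def random_sequence(n, alphabet="ACGT", seed=None):
    """Draw n letters independently and uniformly (assumed model)."""
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(n))


def has_repeated_kmer(seq, k):
    """Return True if some k-mer occurs at two different positions."""
    seen = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in seen:
            return True
        seen.add(kmer)
    return False


def longest_repeat_length(seq):
    """Length of the longest substring occurring at least twice.

    A repeated k-mer implies a repeated (k-1)-mer, so the property is
    monotone in k and a binary search over k is valid.
    """
    lo, hi = 0, len(seq) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_repeated_kmer(seq, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


def mean_longest_repeat(n, trials, alphabet="ACGT", seed=0):
    """Average the longest-repeat length over several random sequences."""
    total = 0
    for t in range(trials):
        total += longest_repeat_length(random_sequence(n, alphabet, seed + t))
    return total / trials


if __name__ == "__main__":
    n, trials = 20000, 10
    observed = mean_longest_repeat(n, trials)
    # Classical first-order heuristic for a uniform model; not the
    # refined asymptotics derived in the article.
    heuristic = 2 * math.log(n) / math.log(4)
    print(f"mean longest repeat: {observed:.2f}, heuristic: {heuristic:.2f}")
```

Running the same `longest_repeat_length` function on a real genome and comparing it with the random baseline is one simple way to see how a biological sequence departs from the random model; a suffix array or suffix tree would be the better choice for genome-scale inputs.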

Keywords: K-mers; combinatorics; probability.