AC: A Compression Tool for Amino Acid Sequences

Interdiscip Sci. 2019 Mar;11(1):68-76. doi: 10.1007/s12539-019-00322-1. Epub 2019 Feb 5.


Advances in protein sequencing technologies have led to the production of a huge volume of data that needs to be stored and transmitted. This challenge can be tackled by compression. In this paper, we propose AC, a state-of-the-art method for lossless compression of amino acid sequences. The proposed method is based on the cooperation between finite-context models and substitutional tolerant Markov models. Compared to several general-purpose and special-purpose protein compressors, AC provides the best bit-rates. It also compresses the sequences nine times faster than its competitor, paq8l. In addition, employing AC, we analyze the compressibility of a large number of sequences from different domains. The results show that viral sequences are the most difficult to compress, archaeal and bacterial sequences are the second most difficult, and eukaryotic sequences are the easiest.
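
To illustrate what a finite-context model does, the following is a minimal sketch, not the AC implementation. It assumes a single adaptive model of fixed order over the 20 standard amino acids with additive smoothing, and it reports the average number of bits per symbol that an ideal arithmetic coder would spend. The names `ALPHABET`, `fcm_bits_per_symbol`, `order`, and `alpha` are hypothetical, and neither the substitutional tolerant Markov models nor the cooperation between models described in the abstract is reproduced here.

```python
import math
from collections import defaultdict

# Assumption for this sketch: input uses only the 20 standard one-letter codes.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def fcm_bits_per_symbol(seq, order=2, alpha=1.0):
    """Average code length (bits per symbol) under an adaptive finite-context model.

    Each symbol is predicted from the `order` symbols preceding it, using the
    counts seen so far plus an additive smoothing term `alpha`. The model is
    updated after every symbol, so a decoder could mirror the same steps.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    totals = defaultdict(int)                       # context -> total count
    size = len(ALPHABET)
    total_bits = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - order):i]  # shorter context at the sequence start
        p = (counts[ctx][sym] + alpha) / (totals[ctx] + alpha * size)
        total_bits += -math.log2(p)     # ideal code length for this symbol
        counts[ctx][sym] += 1           # update the model after coding
        totals[ctx] += 1
    return total_bits / len(seq) if seq else 0.0


if __name__ == "__main__":
    repetitive = "ACDE" * 50
    print(f"{fcm_bits_per_symbol(repetitive):.3f} bits per symbol")
    print(f"{math.log2(len(ALPHABET)):.3f} bits per symbol (uniform baseline)")
```

A highly repetitive sequence should score well below the uniform baseline of log2(20) bits per symbol, while a sequence with little local structure stays close to it. This is the sense in which a bit-rate measures compressibility.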

Keywords: Compression; Finite-context model; Kolmogorov complexity; Protein; Substitutional tolerant Markov model.

MeSH terms

  • Algorithms*
  • Amino Acid Sequence
  • Data Compression*
  • High-Throughput Nucleotide Sequencing / methods*
  • Markov Chains
  • Software*