Bioinformatics 29 (17), 2199-202

The Human Genome Contracts Again

Dmitri S Pavlichin et al. Bioinformatics.

Abstract

The number of human genomes that have been sequenced completely for different individuals has increased rapidly in recent years. Storing and transferring complete genomes between computers for the purpose of applying various applications and analysis tools will soon become a major hurdle, hindering the analysis phase. Therefore, there is a growing need to compress these data efficiently. Here, we describe a technique to compress human genomes based on entropy coding, using a reference genome and known Single Nucleotide Polymorphisms (SNPs). Furthermore, we explore several intrinsic features of genomes and information in other genomic databases to further improve the compression attained. Using these methods, we compress James Watson's genome to 2.5 megabytes (MB), improving on recent work by 37%. Similar compression is obtained for most genomes available from the 1000 Genomes Project. Our biologically inspired techniques promise even greater gains for genomes of lower organisms and for human genomes as more genomic data become available.

Availability: Code is available at sourceforge.net/projects/genomezip/
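
As an illustration of the general idea in the abstract (reference-based encoding plus a catalogue of known SNPs, followed by entropy coding), here is a minimal sketch. It is not the authors' implementation: every function name, the substitution-only model, and the dictionary standing in for a SNP database are assumptions made for this example. It estimates the size an ideal entropy coder would reach rather than producing an actual bitstream.

```python
import math
from collections import Counter

def diff_against_reference(reference, genome):
    """List substitutions as (position, base). Assumption: equal lengths, no indels."""
    if len(reference) != len(genome):
        raise ValueError("this sketch handles substitutions only")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, genome)) if a != b]

def split_known_novel(variants, known_snps):
    """Separate variants found in a (hypothetical) SNP catalogue from novel ones.

    known_snps maps position -> alternate base. Known variants cost one flag
    per catalogue entry; novel variants must be stored explicitly.
    """
    matched = set()
    novel = []
    for pos, base in variants:
        if known_snps.get(pos) == base:
            matched.add(pos)
        else:
            novel.append((pos, base))
    flags = [1 if pos in matched else 0 for pos in sorted(known_snps)]
    return flags, novel

def position_gaps(novel):
    """Delta-encode novel positions; small gaps repeat more often than raw offsets."""
    gaps, prev = [], 0
    for pos, _ in novel:
        gaps.append(pos - prev)
        prev = pos
    return gaps

def empirical_entropy_bits(symbols):
    """Total bits an ideal entropy coder needs under the empirical distribution."""
    n = len(symbols)
    counts = Counter(symbols)
    return sum(-c * math.log2(c / n) for c in counts.values())

def reconstruct(reference, known_snps, flags, novel):
    """Invert the encoding to show that it is lossless."""
    seq = list(reference)
    for pos, flag in zip(sorted(known_snps), flags):
        if flag:
            seq[pos] = known_snps[pos]
    for pos, base in novel:
        seq[pos] = base
    return "".join(seq)

# Toy data, invented for this example.
reference = "ACGTACGTAC"
genome = "ACCTACGTAT"
known_snps = {2: "C", 5: "A"}

variants = diff_against_reference(reference, genome)
flags, novel = split_known_novel(variants, known_snps)
gaps = position_gaps(novel)
flag_bits = empirical_entropy_bits(flags)
restored = reconstruct(reference, known_snps, flags, novel)
```

In this toy case only one variant has to be stored explicitly; the other is captured by a single flag against the catalogue, which is where a larger SNP database would be expected to help.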

Similar articles

  • ERGC: An Efficient Referential Genome Compression Algorithm
    S Saha et al. Bioinformatics 31 (21), 3468-75. PMID 26139636.
    We have done extensive experiments using five real sequencing datasets. The results on real genomes show that our proposed algorithm is indeed competitive and performs be …
  • GDC 2: Compression of Large Collections of Genomes
    S Deorowicz et al. Sci Rep 5, 11565. PMID 26108279.
    The fall of prices of the high-throughput genome sequencing changes the landscape of modern genomics. A number of large scale projects aimed at sequencing many human geno …
  • CoGI: Towards Compressing Genomes as an Image
    X Xie et al. IEEE/ACM Trans Comput Biol Bioinform 12 (6), 1275-85. PMID 26671800.
    Genomic science is now facing an explosive increase of data thanks to the fast development of sequencing technology. This situation poses serious challenges to genomic da …
  • Current Bioinformatics Tools in Genomic Biomedical Research (Review)
    A Teufel et al. Int J Mol Med 17 (6), 967-73. PMID 16685403. - Review
    On the advent of a completely assembled human genome, modern biology and molecular medicine stepped into an era of increasingly rich sequence database information and hig …
  • Ten Years of Bacterial Genome Sequencing: Comparative-Genomics-Based Discoveries
    TT Binnewies et al. Funct Integr Genomics 6 (3), 165-85. PMID 16773396. - Review
    It has been more than 10 years since the first bacterial genome sequence was published. Hundreds of bacterial genome sequences are now available for comparative genomics, …

Cited by 9 PubMed Central articles

  • Tackling the Challenges of FASTQ Referential Compression
    A Guerra et al. Bioinform Biol Insights 13, 1177932218821373. PMID 30792576.
    The exponential growth of genomic data has recently motivated the development of compression algorithms to tackle the storage capacity limitations in bioinformatics cente …
  • NRGC: A Novel Referential Genome Compression Algorithm
    S Saha et al. Bioinformatics 32 (22), 3405-3412. PMID 27485445.
    We have done rigorous experiments to evaluate NRGC by taking a set of real human genomes. The simulation results show that our algorithm is indeed an effective genome com …
  • smallWig: Parallel Compression of RNA-seq WIG Files
    Z Wang et al. Bioinformatics 32 (2), 173-80. PMID 26424856.
    We tested different variants of the smallWig compression algorithm on a number of integer-and real- (floating point) valued RNA-seq WIG files generated by the ENCODE proj …
  • ERGC: An Efficient Referential Genome Compression Algorithm
    S Saha et al. Bioinformatics 31 (21), 3468-75. PMID 26139636.
    We have done extensive experiments using five real sequencing datasets. The results on real genomes show that our proposed algorithm is indeed competitive and performs be …
  • GDC 2: Compression of Large Collections of Genomes
    S Deorowicz et al. Sci Rep 5, 11565. PMID 26108279.
    The fall of prices of the high-throughput genome sequencing changes the landscape of modern genomics. A number of large scale projects aimed at sequencing many human geno …