CWig: compressed representation of Wiggle/BedGraph format

Bioinformatics. 2014 Sep 15;30(18):2543-50. doi: 10.1093/bioinformatics/btu330. Epub 2014 May 27.

Abstract

Motivation: BigWig, a format for representing read density data, is one of the most popular data types. BigWig files can represent peak intensity in ChIP-seq, transcript expression in RNA-seq, copy number variation in whole genome sequencing, etc. The UCSC ENCODE project uses the bigWig format heavily for storage and visualization. Of the 5.2 TB ENCODE hg19 database, 1.6 TB (31% of the total space) is used to store bigWig files. The bigWig format not only saves a lot of space but also supports fast queries, which are crucial for interactive analysis and browsing. In our benchmark, bigWig files were often similar in size to the gzipped raw data while still supporting ~5000 random queries per second.
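For illustration only, the following is a minimal Python sketch of what a "random query" over density data involves: looking up the value that covers a single genomic position. It assumes bedGraph-style records (chromosome, 0-based start, exclusive end, value) held in memory and uses a plain binary search; the sample data and the names `parse_bedgraph` and `query` are hypothetical, and this is not the on-disk indexing scheme of bigWig or CWig.

```python
from bisect import bisect_right

# Hypothetical bedGraph-style records: chrom, start (0-based), end (exclusive), value.
BEDGRAPH_TEXT = (
    "chr1\t0\t100\t0.0\n"
    "chr1\t100\t250\t3.5\n"
    "chr1\t250\t400\t1.0\n"
)


def parse_bedgraph(text):
    """Return {chrom: (starts, ends, values)} with intervals sorted by start."""
    rows_by_chrom = {}
    for line in text.splitlines():
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()
        rows_by_chrom.setdefault(chrom, []).append(
            (int(start), int(end), float(value))
        )
    index = {}
    for chrom, rows in rows_by_chrom.items():
        rows.sort()
        index[chrom] = (
            [r[0] for r in rows],
            [r[1] for r in rows],
            [r[2] for r in rows],
        )
    return index


def query(index, chrom, pos):
    """Return the value of the interval covering pos, or None if uncovered."""
    if chrom not in index:
        return None
    starts, ends, values = index[chrom]
    # Rightmost interval whose start is <= pos.
    i = bisect_right(starts, pos) - 1
    if i >= 0 and pos < ends[i]:
        return values[i]
    return None


index = parse_bedgraph(BEDGRAPH_TEXT)
```

Under these assumptions, `query(index, "chr1", 150)` looks up the interval starting at 100 and returns its value. A compressed format has to answer the same kind of lookup without decompressing the whole file, which is where the trade-off between size and query speed arises.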

Results: Although bigWig is good enough at the moment, both storage space and query time are expected to become limiting as sequencing gets cheaper. This article describes CWig, a new method for storing density data. The format uses on average one-third of the space of existing bigWig files and improves random query speed by up to 100 times.

Availability and implementation: http://genome.ddns.comp.nus.edu.sg/~cwig.

MeSH terms

  • DNA Copy Number Variations
  • Data Compression / methods*
  • Data Mining
  • Databases, Genetic*
  • Genomics / methods*
  • High-Throughput Nucleotide Sequencing
  • Humans
  • Sequence Analysis, RNA