Genome Res. 2016 Dec;26(12):1687-1696.
doi: 10.1101/gr.206870.116. Epub 2016 Oct 27.

A Privacy-Preserving Solution for Compressed Storage and Selective Retrieval of Genomic Data

Free PMC article

Zhicong Huang et al. Genome Res. 2016.

Abstract

In clinical genomics, the continuous evolution of bioinformatic algorithms and sequencing platforms makes it beneficial to store patients' complete aligned genomic data in addition to variant calls relative to a reference sequence. Due to the large size of human genome sequence data files (varying from 30 GB to 200 GB depending on coverage), two major challenges facing genomics laboratories are the costs of storage and the efficiency of the initial data processing. In addition, privacy of genomic data is becoming an increasingly serious concern, yet no standard data storage solutions exist that enable compression, encryption, and selective retrieval. Here we present a privacy-preserving solution named SECRAM (Selective retrieval on Encrypted and Compressed Reference-oriented Alignment Map) for the secure storage of compressed aligned genomic data. Our solution enables selective retrieval of encrypted data and improves the efficiency of downstream analysis (e.g., variant calling). Compared with BAM, the de facto standard for storing aligned genomic data, SECRAM uses 18% less storage. Compared with CRAM, one of the most compressed nonencrypted formats (using 34% less storage than BAM), SECRAM maintains efficient compression and downstream data processing, while allowing for unprecedented levels of security in genomic data storage. Compared with previous work, the distinguishing features of SECRAM are that (1) it is position-based instead of read-based, and (2) it allows random querying of a subregion from a BAM-like file in an encrypted form. Our method thus offers a space-saving, privacy-preserving, and effective solution for the storage of clinical genomic data.
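The two storage figures in the abstract are both stated relative to BAM, so the SECRAM-to-CRAM gap has to be derived. A minimal arithmetic sketch follows, assuming both percentages refer to the same BAM baseline; the 30 GB file size is only a hypothetical example taken from the low end of the range quoted above.

```python
# Relative storage, normalised so that BAM = 1.0.
# Assumption: both reductions are measured against the same BAM file.
BAM = 1.0
SECRAM = BAM * (1 - 0.18)  # "SECRAM uses 18% less storage" than BAM
CRAM = BAM * (1 - 0.34)    # CRAM uses "34% less storage than BAM"

# Overhead of SECRAM relative to CRAM.
secram_vs_cram = SECRAM / CRAM - 1
print(f"SECRAM is about {secram_vs_cram:.1%} larger than CRAM")

# Hypothetical 30 GB BAM file, for scale only.
bam_gb = 30
print(f"BAM {bam_gb} GB, SECRAM {bam_gb * SECRAM:.1f} GB, CRAM {bam_gb * CRAM:.1f} GB")
```

Under that assumption, SECRAM costs roughly a quarter more space than CRAM while remaining smaller than BAM.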

Figures

Figure 1.
SECRAM framework for compressed, encrypted storage of genomic data. (A) The sequencing read-based format used in BAM is transposed into a genome position-based format (B). The read-based format can be reconstructed from the position-based format via inverse transposition. (C) The position-based storage is compressed and decompressed using a reference-based compression technique. In the table, “S-G” stands for substitution with base “G”, “D-3” stands for deletion of three bases, and “I-AT” stands for insertion of two bases “AT.” (D) The compressed position-based storage is encrypted to generate the final SECRAM format using order-preserving encryption (OPE) and a traditional symmetric encryption (SE) scheme. “OPE(POSITION)” represents the OPE ciphertext of POSITION, and “SE(VARIANT)” represents the SE ciphertext of VARIANT. Metadata are not encrypted (e.g., quality scores, mapping quality, read name). The compressed format is recovered from the encrypted format by running the respective decryption algorithms. Our encryption enables efficient selective retrieval. For instance, if a user wants to retrieve data in the range [10, 24], the database executes a normal search between OPE(10) and OPE(24) based on the order-preserving property of OPE; in the example shown, two positions, OPE(12) and OPE(23), are returned.
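The transposition and reference-based compression steps of panels A to C can be illustrated with a small sketch. This is not the SECRAM implementation: it assumes gap-free reads (substitutions only, so no “D-3” or “I-AT” entries), and the reference string, read names, and function names are all hypothetical.

```python
from collections import defaultdict

REFERENCE = "ACGTACGTACGT"  # hypothetical reference, 0-based positions

# Hypothetical gap-free aligned reads: (read name, start position, bases).
READS = [
    ("r1", 0, "ACGTAC"),  # matches the reference exactly
    ("r2", 2, "GTGCGT"),  # differs at position 4 (A -> G)
    ("r3", 5, "CGTTCG"),  # differs at position 8 (A -> T)
]

def transpose(reads, reference):
    """Read-based -> position-based, keeping only differences from the reference."""
    by_position = defaultdict(list)
    for name, start, bases in reads:
        for offset, base in enumerate(bases):
            pos = start + offset
            if base != reference[pos]:
                by_position[pos].append((name, f"S-{base}"))  # substitution code
    return dict(sorted(by_position.items()))

def inverse_transpose(by_position, read_index, reference):
    """Position-based -> read-based, using the reference and each read's span."""
    rebuilt = {
        name: list(reference[start:start + length])
        for name, (start, length) in read_index.items()
    }
    for pos, entries in by_position.items():
        for name, variant in entries:
            start, _ = read_index[name]
            rebuilt[name][pos - start] = variant.split("-", 1)[1]
    return [(name, read_index[name][0], "".join(b)) for name, b in rebuilt.items()]

# Per-read metadata (start, length) kept alongside the position table.
read_index = {name: (start, len(bases)) for name, start, bases in READS}
position_table = transpose(READS, REFERENCE)
print(position_table)  # {4: [('r2', 'S-G')], 8: [('r3', 'S-T')]}

# The round trip recovers the original reads.
assert inverse_transpose(position_table, read_index, REFERENCE) == READS
```

Only positions where a read disagrees with the reference produce an entry, which is where the space saving of a reference-based format comes from in this toy setting.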
Figure 2.
Storage analysis of compression algorithms for genomic data. (A) Storage comparison on simulated data sets, showing the average number of bits per base (compression ratio) for three storage formats: BAM, CRAM, and SECRAM. The results are based on simulated data sets with different coverages (from 1× to 50×) and error rates (0.01%, 0.1%, 1%) for both paired and unpaired data. (B) Storage comparison on Chromosome 11 from the 1000 Genomes Project participants with an average coverage of 3×. Only SECRAM is an encrypted format; BAM and CRAM are both plaintext. Both A and B show that the compression ratio of SECRAM lies between those of BAM and CRAM.
Figure 3.
Runtime analysis of SECRAM system for storing and retrieving genomic data. (A) Runtime breakdowns on simulated data sets for the six most important procedures in SECRAM: transposition, inverse transposition, compression, decompression, encryption, and decryption, relative to total runtime (=100%). The experiments were repeated for four simulated data sets with low (1×)/high (50×) coverages and low (0.01%)/high (1%) error rates. In all cases, the runtime cost of encryption/decryption is comparable with that of the other necessary procedures, showing that enforcing security does not result in significant performance overhead. (B) Conversion time on real data sets (same data sets as those in Fig. 2B), showing the average conversion time between the BAM and SECRAM formats, running with a single thread on a machine with Mac OS X Yosemite and a 3.1-GHz Intel Core i7 processor. The black bars (transposition, compression, encryption) represent the three steps of conversion from BAM to SECRAM, whereas the light gray bars (decryption, decompression, inverse transposition) represent the three steps of conversion from SECRAM to BAM. Each step takes <0.25 sec per megabyte of data. (C) Retrieval time on real data sets (same data sets as those in Fig. 2B), showing the average runtime cost of retrieving data within a range of 1 million genomic positions. The actual data size corresponding to 1 million positions depends on the coverage. Shown are experiments on real data sets from Figure 2B with a coverage of about 3× and a size of slightly >1 megabyte.
Figure 4.
(A) Storage and (B) runtime on high-coverage clinical data. The data come from a public cell line (NA12878), based on a gene panel that includes the CFTR gene, queried for diagnosing cystic fibrosis. The average coverage of these data is 1035×, containing more than 4 million reads of length around 300 bases. ARHMH2013 (Ayday et al. 2013) is a privacy-preserving solution on BAM files that does not address the compression requirement. We observe a consistent compression performance of SECRAM on high-coverage clinical sequence data. Moreover, querying for the CFTR gene on SECRAM takes less than twice as long as on nonencrypted CRAM.
Figure 5.
Solutions for encrypting genomic data. (A) A standard SE solution on CRAM-compressed data. It leads to potential data leakage when querying specific genomic regions. The encryption is performed over each individual data block. As a block in CRAM usually contains multiple sequencing reads, it is usually the case that the retrieved block will reveal reads or positions that are not in the query position range. (B) SECRAM encryption. Our solution encrypts each block in SECRAM format based on positions. Position information is encrypted with OPE; the compressed content at each position, with SE. This ensures that only information corresponding to the query position range is retrieved and decrypted. “OPE(POSITION)” represents the OPE ciphertext of POSITION, and “SE(VARIANT)” represents the SE ciphertext of VARIANT. Metadata are not encrypted (e.g., quality scores, mapping quality, read name). Note that OPE preserves the order of the positions, namely, 23596 < 50723 < 71641 because 7 < 12 < 23, but SE encrypts the original message to a random string, for example, from “S-G” to “jkljsdfoy4r5.”
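The selective-retrieval idea (an ordinary range search over order-preserving ciphertexts) can be sketched as follows. This is only an illustration: `toy_ope` is a hypothetical, insecure stand-in that is merely strictly increasing, not the OPE scheme used by SECRAM; the key, the assignment of variants to positions, and the `SE(...)` strings are placeholders for real symmetric ciphertexts.

```python
import bisect
import hashlib
import hmac

def toy_ope(position: int, key: bytes) -> int:
    """Toy order-preserving map: a keyed running sum of positive gaps.

    Strictly increasing in `position`, so comparisons carry over to
    ciphertexts. Illustration only; NOT a secure OPE construction.
    """
    total = 0
    for i in range(1, position + 1):
        digest = hmac.new(key, i.to_bytes(8, "big"), hashlib.sha256).digest()
        total += 1 + int.from_bytes(digest[:2], "big")  # every gap is >= 1
    return total

KEY = b"demo-key"  # hypothetical key held by the client

# Hypothetical records: plaintext position -> opaque SE payload placeholder.
records = {7: "SE(S-G)", 12: "SE(D-3)", 23: "SE(I-AT)"}

# Server side: only ciphertext positions and opaque payloads are stored, sorted.
store = sorted((toy_ope(pos, KEY), blob) for pos, blob in records.items())
cipher_positions = [c for c, _ in store]

def range_query(lo_cipher: int, hi_cipher: int):
    """Plain binary search on ciphertexts; the server never sees plaintext positions."""
    left = bisect.bisect_left(cipher_positions, lo_cipher)
    right = bisect.bisect_right(cipher_positions, hi_cipher)
    return store[left:right]

# Client side: encrypt the endpoints of the range [10, 24] and send them.
hits = range_query(toy_ope(10, KEY), toy_ope(24, KEY))
print([blob for _, blob in hits])  # ['SE(D-3)', 'SE(I-AT)']
```

As in the range [10, 24] example of Figure 1, only the entries at positions 12 and 23 are returned, and the entry at position 7 is never touched or decrypted.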

Cited by 2 articles

  • Predicting the Occurrence of Variants in RAG1 and RAG2.
    Lawless D, Lango Allen H, Thaventhiran J; NIHR BioResource–Rare Diseases Consortium, Hodel F, Anwar R, Fellay J, Walter JE, Savic S. J Clin Immunol. 2019 Oct;39(7):688-701. doi: 10.1007/s10875-019-00670-z. Epub 2019 Aug 6. PMID: 31388879. Free PMC article.
  • Privacy-preserving techniques of genomic data-a survey.
    Aziz MMA, Sadat MN, Alhadidi D, Wang S, Jiang X, Brown CL, Mohammed N. Brief Bioinform. 2019 May 21;20(3):887-895. doi: 10.1093/bib/bbx139. PMID: 29121240. Free PMC article. Review.
