Database (Oxford). 2014 Sep 29;2014:bau093.
doi: 10.1093/database/bau093. Print 2014.

The Cancer Genomics Hub (CGHub): Overcoming Cancer Through the Power of Torrential Data

Free PMC article

Christopher Wilks et al. Database (Oxford).

Abstract

The Cancer Genomics Hub (CGHub) is the online repository of the sequencing programs of the National Cancer Institute (NCI), including The Cancer Genome Atlas (TCGA), the Cancer Cell Line Encyclopedia (CCLE) and the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) projects, with data from 25 different types of cancer. The CGHub currently contains >1.4 PB of data, has grown at an average rate of 50 TB a month and serves >100 TB per week. The architecture of CGHub is designed to support bulk searching and downloading through a Web-accessible application programming interface, to enforce patient genome confidentiality in data storage and transmission, and to optimize for efficiency in access and transfer. In this article, we describe the design of these three components, present performance results for our transfer protocol, GeneTorrent, and finally report on the growth of the system in terms of data stored and transferred, including estimated limits on the current architecture. Our experience-based estimates suggest that centralizing storage and computational resources is more efficient than wide distribution across many satellite labs. Database URL: https://cghub.ucsc.edu.
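To put the abstract's headline figures in perspective, the following is a minimal back-of-the-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes) and an average month of 365.25/12 days; the function names are illustrative and do not come from the article.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the abstract.
# Assumptions (not from the article): decimal units and an average-length month.
TB = 10**12  # bytes per terabyte, decimal
PB = 10**15  # bytes per petabyte, decimal

SECONDS_PER_WEEK = 7 * 24 * 3600
WEEKS_PER_MONTH = 365.25 / 12 / 7  # average month expressed in weeks

STORED_BYTES = 1.4 * PB        # ">1.4 PB" stored
GROWTH_PER_MONTH = 50 * TB     # average growth rate
SERVED_PER_WEEK = 100 * TB     # ">100 TB per week" served


def sustained_gbit_per_s(bytes_per_week: float) -> float:
    """Average outbound rate (Gbit/s) needed to serve a weekly volume."""
    return bytes_per_week * 8 / SECONDS_PER_WEEK / 1e9


def download_to_upload_ratio(served_per_week: float, growth_per_month: float) -> float:
    """Bytes served per month divided by bytes added per month."""
    return served_per_week * WEEKS_PER_MONTH / growth_per_month


def months_of_growth(stored: float, growth_per_month: float) -> float:
    """Months of growth at the average rate that equal the stored volume."""
    return stored / growth_per_month


if __name__ == "__main__":
    print(f"{sustained_gbit_per_s(SERVED_PER_WEEK):.2f} Gbit/s sustained")
    print(f"{download_to_upload_ratio(SERVED_PER_WEEK, GROWTH_PER_MONTH):.1f}x download/upload")
    print(f"{months_of_growth(STORED_BYTES, GROWTH_PER_MONTH):.0f} months of growth")
```

Under these assumptions, serving 100 TB per week corresponds to a sustained average of about 1.3 Gbit/s, and monthly downloads are a little under nine times the monthly growth in stored data. Both figures are lower bounds, since the abstract quotes ">100 TB per week".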

Figures

Figure 1.
CGHub and GeneTorrent (GT) conceptual system design and flow. (1) Client retrieves a list of downloadable files from the CGHub Web services. (2) Client uses GT to initiate a download. (3) Download is handed to the GT distributor after the required security checks have passed. (4) Download is distributed to multiple transfer servers within a pool of available servers. (5) Servers announce themselves to the tracker as serving the requested file(s). (6) Client gets the list of servers from the tracker. (7) Client downloads data directly from the assigned transfer servers. All sequence data are read from the distributed GPFS.
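The flow in the Figure 1 caption can be sketched as a toy in-memory model. Everything here is hypothetical: the class and function names, the shared dictionary standing in for GPFS, and the round-robin piece assignment are assumptions for illustration, not GeneTorrent's actual interfaces or scheduling. The security checks of step 3 are omitted.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TransferServer:
    """A transfer server reading from shared storage (a dict stands in for GPFS)."""
    name: str
    storage: dict

    def read_piece(self, file_id: str, index: int, piece_size: int) -> bytes:
        data = self.storage[file_id]
        return data[index * piece_size:(index + 1) * piece_size]


@dataclass
class Tracker:
    """Maps a file ID to the servers that announced it (steps 5 and 6)."""
    peers: dict = field(default_factory=lambda: defaultdict(list))

    def announce(self, file_id: str, server: TransferServer) -> None:
        self.peers[file_id].append(server)

    def get_peers(self, file_id: str) -> list:
        return list(self.peers[file_id])


def distribute(file_id: str, pool: list, tracker: Tracker, n_servers: int) -> None:
    """Steps 3-5: hand the download to several pool servers, which announce it."""
    for server in pool[:n_servers]:
        tracker.announce(file_id, server)


def download(file_id: str, size: int, tracker: Tracker, piece_size: int = 4) -> bytes:
    """Steps 6-7: get the server list, then pull pieces round-robin from the servers."""
    servers = tracker.get_peers(file_id)
    if not servers:
        raise LookupError(f"no transfer servers announced for {file_id}")
    n_pieces = -(-size // piece_size)  # ceiling division
    pieces = [
        servers[i % len(servers)].read_piece(file_id, i, piece_size)
        for i in range(n_pieces)
    ]
    return b"".join(pieces)


if __name__ == "__main__":
    shared = {"example-file": b"ACGTACGTACGTAC"}  # hypothetical file ID and content
    pool = [TransferServer(f"ts{i}", shared) for i in range(4)]
    tracker = Tracker()
    distribute("example-file", pool, tracker, n_servers=3)
    print(download("example-file", len(shared["example-file"]), tracker))
```

The file size passed to `download` would come from the metadata retrieved in step 1. Because every server reads from the same shared storage, any announced server can serve any piece, which is what lets a single download be spread across the pool.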
Figure 2.
CGHub client security transaction: (1) CGHub retrieves the authorized list of users from the NCI DAC multiple times a day. (2) The client accesses CGHub’s Web page to log in and is redirected to NIH’s InCommon federated SSO authentication page over HTTPS, where the client logs in using previously issued NIH credentials. (3) A successful authentication is sent back to the client via Shibboleth. (4) The client then sends the authentication via Shibboleth to CGHub’s Web server. (5) CGHub’s Web server then generates a unique cryptographically encoded key for the client, which has a 1-year expiration from the date of issuance, and sends it securely to the client over HTTPS. (6) The client then sends their file request with the encrypted identity key from Step 5 over HTTPS to CGHub’s security service. (7) If the client’s key is recognized and authorized to download the requested file, CGHub’s security service sends a signed temporary X.509 certificate specific to the requested file back to the client over HTTPS. (8) The client then sends the X.509 certificate from Step 7 to the CGHub data server to prove the client’s identity and authorization to download the requested file. (9) The CGHub data server verifies the certificate and establishes an SSL connection using AES-256 encryption for the file transfer to the client.
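Steps 5 through 9 of the Figure 2 caption follow a common pattern: a long-lived user key is exchanged for a short-lived, file-specific credential that the data server verifies. The sketch below illustrates that pattern only. It substitutes an HMAC-signed token for the X.509 certificate, and the secret, the one-hour credential lifetime, the token layout and all function names are assumptions, not CGHub's implementation. The SSO login (steps 2 to 4) and the SSL transfer itself are out of scope.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timedelta, timezone

SERVER_SECRET = secrets.token_bytes(32)  # hypothetical signing secret
KEY_LIFETIME = timedelta(days=365)       # "1-year expiration" from the caption
CERT_LIFETIME = timedelta(hours=1)       # assumed; the caption only says "temporary"


def _sign(payload: str) -> str:
    return hmac.new(SERVER_SECRET, payload.encode(), hashlib.sha256).hexdigest()


def _issue(fields: list, lifetime: timedelta, now: datetime) -> str:
    """Build 'field|...|expiry|signature'; fields must not contain '|'."""
    payload = "|".join(fields + [(now + lifetime).isoformat()])
    return f"{payload}|{_sign(payload)}"


def _verify(token: str, now: datetime) -> list:
    """Return the payload fields if the signature is valid and not expired."""
    payload, _, signature = token.rpartition("|")
    if not hmac.compare_digest(signature, _sign(payload)):
        raise PermissionError("bad signature")
    fields = payload.split("|")
    if datetime.fromisoformat(fields[-1]) < now:
        raise PermissionError("expired")
    return fields


def issue_key(user: str, now: datetime) -> str:
    """Step 5: issue a user key valid for one year from issuance."""
    return _issue([user], KEY_LIFETIME, now)


def issue_file_credential(key: str, file_id: str, authorized: dict, now: datetime) -> str:
    """Steps 6-7: check the key against the authorized list, then issue a per-file credential."""
    user = _verify(key, now)[0]
    if file_id not in authorized.get(user, set()):
        raise PermissionError(f"{user} is not authorized for {file_id}")
    return _issue([user, file_id], CERT_LIFETIME, now)


def data_server_accepts(credential: str, file_id: str, now: datetime) -> bool:
    """Steps 8-9: the data server checks the credential before starting a transfer."""
    try:
        fields = _verify(credential, now)
    except PermissionError:
        return False
    return len(fields) == 3 and fields[1] == file_id


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    authorized = {"alice": {"file-1"}}  # stands in for the list fetched in step 1
    key = issue_key("alice", now)
    cred = issue_file_credential(key, "file-1", authorized, now)
    print(data_server_accepts(cred, "file-1", now))
```

Binding the credential to a single file and giving it a short lifetime limits the damage if it leaks, while the year-long key spares the user from repeating the interactive login for every download.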
Figure 3.
CGHub’s multi-layered parallelized transfer architecture using GT.
Figure 4.
Comparison of transfer performance between sFTP, GridFTP and GT with varying numbers of TCP connections and GT server instances.
Figure 5.
CGHub growth measured by size (GB) of uploads, downloads and number of users per month. The y-axis is in log scale.
Figure 6.
CGHub cumulative growth: downloaded vs. stored data.
Figure 7.
CGHub’s outbound firewall usage, 1/2012–5/2014 (averaged over 24-h intervals; some precision is lost).

Similar articles

  • The UCSC Cancer Genomics Browser: update 2013.
    Goldman M, Craft B, Swatloski T, Ellrott K, Cline M, Diekhans M, Ma S, Wilks C, Stuart J, Haussler D, Zhu J. Nucleic Acids Res. 2013 Jan;41(Database issue):D949-54. doi: 10.1093/nar/gks1008. Epub 2012 Oct 29. PMID: 23109555 Free PMC article.
  • The UCSC Cancer Genomics Browser: update 2015.
    Goldman M, Craft B, Swatloski T, Cline M, Morozova O, Diekhans M, Haussler D, Zhu J. Nucleic Acids Res. 2015 Jan;43(Database issue):D812-7. doi: 10.1093/nar/gku1073. Epub 2014 Nov 11. PMID: 25392408 Free PMC article.
  • TCGA Expedition: A Data Acquisition and Management System for TCGA Data.
    Chandran UR, Medvedeva OP, Barmada MM, Blood PD, Chakka A, Luthra S, Ferreira A, Wong KF, Lee AV, Zhang Z, Budden R, Scott JR, Berndt A, Berg JM, Jacobson RS. PLoS One. 2016 Oct 27;11(10):e0165395. doi: 10.1371/journal.pone.0165395. eCollection 2016. PMID: 27788220 Free PMC article.
  • Databases and web tools for cancer genomics study.
    Yang Y, Dong X, Xie B, Ding N, Chen J, Li Y, Zhang Q, Qu H, Fang X. Genomics Proteomics Bioinformatics. 2015 Feb;13(1):46-50. doi: 10.1016/j.gpb.2015.01.005. Epub 2015 Feb 21. PMID: 25707591 Free PMC article. Review.
  • A Primer for Access to Repositories of Cancer-Related Genomic Big Data.
    Torcivia-Rodriguez J, Dingerdissen H, Chang TC, Mazumder R. Methods Mol Biol. 2019;1878:1-37. doi: 10.1007/978-1-4939-8868-6_1. PMID: 30378067 Review.

Cited by 77 articles
