CyVerse for Reproducible Research: RNA-Seq Analysis

Methods Mol Biol. 2022:2443:57-79. doi: 10.1007/978-1-0716-2067-0_3.

Abstract

Posing complex research questions poses complex reproducibility challenges. Datasets may need to be managed over long periods of time. Reliable and secure repositories are needed for data storage. Sharing big data requires advance planning and becomes complex when collaborators are spread across institutions and countries. Many complex analyses require larger compute resources that only cloud and high-performance computing infrastructure can provide. Finally, at publication, funder and publisher requirements for data availability, accessibility, and computational reproducibility must be met. For all of these reasons, cloud-based cyberinfrastructures are an important component for satisfying the needs of data-intensive research. Learning how to incorporate these technologies into your research skill set will allow you to work with data analysis challenges that are often beyond the resources of individual research institutions. One of the advantages of CyVerse is that it offers many solutions for high-powered analyses that do not require knowledge of command-line (i.e., Linux) computing. In this chapter, we highlight CyVerse capabilities by analyzing RNA-Seq data. The lessons learned will translate to doing RNA-Seq in other computing environments and will focus on how CyVerse infrastructure supports reproducibility goals (e.g., metadata management, containers), team science (e.g., data sharing features), and flexible computing environments (e.g., interactive computing, scaling).

Keywords: Cloud computing; Containers; Cyberinfrastructure; Data life cycle; Kallisto; Metadata; RNA-Seq; Reproducible research; Workflow management.

MeSH terms

  • Big Data*
  • Cloud Computing
  • Data Analysis
  • RNA-Seq
  • Reproducibility of Results
  • Software*