SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines

Samuel Lampa et al.

Gigascience. 2019 May 1;8(5):giz044. doi: 10.1093/gigascience/giz044.

Abstract

Background: The complex nature of biological data has driven the development of specialized software tools. Scientific workflow management systems simplify the assembly of such tools into pipelines, assist with job automation, and aid reproducibility of analyses. Many contemporary workflow tools are either specialized or not designed for highly complex workflows, such as those with nested loops, dynamic scheduling, and parametrization, which are common in, e.g., machine learning.

Findings: SciPipe is a workflow programming library implemented in the programming language Go, for managing complex and dynamic pipelines in bioinformatics, cheminformatics, and other fields. SciPipe helps in particular with workflow constructs common in machine learning, such as extensive branching, parameter sweeps, and dynamic scheduling and parametrization of downstream tasks. SciPipe builds on flow-based programming principles to support agile development of workflows based on a library of self-contained, reusable components. It supports running subsets of workflows for improved iterative development and provides a data-centric audit logging feature that saves a full audit trace for every output file of a workflow, which can be converted to other formats such as HTML, TeX, and PDF on demand. The utility of SciPipe is demonstrated with a machine learning pipeline, a genomics pipeline, and a transcriptomics pipeline.

Conclusions: SciPipe provides a solution for agile development of complex and dynamic pipelines, especially in machine learning, through a flexible application programming interface suitable for scientists used to programming or scripting.

Keywords: Go; Golang; flow-based programming; machine learning; pipelines; reproducibility; scientific workflow management systems.


Figures

Figure 1
A simple example workflow implemented with SciPipe. The workflow computes the reverse base complement of a string of DNA, using standard UNIX tools. The workflow is a Go program, meant to be saved in a file with the .go extension and executed with the go run command. On line 4, the SciPipe library is imported, to be later accessed as scipipe. On line 7, a short string of DNA is defined. On lines 9–33, the full workflow is implemented in the program's main() function, meaning that it will be executed when the resulting program is run. On line 11, a new workflow object (or "struct" in Go terms) is initiated with a name and the maximum number of cores to use. On lines 15–25, the workflow components, or "processes," are initiated, each with a name and a shell command pattern. Input file names are defined with placeholders of the form {i:INPORTNAME} and outputs with placeholders of the form {o:OUTPORTNAME}. The port name is used later to access the corresponding ports when setting up data dependencies. On line 16, a component that writes the previously defined DNA string to a file is initiated, and on line 17, the file path pattern for its out-port "dna" is defined (in this case a static file name). On line 20, a component that translates each DNA base to its complementary counterpart is initiated. On line 21, the file path pattern for its only out-port is defined; in this case, it reuses the file path of the file it receives on its in-port named "in," hence the {i:in} part. The %.txt part removes .txt from the input path. On line 24, a component that reverses the DNA string is initiated. On lines 27–29, data dependencies are defined via the in- and out-ports declared earlier as part of the shell command patterns. On line 32, the workflow is run. This code example is also available in the SciPipe source code repository (Lampa [29]).
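For readers without access to the figure image, the listing below is a minimal sketch of a workflow along the lines described in this caption. It is not the original listing and does not reproduce its line numbering; the DNA sequence, process names, and file names are illustrative assumptions, while the calls used (NewWorkflow, NewProc, SetOut, In, Out, From, Run) follow the public SciPipe API.

package main

import (
	// Import the SciPipe library, accessed below as scipipe.
	"github.com/scipipe/scipipe"
)

// A short, illustrative string of DNA (the actual sequence is an assumption).
const dna = "AAAGCCCGTGGGGGACCTGTTC"

func main() {
	// Initiate a workflow with a name and the maximum number of cores to use.
	wf := scipipe.NewWorkflow("dna_base_complement", 4)

	// Write the DNA string to a file. {o:dna} declares an out-port named "dna".
	makeDNA := wf.NewProc("make_dna", "echo "+dna+" > {o:dna}")
	makeDNA.SetOut("dna", "dna.txt") // Static file path for the "dna" out-port.

	// Translate each base to its complement. {i:in} declares an in-port named "in".
	complmt := wf.NewProc("base_complement", "cat {i:in} | tr ATCG TAGC > {o:compl}")
	// Reuse the input path, with %.txt stripping the .txt suffix before extending it.
	complmt.SetOut("compl", "{i:in|%.txt}.compl.txt")

	// Reverse the complemented DNA string.
	reverse := wf.NewProc("reverse", "cat {i:in} | rev > {o:rev}")
	reverse.SetOut("rev", "{i:in|%.txt}.rev.txt")

	// Define data dependencies by connecting out-ports to in-ports.
	complmt.In("in").From(makeDNA.Out("dna"))
	reverse.In("in").From(complmt.Out("compl"))

	// Run the workflow.
	wf.Run()
}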
Figure 2
Example audit log file in JSON format [30] for a file produced by a SciPipe workflow; the particular workflow used to produce this audit log is the one in Fig. 1. The audit information is hierarchical, with each level representing a step in the workflow. The first level contains metadata about the task executed last, which produced the output file that this audit log refers to. The Upstream field on each level contains a list of all tasks upstream of the current task, indexed by the file paths that the upstream tasks produced and that were subsequently used by the current task. Each task is given a globally unique ID, which helps to deduplicate any duplicate occurrences of tasks when converting the log to other representations. Execution time is given in nanoseconds. Note that input paths in the command field are prefixed with ../, compared with how they appear in the Upstream field. This is because each task is executed in a temporary directory created directly under the workflow's main execution directory, so to access existing data files the task first has to navigate up one step out of this temporary directory.
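As a rough illustration of the structure described above, the following Go type sketches how one level of such a hierarchical audit log could be modelled. The field names are inferred from this caption only; the actual SciPipe audit schema contains additional fields (e.g., process name, parameters, and timestamps) and may use different names.

// Sketch of one level of an audit-log entry (field names are assumptions based on the caption).
type AuditInfo struct {
	ID         string                // Globally unique ID for the task
	Command    string                // Shell command executed for the task
	ExecTimeNS int64                 // Execution time in nanoseconds
	Upstream   map[string]*AuditInfo // Upstream tasks, indexed by the output file path each produced
}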
Figure 3
Comparison between dataflow variables and FBP ports in terms of dependency definition. A, How dataflow variables (blue and green) shared between processes (in grey) make the processes tightly coupled. In other words, process and network definitions get intermixed. B, How ports (in orange) bound to processes in FBP allow the network definition to be kept separate from process definitions. This enables processes to be reconnected freely without changing their internals.
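To make the separation in panel B concrete in SciPipe terms, the sketch below (with hypothetical process names, commands, and file names) shows how each process definition declares only its own ports, while a separate wiring step connects those ports into a network; rewiring the network does not require touching the process definitions.

package main

import (
	"github.com/scipipe/scipipe"
)

func main() {
	wf := scipipe.NewWorkflow("wiring_example", 4)

	// Process definitions: each process declares only its own in-ports ({i:...})
	// and out-ports ({o:...}); it never names any other process.
	hello := wf.NewProc("hello", "echo hello > {o:out}")
	hello.SetOut("out", "hello.txt")

	upper := wf.NewProc("upper", "tr '[:lower:]' '[:upper:]' < {i:in} > {o:out}")
	upper.SetOut("out", "{i:in|%.txt}.upper.txt")

	// Network definition: kept separate from the process definitions above, so
	// processes can be reconnected freely without changing their internals.
	upper.In("in").From(hello.Out("out"))

	wf.Run()
}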
Figure 4
Audit report for the last file generated by the code example in Fig. 1, converted to TeX with SciPipe's experimental audit2tex feature and then converted to PDF with pdfTeX. At the top, the PDF file includes summary information about the SciPipe version used and the total execution time. This is followed by an execution timeline, in a Gantt chart style, that graphically shows the relative execution times of individual tasks, and then by a comprehensive list of tables with information for each task executed towards producing the file to which the audit report belongs. The task boxes are color-coded and ordered in the same way that the tasks appear in the timeline.
Figure 5
Directed graph of the machine learning in drug discovery case study workflow, plotted with SciPipe's workflow plotting function. The graph has been modified for clarity by collapsing the individual branches of the parameter sweeps and the cross-validation fold generation, and the layout has been manually made more compact to be viewable in print. The collapsed branches are indicated by intervals in the box labels: tr{500-8000} represents branching into training data set sizes 500, 1,000, 2,000, 4,000, and 8,000; c{0.0001-5.0000} represents cost values 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 2, 3, 4, and 5; and fld{1-10} represents cross-validation folds 1–10. Nodes represent processes, while edges represent data dependencies. The labels on the edge heads and tails represent ports. Solid lines represent data dependencies via files, while dashed lines represent data dependencies via parameters, which are not persisted to file but only transmitted via RAM.
Figure 6
Directed graph of workflow processes in the Genomics/Cancer Analysis pre-processing pipeline, plotted with SciPipe’s workflow plotting function. Nodes represent processes, while edges represent data dependencies. The labels on the edge heads and tails represent ports.
Figure 7
Directed graph of workflow processes in the RNA-Seq Pre-processing workflow, plotted with SciPipe’s workflow plotting function. Nodes represent processes, while edges represent data dependencies. The labels on the edge heads and tails represent ports.
