Bioinformatics. 2015 Jan 1;31(1):10-6.
doi: 10.1093/bioinformatics/btu595. Epub 2014 Sep 3.

BigDataScript: a scripting language for data pipelines

Pablo Cingolani et al. Bioinformatics. 2015.

Abstract

Motivation: The analysis of large biological datasets often requires complex processing pipelines that run for a long time on large computational infrastructures. We designed and implemented a simple script-like programming language with a clean and minimalist syntax to develop and manage pipeline execution and provide robustness to various types of software and hardware failures as well as portability.

Results: We introduce the BigDataScript (BDS) programming language for data processing pipelines, which improves abstraction from hardware resources and assists with robustness. Hardware abstraction allows BDS pipelines to run without modification on a wide range of computer architectures, from a small laptop to multi-core servers, server farms, clusters and clouds. BDS achieves robustness by incorporating the concepts of absolute serialization and lazy processing, thus allowing pipelines to recover from errors. By abstracting pipeline concepts at programming language level, BDS simplifies implementation, execution and management of complex bioinformatics pipelines, resulting in reduced development and debugging cycles as well as cleaner code.

Availability and implementation: BigDataScript is available under an open-source license at http://pcingola.github.io/BigDataScript.
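
The abstract names lazy processing as one of the ways BDS recovers from errors but does not spell out the mechanism. A minimal sketch of one common interpretation, in the style of Make, is shown below in Python: a step is skipped when its outputs already exist and are newer than its inputs, so re-running a failed pipeline repeats only the unfinished work. The function names `needs_update` and `run_lazily` are hypothetical, and the timestamp rule is an assumption for illustration, not the BDS implementation.

```python
import os


def needs_update(outputs, inputs):
    """Return True if the step that produces `outputs` from `inputs` should run.

    Assumed rule (Make-style): run when any output is missing or empty,
    or when the oldest output is older than the newest existing input.
    """
    if not outputs:
        return True
    for out in outputs:
        if not os.path.exists(out) or os.path.getsize(out) == 0:
            return True
    existing_inputs = [p for p in inputs if os.path.exists(p)]
    if not existing_inputs:
        # Nothing to compare against; the outputs are treated as up to date.
        return False
    newest_input = max(os.path.getmtime(p) for p in existing_inputs)
    oldest_output = min(os.path.getmtime(p) for p in outputs)
    return oldest_output < newest_input


def run_lazily(outputs, inputs, action):
    """Call `action()` only if the outputs are out of date; report whether it ran."""
    if needs_update(outputs, inputs):
        action()
        return True
    return False
```

With a rule like this, restarting a pipeline after a crash is cheap: steps whose outputs are complete and current return immediately, and only the interrupted or downstream steps are executed again.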

Figures

Listing 1.
pipeline.bds program. A simple pipeline example featuring a task invoking a fictitious command ‘myProcess’ defined to require 2 CPUs and a maximum of 6 h of execution time (Line 5)
Listing 2.
pipeline_2.bds program. A two-step pipeline with task dependencies. The first step (line 9) runs the ‘myProcess’ command on a hundred input files, which can be processed in parallel. The second step (line 19) processes the output of those hundred files and creates a single output file (using the fictitious ‘myProcessAll’ command). Note that the hardware is never stated explicitly: (i) on a dual-core computer, as each process requires 2 CPUs, one ‘myProcess’ instance is executed at a time until the 100 tasks are completed; (ii) on a 64-core server, 32 ‘myProcess’ instances are executed in parallel; (iii) on a cluster, 100 ‘myProcess’ instances are scheduled and the cluster resource management system decides how to execute them; and (iv) on a single-core computer, execution fails owing to lack of resources. Thus, the pipeline runs independently of the underlying architecture. The task defined in line 18 depends on all the outputs from tasks in line 8 (‘mainOut <- outs’).
Fig. 1.
BDS report showing the pipeline’s task execution timeline
Fig. 2.
Execution example. (A) Script ‘pipeline.bds’. (B) The script is executed from a terminal. The Go executable invokes the main BDS program, written in Java, which performs lexing, parsing and compilation to an AST, and then runs the AST. (C) When the task statement is run, appropriate checks are performed. (D) A shell script ‘task1.sh’ is created, and a bds-exec process is started. (E) bds-exec reports its PID and executes the script ‘task1.sh’ while capturing stdout and stderr and monitoring timeouts and OS signals. When a process finishes execution, its exit status is logged
Fig. 3.
Flow chart of a whole-genome sequencing analysis pipeline, showing how computations are split across many nodes
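
The Listing 2 caption describes how the same pipeline behaves on machines of different sizes. The arithmetic behind cases (i), (ii) and (iv) can be sketched in Python as follows. The function name `max_parallel_tasks` is hypothetical, and the sketch only covers a single machine; on a cluster, as in case (iii), the decision is left to the resource management system and is not modelled here.

```python
def max_parallel_tasks(total_cores, cpus_per_task, pending_tasks):
    """Number of task instances that can run at once on a single machine.

    Each task reserves `cpus_per_task` cores, so the machine offers
    total_cores // cpus_per_task slots. If there is not even one slot,
    the task can never start, which is reported as an error.
    """
    if cpus_per_task <= 0:
        raise ValueError("cpus_per_task must be positive")
    slots = total_cores // cpus_per_task
    if slots == 0:
        raise RuntimeError(
            f"task needs {cpus_per_task} CPUs but only {total_cores} available"
        )
    # Never start more instances than there are tasks left to run.
    return min(slots, pending_tasks)
```

For the 100 two-CPU tasks in the caption, this gives one instance at a time on a dual-core computer and 32 on a 64-core server, and it raises an error on a single-core computer.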
