BigDataScript: a scripting language for data pipelines
- PMID: 25189778
- PMCID: PMC4271142
- DOI: 10.1093/bioinformatics/btu595
Abstract
Motivation: The analysis of large biological datasets often requires complex processing pipelines that run for a long time on large computational infrastructures. We designed and implemented a simple script-like programming language with a clean and minimalist syntax for developing pipelines and managing their execution, providing portability as well as robustness to various types of software and hardware failures.
Results: We introduce the BigDataScript (BDS) programming language for data processing pipelines, which improves abstraction from hardware resources and assists with robustness. Hardware abstraction allows BDS pipelines to run without modification on a wide range of computer architectures, from a small laptop to multi-core servers, server farms, clusters and clouds. BDS achieves robustness by incorporating the concepts of absolute serialization and lazy processing, thus allowing pipelines to recover from errors. By abstracting pipeline concepts at programming language level, BDS simplifies implementation, execution and management of complex bioinformatics pipelines, resulting in reduced development and debugging cycles as well as cleaner code.
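The lazy processing idea mentioned above can be illustrated with a minimal sketch. This is not BDS code and not the authors' implementation; it is a Python approximation that assumes a make-style rule in which a task runs only when an output file is missing or older than one of its inputs. The function names `needs_update` and `lazy_task` are hypothetical.

```python
import os
import tempfile


def needs_update(outputs, inputs):
    """Return True if any output is missing or older than the newest input.

    Assumption: file modification times are used as the staleness signal,
    in the style of make. This is an illustration, not the BDS rule itself.
    """
    if not all(os.path.exists(path) for path in outputs):
        return True
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    newest_input = max(
        (os.path.getmtime(path) for path in inputs),
        default=float("-inf"),
    )
    return oldest_output < newest_input


def lazy_task(outputs, inputs, action):
    """Run `action` only when the outputs are stale; return True if it ran."""
    if needs_update(outputs, inputs):
        action()
        return True
    return False


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    source = os.path.join(workdir, "reads.txt")      # hypothetical input
    result = os.path.join(workdir, "reads.out.txt")  # hypothetical output

    with open(source, "w") as handle:
        handle.write("ACGT\n")

    def process():
        # Stand-in for a long-running pipeline step.
        with open(source) as src, open(result, "w") as dst:
            dst.write(src.read().lower())

    print(lazy_task([result], [source], process))  # first call runs the step
    print(lazy_task([result], [source], process))  # second call skips it
```

Under this rule, re-running a pipeline after a failure repeats only the steps whose outputs are missing or out of date, which is one way a pipeline can resume from the point of failure.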
Availability and implementation: BigDataScript is available under an open-source license at http://pcingola.github.io/BigDataScript.
© The Author 2014. Published by Oxford University Press.
