BMC Res Notes. 2020 Apr 1;13(1):195.
doi: 10.1186/s13104-020-05038-w.

Automated Gene Data Integration With Databio

Free PMC article

Robert W Reid et al. BMC Res Notes.

Objective: Although sequencing and other high-throughput data production technologies are increasingly affordable, data analysis and interpretation remain a significant factor in the cost of -omics studies. Despite broad acceptance of the findable, accessible, interoperable, and reusable (FAIR) data principles, which focus on data discoverability and annotation, data integration remains a significant bottleneck in linking prior work to novel research. Discovering relevant and timely information is difficult in increasingly multidisciplinary projects, where scientists cannot easily keep up with work across multiple fields. Computational tools are needed to describe data contents accurately and to enable linkage to existing resources without requiring prior knowledge of the individual databases.

Results: We developed Databio, a web-accessible tool, to automate data parsing and identifier detection and to streamline common tasks, providing a point-and-click approach to data manipulation and integration in life sciences research and translational medicine. Databio uses fast real-time data structures and a data warehouse of 137 million identifiers, together with automated heuristics, to describe data provenance without requiring highly specialized knowledge or bioinformatics training.
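
To make the idea of automated identifier detection concrete, the following is a minimal sketch of how a column of values might be classified by identifier type. It is not Databio's method: the abstract describes a warehouse of known identifiers and unspecified heuristics, whereas this sketch swaps in simple regular-expression matching with a majority vote. The pattern set, the function names `classify` and `detect_column_type`, and the `min_fraction` threshold are all assumptions made for illustration.

```python
import re
from collections import Counter

# Illustrative patterns only; these are NOT Databio's actual heuristics.
# Order matters: the first matching pattern wins.
PATTERNS = {
    "ensembl_gene": re.compile(r"^ENSG\d{11}(\.\d+)?$"),  # human Ensembl gene ID
    "refseq_mrna": re.compile(r"^NM_\d+(\.\d+)?$"),       # RefSeq mRNA accession
    "entrez_gene": re.compile(r"^\d+$"),                  # plain integer gene ID
}


def classify(value):
    """Return the name of the first pattern that matches value, or None."""
    text = str(value).strip()
    for name, pattern in PATTERNS.items():
        if pattern.match(text):
            return name
    return None


def detect_column_type(values, min_fraction=0.8):
    """Guess a column's identifier type by majority vote.

    Returns (type_name, fraction), where fraction is the share of non-empty
    values matching the most common type. type_name is None when that share
    is below min_fraction (a hypothetical cutoff).
    """
    non_empty = [v for v in values if str(v).strip()]
    if not non_empty:
        return None, 0.0
    counts = Counter(classify(v) for v in non_empty)
    counts.pop(None, None)  # ignore values that matched no pattern
    if not counts:
        return None, 0.0
    name, hits = counts.most_common(1)[0]
    fraction = hits / len(non_empty)
    return (name if fraction >= min_fraction else None), fraction
```

A pattern-only approach cannot confirm that an identifier actually exists, which is presumably where a lookup against a warehouse of known identifiers would add value.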

Keywords: Data integration; Knowledge discovery; Workflow automation.

Conflict of interest statement

The authors declare that they have no competing interests.


Fig. 1
Databio web interface workflow showing data upload (including Excel formatting, headers, and merged fields). Point-and-click field mapping allows selection of source and replacement gene identifiers. Results are then exported with the new identifiers. Statistics, bibliography, and provenance files are included in the download archive but are not shown.
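
The upload, map, and export steps in the workflow above can be sketched in a few lines of code. This is an illustration under stated assumptions, not Databio's implementation: the function `replace_identifiers`, the in-memory lookup `SYMBOL_TO_ENSEMBL`, and the shape of the returned statistics are all hypothetical, and a real run would draw its mappings from a much larger identifier warehouse rather than a hand-written dictionary.

```python
import csv
import io

# Tiny hand-written lookup for illustration only (gene symbol -> Ensembl gene ID).
SYMBOL_TO_ENSEMBL = {
    "TP53": "ENSG00000141510",
    "BRCA1": "ENSG00000012048",
    "BRCA2": "ENSG00000139618",
}


def replace_identifiers(csv_text, column, mapping):
    """Replace values in `column` using `mapping`.

    Returns (new_csv_text, stats). Values with no mapping are left unchanged
    and counted as unmapped, so no input rows are dropped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    stats = {"mapped": 0, "unmapped": 0}
    for row in reader:
        key = row[column].strip()
        if key in mapping:
            row[column] = mapping[key]
            stats["mapped"] += 1
        else:
            stats["unmapped"] += 1  # keep the original value untouched
        writer.writerow(row)
    return out.getvalue(), stats


if __name__ == "__main__":
    table = "gene,score\nTP53,1.5\nFOO1,0.2\nBRCA1,3.0\n"
    converted, summary = replace_identifiers(table, "gene", SYMBOL_TO_ENSEMBL)
    print(converted)
    print(summary)
```

Keeping unmapped rows in the output and reporting their count mirrors the caption's mention of statistics files: the user can see how much of the table was converted rather than silently losing data.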
