Sustainable data and metadata management at the BD2K-LINCS Data Coordination and Integration Center

Vasileios Stathias; Amar Koleti; Dušica Vidović; Daniel J Cooper; Kathleen M Jagodnik; Raymond Terryn; Michele Forlin; Caty Chung; Denis Torre; Nagi Ayad; Mario Medvedovic; Avi Ma'ayan; Ajay Pillai; Stephan C Schürer

doi:10.1038/sdata.2018.117

Sci Data. 2018 Jun 19:5:180117. doi: 10.1038/sdata.2018.117.

Authors

Vasileios Stathias¹ ² ³, Amar Koleti¹ ⁴, Dušica Vidović¹ ³ ⁴, Daniel J Cooper¹ ³, Kathleen M Jagodnik¹ ⁵, Raymond Terryn¹ ³ ⁴, Michele Forlin¹ ³ ⁴, Caty Chung¹ ⁴, Denis Torre¹ ⁵, Nagi Ayad⁶, Mario Medvedovic¹ ⁷, Avi Ma'ayan¹ ⁵, Ajay Pillai⁸, Stephan C Schürer¹ ³ ⁴

Affiliations

¹ BD2K-LINCS Data Coordination and Integration Center, University of Miami, Miami, FL 33136, USA.
² Department of Human Genetics and Genomics, Miller School of Medicine, University of Miami, Miami, FL 33136, USA.
³ Department of Molecular and Cellular Pharmacology, Miller School of Medicine, University of Miami, Miami, FL 33136, USA.
⁴ Center for Computational Science, University of Miami, Miami, FL 33146, USA.
⁵ Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
⁶ Department of Psychiatry and Behavioral Sciences, University of Miami, Miami, FL 33136, USA.
⁷ Division of Biostatistics and Bioinformatics, Department of Environmental Health, University of Cincinnati, Cincinnati, OH 45221, USA.
⁸ Division of Genome Sciences, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20891, USA.

Abstract

The NIH-funded LINCS Consortium is creating an extensive reference library of cell-based perturbation response signatures and sophisticated informatics tools incorporating a large number of perturbagens, model systems, and assays. To date, more than 350 datasets have been generated, including transcriptomics, proteomics, epigenomics, cell phenotype, and competitive binding profiling assays. The large volume and variety of data necessitate rigorous data standards and effective data management, including modular data processing pipelines and end-user interfaces, to facilitate accurate and reliable data exchange, curation, validation, standardization, aggregation, integration, and end-user access. Deep metadata annotations and the use of qualified data standards enable integration with many external resources. Here we describe the end-to-end data processing and management at the DCIC to generate a high-quality and persistent product. Our data management and stewardship solutions enable a functioning Consortium and make LINCS a valuable scientific resource that aligns with big data initiatives such as the BD2K NIH Program and is consistent with emerging data science best practices, including the findable, accessible, interoperable, and reusable (FAIR) principles.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Animals
Data Curation*
Datasets as Topic
Humans
Information Storage and Retrieval
Metadata*
National Institutes of Health (U.S.)
United States
