Phrase mining of textual data to analyze extracellular matrix protein patterns across cardiovascular disease

Am J Physiol Heart Circ Physiol. 2018 Oct 1;315(4):H910-H924. doi: 10.1152/ajpheart.00175.2018. Epub 2018 May 18.

Abstract

Extracellular matrix (ECM) proteins have been shown to play important roles in regulating multiple biological processes in an array of organ systems, including the cardiovascular system. Using a novel bioinformatics text-mining tool, we studied six categories of cardiovascular disease (CVD), namely, ischemic heart disease, cardiomyopathies, cerebrovascular accident, congenital heart disease, arrhythmias, and valve disease, to uncover novel ECM protein-disease and protein-protein relationships hidden within vast quantities of textual data. We conducted a phrase-mining analysis, delineating the relationships of 709 ECM proteins with the 6 groups of CVDs reported in 1,099,254 abstracts. The technology pipeline known as Context-Aware Semantic Online Analytical Processing was applied to semantically rank the association of proteins with each CVD and with all six CVDs, quantifying each protein-disease relationship. We performed principal component analysis and hierarchical clustering of the data, where each protein was represented as a six-dimensional vector. We found that ECM proteins display variable degrees of association with the six CVDs; certain CVDs share groups of associated proteins, whereas others have divergent protein associations. We identified 82 ECM proteins sharing associations with all 6 CVDs. Our bioinformatics analysis linked this subset of proteins to distinct ECM pathways (via Reactome), namely, insulin-like growth factor regulation and interleukin-4 and interleukin-13 signaling, suggesting their contribution to the pathogenesis of all six CVDs. Finally, we performed hierarchical clustering analysis and identified protein clusters predominantly associated with a targeted CVD; analyses of these proteins revealed unexpected insights into the key ECM-related molecular pathogenesis of each CVD, including virus assembly and release in arrhythmias.
NEW & NOTEWORTHY The present study is the first application of a text-mining algorithm to characterize the relationships of 709 extracellular matrix-related proteins with 6 categories of cardiovascular disease described in 1,099,254 abstracts. Our analysis revealed unexpected extracellular matrix functions, pathways, and molecular relationships implicated in the six cardiovascular diseases.
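
The abstract describes treating each protein as a six-dimensional vector of disease-association scores, then applying principal component analysis and selecting proteins associated with all six CVDs. A minimal Python sketch of that idea follows. It is not the authors' pipeline: the score matrix, protein names, disease abbreviations, the `threshold` rule, and the helper names `pca_project` and `shared_across_all` are all hypothetical, chosen only to illustrate the shape of the computation.

```python
import numpy as np

# Hypothetical abbreviations for the six CVD categories named in the abstract.
DISEASES = ["IHD", "CM", "CVA", "CHD", "ARR", "VD"]


def pca_project(scores, n_components=2):
    """Project each row (one protein) onto the top principal components."""
    centered = scores - scores.mean(axis=0)
    # Rows of vt are the principal axes; singular values come sorted descending.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)  # fraction of variance per component
    return centered @ vt[:n_components].T, explained[:n_components]


def shared_across_all(scores, names, threshold=0.0):
    """Names of proteins whose score exceeds `threshold` for every disease.

    The threshold rule is an assumption made for this sketch; the abstract
    does not state how "association" was defined.
    """
    mask = (scores > threshold).all(axis=1)
    return [name for name, keep in zip(names, mask) if keep]


# Made-up scores for illustration only: rows are proteins, columns follow DISEASES.
names = ["P1", "P2", "P3", "P4"]
scores = np.array([
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
    [0.0, 0.1, 0.0, 0.9, 0.0, 0.0],
    [0.2, 0.3, 0.1, 0.2, 0.4, 0.1],
    [0.7, 0.0, 0.6, 0.0, 0.0, 0.2],
])

coords, explained = pca_project(scores)
shared = shared_across_all(scores, names)
```

Here `coords` holds one two-dimensional point per protein, suitable for a scatter plot, and `shared` lists the proteins with a nonzero score in every column. For the hierarchical clustering step, the same `scores` matrix could be passed to `scipy.cluster.hierarchy.linkage`; the linkage method and distance metric used in the study are not given in the abstract.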

Keywords: big data; machine learning; relationship discovery; text mining.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Big Data
  • Biomarkers / metabolism
  • Cardiovascular Diseases / metabolism*
  • Data Mining / methods*
  • Databases, Factual
  • Extracellular Matrix / metabolism*
  • Extracellular Matrix Proteins / metabolism*
  • Humans
  • Machine Learning*
  • Pattern Recognition, Automated / methods*
  • Principal Component Analysis
  • Protein Interaction Maps

Substances

  • Biomarkers
  • Extracellular Matrix Proteins