Phrase mining of textual data to analyze extracellular matrix protein patterns across cardiovascular disease

Am J Physiol Heart Circ Physiol. 2018 Oct 1;315(4):H910-H924. doi: 10.1152/ajpheart.00175.2018. Epub 2018 May 18.


Extracellular matrix (ECM) proteins have been shown to play important roles in regulating multiple biological processes in an array of organ systems, including the cardiovascular system. Using a novel bioinformatics text-mining tool, we studied six categories of cardiovascular disease (CVD), namely, ischemic heart disease, cardiomyopathies, cerebrovascular accident, congenital heart disease, arrhythmias, and valve disease, anticipating novel ECM protein-disease and protein-protein relationships hidden within vast quantities of textual data. We conducted a phrase-mining analysis, delineating the relationships of 709 ECM proteins with the 6 groups of CVDs reported in 1,099,254 abstracts. The technology pipeline known as Context-Aware Semantic Online Analytical Processing was applied to semantically rank the association of proteins with each CVD and with all six CVDs, performing analyses to quantify each protein-disease relationship. We performed principal component analysis and hierarchical clustering of the data, where each protein was visualized as a six-dimensional vector. We found that ECM proteins display variable degrees of association with the six CVDs; certain CVDs share groups of associated proteins, whereas others have divergent protein associations. We identified 82 ECM proteins sharing associations with all 6 CVDs. Our bioinformatics analysis ascribed distinct ECM pathways (via Reactome) from this subset of proteins, namely, insulin-like growth factor regulation and interleukin-4 and interleukin-13 signaling, suggesting their contribution to the pathogenesis of all six CVDs. Finally, we performed hierarchical clustering analysis and identified protein clusters predominantly associated with a targeted CVD; analyses of these proteins revealed unexpected insights underlying the key ECM-related molecular pathogenesis of each CVD, including virus assembly and release in arrhythmias.
NEW & NOTEWORTHY The present study is the first application of a text-mining algorithm to characterize the relationships of 709 extracellular matrix-related proteins with 6 categories of cardiovascular disease described in 1,099,254 abstracts. Our analysis informed unexpected extracellular matrix functions, pathways, and molecular relationships implicated in the six cardiovascular diseases.
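
The abstract describes each protein as a six-dimensional vector (one score per CVD category), from which proteins associated with all six CVDs are identified and clusters are formed. The following is a minimal sketch of that idea, not the authors' pipeline: the protein names, the scores, the category abbreviations, the zero threshold, and the choice of Euclidean distance with average linkage are all assumptions made for illustration, and neither the scoring step nor the principal component analysis is reproduced here.

```python
from itertools import combinations
from math import dist

# Hypothetical abbreviations for the six CVD categories named in the abstract.
CVDS = ["IHD", "CM", "CVA", "CHD", "ARR", "VD"]

# Toy association scores, invented for illustration only (not study data).
# Each hypothetical protein maps to one score per CVD, in the order of CVDS.
scores = {
    "PROT_A": [0.9, 0.8, 0.7, 0.1, 0.2, 0.3],
    "PROT_B": [0.8, 0.9, 0.6, 0.2, 0.1, 0.2],
    "PROT_C": [0.0, 0.1, 0.0, 0.9, 0.8, 0.0],
    "PROT_D": [0.1, 0.0, 0.0, 0.8, 0.9, 0.1],
    "PROT_E": [0.0, 0.0, 0.0, 0.0, 0.0, 0.7],
}


def shared_by_all(scores, threshold=0.0):
    """Return proteins whose score exceeds the threshold for every CVD."""
    return sorted(p for p, v in scores.items() if all(x > threshold for x in v))


def average_linkage(a, b, scores):
    """Mean Euclidean distance between all cross-cluster protein pairs."""
    total = sum(dist(scores[p], scores[q]) for p in a for q in b)
    return total / (len(a) * len(b))


def agglomerate(scores, n_clusters):
    """Bottom-up clustering: merge the two closest clusters until n_clusters remain."""
    clusters = [(name,) for name in sorted(scores)]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]], scores),
        )
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(clusters)


shared = shared_by_all(scores)
clusters = agglomerate(scores, n_clusters=3)
print(shared)
print(clusters)
```

With these toy values, only PROT_A and PROT_B have nonzero scores for all six categories, and the clustering groups proteins with similar score profiles together. In practice a library routine for hierarchical clustering would replace the hand-written loop, which is kept here only to make the merge step explicit.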

Keywords: big data; machine learning; relationship discovery; text mining.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Big Data
  • Biomarkers / metabolism
  • Cardiovascular Diseases / metabolism*
  • Data Mining / methods*
  • Databases, Factual
  • Extracellular Matrix / metabolism*
  • Extracellular Matrix Proteins / metabolism*
  • Humans
  • Machine Learning*
  • Pattern Recognition, Automated / methods*
  • Principal Component Analysis
  • Protein Interaction Maps

Substances

  • Biomarkers
  • Extracellular Matrix Proteins