Objective: We aim to build an informatics methodology capable of identifying statistically significant associations between the clinical findings of non-small cell lung cancer (NSCLC) recorded in patient pathology reports and clinically actionable genetic mutations identified by next-generation sequencing (NGS) of patient tumor samples.
Methods: We built an information extraction and analysis pipeline to identify associations between clinical findings in patient pathology reports and the corresponding genetic mutations. Our pipeline leverages natural language processing (NLP) techniques, large biomedical terminologies, semantic similarity measures, and clustering methods to extract clinical concepts from the free text of patient pathology reports and group them into salient findings.
Results: We applied our methodology to the lobectomy surgical pathology reports of 142 NSCLC patients who underwent NGS testing and had mutations in 4 oncogenes with clinical ramifications for NSCLC treatment (EGFR, KRAS, BRAF, and PIK3CA). Our approach identified 732 distinct positive clinical concepts in these reports and highlighted multiple findings with statistically significant associations (P-value ≤ 0.05) to mutations in specific genes. Our assessment showed that these associations are consistent with the published literature.
Conclusions: This study provides an automated pipeline for finding statistically significant associations between clinical findings in the unstructured text of patient pathology reports and genetic mutations. This approach is generalizable to other types of pathology and clinical reports in various disorders and can provide a first step toward understanding the role of genetic mutations in the development and treatment of different types of cancer.
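The Results report associations at a P-value threshold of 0.05, but the abstract does not name the statistical test used. As a minimal sketch, assuming a two-sided Fisher's exact test on a 2×2 table of finding presence versus mutation status, the check for a single finding and a single gene might look like the following. The function name and the example counts are hypothetical and are not taken from the study.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout (illustrative only):
        a = finding present, gene mutated    b = finding present, gene not mutated
        c = finding absent,  gene mutated    d = finding absent,  gene not mutated
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    # The small relative tolerance guards against floating-point ties.
    p_value = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical counts for one finding and one gene (not study data).
p = fisher_exact_two_sided(8, 2, 1, 5)
print(f"P-value = {p:.4f}; significant at 0.05: {p <= 0.05}")
```

In a setting with hundreds of candidate findings and several genes, many such tests would be run; whether and how the authors corrected for multiple comparisons is not stated in the abstract, so this sketch leaves that step out.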