Machine learning-based genome-wide interrogation of somatic copy number aberrations in circulating tumor DNA for early detection of hepatocellular carcinoma

EBioMedicine. 2020 Jun;56:102811. doi: 10.1016/j.ebiom.2020.102811. Epub 2020 Jun 5.


Background: DNA released from tumor cells into the blood (circulating tumor DNA, ctDNA) carries tumor-specific genomic aberrations, providing a non-invasive means of cancer detection. In this study, we aimed to leverage somatic copy number aberrations (SCNAs) in ctDNA to develop an assay for detecting early-stage hepatocellular carcinoma (HCC).

Methods: We conducted low-depth whole-genome sequencing (WGS) to profile SCNAs in 384 plasma samples from patients with hepatitis B virus (HBV)-related HCC and from cancer-free HBV patients, using one discovery cohort and two validation cohorts. To capture signals from across the complete genome in the WGS data, we developed a machine learning-based statistical model focused on detection accuracy in early-stage HCC.
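
As an illustration only, the following Python sketch shows one common way that low-depth WGS read counts can be turned into genome-wide copy number signals: per-bin read fractions from a plasma sample are compared with a panel of cancer-free samples to give z-scores. It is not the model developed in this study; the function names, the toy counts, and the summary score are all assumptions made for the example.

```python
from statistics import mean, stdev


def normalize(counts):
    """Scale raw per-bin read counts to fractions of the sample's total reads."""
    total = sum(counts)
    return [c / total for c in counts]


def bin_z_scores(sample_counts, reference_panel):
    """Per-bin z-scores of one sample against a panel of cancer-free samples.

    Hypothetical helper: each row of reference_panel holds per-bin read
    counts for one cancer-free sample, with bins in the same order as
    sample_counts.
    """
    sample = normalize(sample_counts)
    refs = [normalize(r) for r in reference_panel]
    z = []
    for i, value in enumerate(sample):
        ref_values = [r[i] for r in refs]
        mu, sigma = mean(ref_values), stdev(ref_values)
        # Bins with no variation in the panel carry no usable scale.
        z.append((value - mu) / sigma if sigma > 0 else 0.0)
    return z


def genome_wide_score(z):
    """Assumed summary statistic: mean squared z-score over all bins."""
    return sum(v * v for v in z) / len(z)


# Toy data (invented): three cancer-free samples, four genomic bins.
panel = [
    [100, 100, 100, 100],
    [104, 96, 102, 98],
    [96, 104, 98, 102],
]
sample_with_gain = [100, 100, 140, 100]  # excess reads in the third bin
z = bin_z_scores(sample_with_gain, panel)
print([round(v, 2) for v in z], round(genome_wide_score(z), 2))
```

In a real pipeline the bins would typically also be corrected for GC content and mappability, and the resulting features would feed a trained classifier rather than a single summary score.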

Findings: We built the model using a discovery cohort of 209 patients, achieving an overall area under the curve (AUC) of 0.893, with 0.874 for early-stage (Barcelona Clinic Liver Cancer [BCLC] stage 0-A) and 0.933 for advanced-stage (BCLC stage B-D) HCC. The performance of the model was then assessed in two validation cohorts (76 and 99 patients) in which all HCC patients had stage 0-A disease. Our model exhibited robust predictive performance, with AUCs of 0.920 and 0.812 in the two validation cohorts, respectively. Further analyses showed that tumor sample heterogeneity in model training affected the detection of early-stage tumors, and a refined model addressing the heterogeneity in the discovery cohort significantly increased model performance in validation.
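
For readers less familiar with the metric, an AUC can be read as the probability that a randomly chosen HCC sample receives a higher model score than a randomly chosen cancer-free sample. The Python sketch below computes it under that definition, including stage-stratified AUCs in which cases of a given stage group are compared with all controls. The scores, labels, and helper names are hypothetical, and this is not the evaluation code used in the study.

```python
def auc(scores, labels):
    """AUC as the fraction of case/control pairs ranked correctly.

    labels: 1 for HCC, 0 for cancer-free. Tied scores count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_stage(scores, labels, stages, stage_group):
    """AUC for cases whose stage is in stage_group versus all controls."""
    kept = [
        (s, y)
        for s, y, st in zip(scores, labels, stages)
        if y == 0 or st in stage_group
    ]
    return auc([s for s, _ in kept], [y for _, y in kept])


# Invented toy data: four HCC cases with BCLC stages, three controls.
scores = [0.9, 0.6, 0.35, 0.8, 0.3, 0.4, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0]
stages = ["B", "A", "0", "C", None, None, None]

print("overall:", auc(scores, labels))
print("early (0-A):", auc_by_stage(scores, labels, stages, {"0", "A"}))
print("advanced (B-D):", auc_by_stage(scores, labels, stages, {"B", "C", "D"}))
```

The pairwise loop is quadratic in sample count, which is acceptable for cohorts of a few hundred samples; a rank-based formulation would give the same value more efficiently.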

Interpretation: We developed an SCNA-based, machine learning-driven model for the non-invasive detection of early-stage HCC in HBV patients and demonstrated its performance through strict independent validation.

Keywords: Copy number aberration (CNA); Early detection; Hepatocellular carcinoma (HCC); Machine learning.

MeSH terms

  • Adult
  • Area Under Curve
  • Carcinoma, Hepatocellular / diagnosis*
  • Carcinoma, Hepatocellular / genetics
  • Carcinoma, Hepatocellular / pathology
  • Circulating Tumor DNA / genetics*
  • DNA Copy Number Variations*
  • Early Detection of Cancer
  • Female
  • High-Throughput Nucleotide Sequencing
  • Humans
  • Liver Neoplasms / diagnosis*
  • Liver Neoplasms / genetics
  • Liver Neoplasms / pathology
  • Machine Learning
  • Male
  • Middle Aged
  • Neoplasm Staging
  • Whole Genome Sequencing


Substances

  • Circulating Tumor DNA