Fair molecular feature selection unveils universally tumor lineage-informative methylation sites in colorectal cancer

Bioinformatics. 2025 Jul 1;41(Supplement_1):i150-i159. doi: 10.1093/bioinformatics/btaf237.

Abstract

Motivation: In the era of precision medicine, performing comparative analysis over diverse patient populations is a fundamental step toward tailoring healthcare interventions. However, the aspect of fairly selecting molecular features across multiple patients is often overlooked.

Results: To address this challenge, we introduce FALAFL (FAir muLti-sAmple Feature seLection), an algorithmic approach based on combinatorial optimization. FALAFL is designed to perform feature selection on sequencing data while ensuring a balanced selection of features across all patient samples in a cohort. We have applied FALAFL to the problem of selecting lineage-informative CpG sites within a cohort of colorectal cancer patients subjected to low-coverage single-cell methylation sequencing. Our results demonstrate that FALAFL can rapidly and robustly determine the optimal set of CpG sites, each of which is well covered by cells across the vast majority of the patients, while ensuring that in each patient a large proportion of these sites have high read coverage. An analysis of the FALAFL-selected sites reveals that their tumor lineage-informativeness is strongly correlated across a spectrum of diverse patient profiles. Furthermore, these universally lineage-informative sites are highly enriched in the inter-CpG island regions. We hope that FALAFL will aid in designing panels for diagnostic and prognostic purposes and help propel fair data science practices in the exploration of complex diseases.
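
To make the two coverage requirements described above concrete, the following is a minimal Python sketch of a simple greedy heuristic. It is not the FALAFL algorithm, which the abstract describes only as a combinatorial optimization approach; the function name fair_select, the thresholds site_frac and patient_frac, and the 0/1 input matrix are all hypothetical and chosen for illustration. The sketch keeps sites that are well covered in at least site_frac of patients, then drops sites until every patient has at least patient_frac of the remaining sites covered.

```python
def fair_select(covered, site_frac=0.8, patient_frac=0.6):
    """Greedy illustration only; NOT the published FALAFL method.

    covered[i][j] is 1 if site j is well covered in patient i, else 0
    (a hypothetical, pre-thresholded input).
    Returns the indices of the selected sites.
    """
    n_patients = len(covered)
    n_sites = len(covered[0])

    # Site-level requirement: keep sites covered in enough patients.
    keep = [j for j in range(n_sites)
            if sum(row[j] for row in covered) >= site_frac * n_patients]

    # Patient-level requirement: each patient must have enough of the
    # kept sites covered. Drop one site per round until this holds.
    while keep:
        lacking = [row for row in covered
                   if sum(row[j] for j in keep) < patient_frac * len(keep)]
        if not lacking:
            break
        # Remove the kept site that is missing in the most lacking patients.
        worst = max(keep, key=lambda j: sum(1 - row[j] for row in lacking))
        keep.remove(worst)
    return keep


# Toy cohort: 3 patients (rows) by 4 sites (columns).
toy = [
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
selected = fair_select(toy, site_frac=0.6, patient_frac=0.6)
```

A greedy pass like this gives no optimality guarantee. An exact formulation would state the same two requirements as constraints and maximize the number of selected sites.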

Availability and implementation: The source code is available at: https://github.com/algo-cancer/FALAFL.

MeSH terms

  • Algorithms
  • Colorectal Neoplasms* / genetics
  • Computational Biology / methods
  • CpG Islands
  • DNA Methylation*
  • Humans
  • Sequence Analysis, DNA / methods