The relative abundance of cancer-associated fibroblast (CAF) subtypes influences a tumor's response to treatment, especially immunotherapy. However, the gene expression signatures associated with these CAF subtypes have yet to realize their potential as clinical biomarkers. Here, we describe an interpretable machine learning approach, additive multiple instance learning (aMIL), to predict bulk gene expression signatures from hematoxylin and eosin-stained whole-slide images, focusing on an immunosuppressive LRRC15+ CAF-enriched TGFβ-CAF signature. aMIL models accurately predicted TGFβ-CAF across various cancer types. Tissue regions contributing most highly to slide-level predictions of TGFβ-CAF were evaluated by machine learning models characterizing spatial distributions of diverse cell and tissue types, stromal subtypes, and nuclear morphology. In breast cancer, regions contributing most to TGFβ-CAF-high predictions ("excitatory") were localized to cancer stroma with high fibroblast density and mature collagen fibers. Regions contributing most to TGFβ-CAF-low predictions ("inhibitory") were localized to cancer epithelium and densely inflamed stroma. Fibroblast and lymphocyte nuclear morphology also differed between excitatory and inhibitory regions. Thus, aMIL enables a data-driven link between histologic features and transcription, offering biological interpretability beyond typical black-box models.
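The additive structure that makes such region-level attribution possible can be illustrated with a minimal sketch. It assumes a generic additive MIL formulation in which each image patch receives an attention weight and a scalar contribution, and the slide-level prediction is exactly the sum of those contributions. The linear scoring head, the feature dimensions, the random features, and all names (`additive_mil_predict`, `attn_weights`, `head_weights`) are hypothetical and chosen only for illustration; they are not the architecture or parameters used in this work.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def additive_mil_predict(patch_features, attn_weights, head_weights, bias=0.0):
    """Return (slide_prediction, per_patch_contributions).

    Assumption: a linear attention scorer and a linear prediction head
    stand in for the learned networks of a real model.
    """
    # Attention over patches: one non-negative weight per patch, summing to 1.
    attention = softmax([dot(x, attn_weights) for x in patch_features])
    # Each patch is scored on its own, so its contribution is a single number.
    contributions = [
        a * dot(x, head_weights) for a, x in zip(attention, patch_features)
    ]
    # The slide-level output is the bias plus the sum of patch contributions.
    return bias + sum(contributions), contributions


# Toy slide: 6 patches with 4-dimensional features (synthetic, not real data).
random.seed(0)
patches = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]
attn_w = [0.5, -0.2, 0.1, 0.3]   # hypothetical attention parameters
head_w = [1.0, 0.5, -0.7, 0.2]   # hypothetical prediction-head parameters

prediction, contributions = additive_mil_predict(patches, attn_w, head_w)

# Patches pushing the prediction up or down, by analogy with the
# "excitatory" and "inhibitory" regions described above.
excitatory = [i for i, c in enumerate(contributions) if c > 0]
inhibitory = [i for i, c in enumerate(contributions) if c < 0]
```

Because the prediction decomposes exactly into per-patch terms, the sign and magnitude of each term can be mapped back onto the slide, which is the property that allows high-contributing regions to be characterized histologically.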
Keywords: biomarkers; computational pathology; gene expression signatures; machine learning.
Copyright © 2025 The Authors. Published by Elsevier Inc. All rights reserved.