Contextual Phenotyping of Pediatric Sepsis Cohort Using Large Language Models

AMIA Annu Symp Proc. 2025 May 22:2024:929-938. eCollection 2024.

Abstract

Clustering patients into subgroups is essential for personalized care and efficient use of resources. Traditional clustering methods struggle with high-dimensional, heterogeneous healthcare data and lack contextual understanding. This study evaluates large language model (LLM)-based clustering against classical methods using a pediatric sepsis dataset from a low-income country (LIC), containing 2,686 records with 28 numerical and 119 categorical variables. Patient records were serialized into text with and without a clustering objective. Embeddings were generated using quantized LLAMA 3.1 8B, DeepSeek-R1-Distill-Llama-8B with low-rank adaptation (LoRA), and Stella-En-400M-V5 models. K-means clustering was applied to these embeddings. Classical comparisons included K-Medoids clustering on UMAP- and FAMD-reduced mixed data. Silhouette scores and statistical tests were used to evaluate the quality and distinctiveness of the clusters. Stella-En-400M-V5 achieved the highest Silhouette Score (0.86). LLAMA 3.1 8B with the clustering objective performed better with a higher number of clusters, identifying subgroups with distinct nutritional, clinical, and socioeconomic profiles. LLM-based methods outperformed classical techniques by capturing richer context and prioritizing key features. These results highlight the potential of LLMs for contextual phenotyping and informed decision-making in resource-limited settings.
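To make the serialization and evaluation steps concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the serialization template, so the field names (`age_months`, `weight_kg`, `hiv_status`), the "name is value" phrasing, and the objective sentence are all hypothetical. The embedding and K-means steps are omitted; toy 2-D vectors stand in for LLM embeddings. The silhouette function follows the standard definition with Euclidean distance.

```python
import math


def serialize_record(record, objective=None):
    """Turn one patient record (a dict) into a single line of text.

    The 'name is value' template is an assumption for illustration only.
    If `objective` is given, it is prepended as a clustering instruction.
    """
    text = "; ".join(f"{name} is {value}" for name, value in record.items()) + "."
    return f"{objective} {text}" if objective else text


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical record; the real dataset has 28 numerical and
    # 119 categorical variables.
    record = {"age_months": 14, "weight_kg": 8.2, "hiv_status": "negative"}
    print(serialize_record(record))
    print(serialize_record(record, "Group patients with similar clinical profiles."))

    # Toy stand-ins for embeddings: two tight, well-separated groups.
    toy_embeddings = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    print(silhouette_score(toy_embeddings, [0, 0, 1, 1]))
```

Scores near 1 indicate compact, well-separated clusters, while scores near 0 or below indicate overlapping or misassigned points. In practice a library routine such as scikit-learn's `silhouette_score` would normally replace the hand-written function.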

MeSH terms

  • Child
  • Cluster Analysis
  • Cohort Studies
  • Humans
  • Large Language Models
  • Natural Language Processing*
  • Phenotype
  • Sepsis* / classification