TRACE: applying AI language models to extract ancestry information from curated biomedical literature

Front Digit Health. 2025 Sep 19:7:1608370. doi: 10.3389/fdgth.2025.1608370. eCollection 2025.

Abstract

Introduction: Ancestry reporting is essential to ensure transparency and proper representation in biomedical studies. However, manually extracting this information from study texts is time-consuming and inefficient. In this paper, we present TRACE (Tool for Researching Ancestry and Cell Extraction), powered by GPT-4 and web-crawling, to automate ancestry identification by detecting cell lines or cultures in texts and tracing their ancestry.

Methods: TRACE extracts cell lines and primary cultures from research articles and follows web sources to determine their ancestry. We compared TRACE's outputs to a manually generated database to confirm its performance in identifying and verifying ancestry information.

Results: The results reveal an overrepresentation of European/White samples and significant underreporting. TRACE enables large-scale, systematic ancestry analysis-a valuable resource for researchers and agencies assessing biases in sample selection.

Conclusions: As an open-source tool, TRACE it facilitates broader use to evaluate and improve ancestry representation in biomedical research.

Keywords: AI language models, open source tool; ancestry representation; automated text mining; biomedical research equity; cell line identification.