Objective: Non-coding RNAs (ncRNAs) are involved in many important biological processes and are relevant as disease biomarkers and therapeutic agents. However, information about them remains scattered, mostly across scientific articles, so aggregating and normalizing the existing information is essential. Text mining based on Natural Language Processing (NLP) can generate relational corpora, and Large Language Models (LLMs) show strong general task-solving capabilities without fine-tuning on large datasets. The aim of this work was to develop a methodology to extract ncRNA-phenotype relations from scientific articles, using a combination of NLP and LLMs.
Methods: We developed an NLP pipeline to aggregate and normalize data from five ncRNA-disease databases. The resulting dataset was used to generate a corpus of ncRNA-phenotype relations from scientific articles through Distant Supervision Relation Extraction (DSRE). Finally, we used LLMs to develop a Relation Extraction (RE) methodology and evaluated its performance on the validated subset of the corpus.
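The distant supervision step can be illustrated with a minimal sketch. It assumes a toy knowledge base and simple lexicon matching; the names `KNOWN_RELATIONS`, `find_mentions` and `distant_label` are hypothetical and do not come from the pipeline described here, which is likely to rely on more robust entity recognition and normalization.

```python
import re
from itertools import product

# Hypothetical toy knowledge base of known (ncRNA, phenotype) relations.
# In practice this would be the aggregated, normalized relation dataset.
KNOWN_RELATIONS = {
    ("MALAT1", "lung cancer"),
    ("miR-21", "breast cancer"),
}

# Hypothetical entity lexicons; HOTAIR is added so that one entity
# has no known relation in the toy knowledge base.
NCRNAS = {n for n, _ in KNOWN_RELATIONS} | {"HOTAIR"}
PHENOTYPES = {p for _, p in KNOWN_RELATIONS}


def find_mentions(sentence, lexicon):
    """Return the lexicon terms found in the sentence (case-insensitive)."""
    found = []
    for term in sorted(lexicon):
        # Reject matches embedded in a longer word or hyphenated token.
        pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.append(term)
    return found


def distant_label(sentence):
    """Label each co-occurring (ncRNA, phenotype) pair via the knowledge base.

    A pair is marked True when the knowledge base already lists it,
    which is the core assumption of distant supervision.
    """
    ncrnas = find_mentions(sentence, NCRNAS)
    phenotypes = find_mentions(sentence, PHENOTYPES)
    return [
        (n, p, (n, p) in KNOWN_RELATIONS)
        for n, p in product(ncrnas, phenotypes)
    ]


labels = distant_label("MALAT1 is overexpressed in lung cancer tissue.")
```

Labels obtained this way are noisy, because co-occurrence of a known pair in a sentence does not guarantee that the sentence expresses the relation. This is why a validated subset is needed for evaluation.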
Results: A high-fidelity ncRNA-phenotype relation dataset, consisting of 214,300 relations, was created by aggregating and normalizing five comprehensive ncRNA-disease databases. We generated an ncRNA-phenotype relational corpus (ncoRP) using DSRE, containing 21,608 annotated articles, 2,835 unique ncRNAs, 1,118 unique phenotypes and 35,295 unique relations, with a precision of 0.761 and an F1-score of 0.593. We developed and applied an LLM RE methodology that achieved an F1-score of 0.978 by combining the RE task with a preceding sentence filtering task and applying prompting principles such as in-context learning (ICL) and Chain-of-Thought self-explanation.
Conclusions: We created a high-fidelity, normalized ncRNA-phenotype relation dataset from five ncRNA annotation databases, together with a relational corpus containing a large number of annotated sentences that express ncRNA-phenotype relations. The methodology developed in this work, which combines LLMs with DSRE, achieved a high F1-score in the RE task, showing promise for the automatic extraction of ncRNA-phenotype relations from scientific articles. We expect the dataset and corpus to be useful for developing future NLP models and tools for the study of ncRNAs, and our methodology to be broadly applicable to similar tasks, such as the extraction of gene-disease relations.
Keywords: Distant Supervision; Large Language Models; Non-coding RNAs; Relation Extraction; Text mining.
Copyright © 2026 Elsevier Inc. All rights reserved.