Using citation networks to evaluate the impact of text length on keyword extraction

PLoS One. 2023 Nov 27;18(11):e0294500. doi: 10.1371/journal.pone.0294500. eCollection 2023.

Abstract

The identification of key concepts within unstructured data is of paramount importance in practical applications. Despite the abundance of proposed methods for extracting primary topics, only a few works have investigated the influence of text length on the performance of keyword extraction (KE) methods. Specifically, many studies rely on abstracts and titles for content extraction from papers, leaving it uncertain whether leveraging the complete content of papers yields consistent results. Hence, in this study, we employ a network-based approach to evaluate the concordance between keywords extracted from abstracts and those extracted from entire papers. Community detection methods are used to identify groups of interconnected papers in citation networks. The resulting paper clusters are then used to identify salient terms within each cluster, employing a methodology akin to the term frequency-inverse document frequency (tf-idf) approach. Once each cluster has been assigned its distinctive set of key terms, these terms serve as candidate keywords at the paper level: the top-ranked words at the cluster level that also appear in the abstract are chosen as keywords for the paper. Our findings indicate that the various community detection methods used in KE yield similar levels of accuracy. Notably, text clustering approaches outperform all citation-based methods, although all approaches yield relatively low accuracy values. We also identified a lack of concordance between keywords extracted from the abstracts and those extracted from the corresponding full-text source. Considering that citations and text clustering yield distinct outcomes, combining them in hybrid approaches could offer improved performance.
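The cluster-then-select procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract only says the scoring is "akin to" tf-idf, so the formula used here (term count within a cluster multiplied by the log of the inverse cluster frequency) is an assumption, as are the function names `tokenize`, `cluster_term_scores`, and `paper_keywords`, the parameters `top_k` and `max_keywords`, and the toy texts. The cluster assignments are taken as given; in the study they would come from community detection on the citation network.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cluster_term_scores(cluster_texts):
    """Score terms per cluster with a tf-idf-like weight (assumed formula).

    cluster_texts maps a cluster id to the list of texts of its papers.
    Each cluster is treated as one large document: tf is the term count
    inside the cluster, idf is log(#clusters / #clusters containing term).
    """
    tf = {
        cid: Counter(tok for text in texts for tok in tokenize(text))
        for cid, texts in cluster_texts.items()
    }
    n_clusters = len(tf)
    # Number of clusters in which each term occurs at least once.
    df = Counter(term for counts in tf.values() for term in counts)
    return {
        cid: {t: c * math.log(n_clusters / df[t]) for t, c in counts.items()}
        for cid, counts in tf.items()
    }


def paper_keywords(abstract, scores, top_k=10, max_keywords=5):
    """Keep the top-ranked cluster terms that also occur in the abstract."""
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
    abstract_terms = set(tokenize(abstract))
    return [t for t in ranked if t in abstract_terms][:max_keywords]


# Toy data (hypothetical): two clusters of paper texts.
clusters = {
    "A": ["graph community detection in citation graph",
          "community structure of graph data"],
    "B": ["neural language model for text",
          "language model pretraining on text data"],
}
scores = cluster_term_scores(clusters)
keywords = paper_keywords(
    "We study community detection on a large graph", scores["A"], top_k=3
)
print(keywords)  # ['graph', 'community']
```

In this toy run, "data" occurs in both clusters and therefore receives a score of zero, while "graph" and "community" rank highest in cluster A and also appear in the abstract, so they are returned as the paper's keywords. Passing full-text token sets instead of abstracts to `paper_keywords` would give the second keyword set needed to compare abstract-based and full-text-based extraction.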

Grants and funding

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. Thiago C. Silva (Grant no. 308171/2019-5, 408546/2018-2) gratefully acknowledges financial support from the CNPq foundation. Diego R. Amancio acknowledges financial support from the São Paulo Research Foundation (FAPESP Grant no. 2020/06271-0) and CNPq-Brazil (Grant no. 304026/2018-2 and 311074/2021-9). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.