Detecting anomalous referencing patterns in PubMed papers suggestive of author-centric reference list manipulation

Scientometrics. 2022 Oct;127(10):5753-5771. doi: 10.1007/s11192-022-04503-6. Epub 2022 Sep 8.

Abstract

Although citations are used as a quantifiable, objective metric of academic influence, references could be added to a paper solely to inflate the perceived influence of a body of research. This reference list manipulation (RLM) could take place during the peer-review process, or prior to it. Surveys have estimated how many people may have been affected by coercive RLM at one time or another, but it is not known how many authors engage in RLM, nor to what degree. By examining a subset of active, highly published authors (n = 20,803) in PubMed, we find the frequency of non-self-citations (NSC) to one author coming from a single paper approximates Zipf's law. Author-centric deviations from it are approximately normally distributed, permitting deviations to be quantified statistically. Framed as an anomaly detection problem, statistical confidence increases when an author is an outlier by multiple metrics. Anomalies are not proof of RLM, but authors engaged in RLM will almost unavoidably create anomalies. We find the NSC Gini Index correlates highly with anomalous patterns across multiple "red flags", each suggestive of RLM. Between 81 (0.4%, FDR < 0.05) and 231 (1.1%, FDR < 0.10) authors are outliers on the curve, suggestive of chronic, repeated RLM. Approximately 16% of all authors may have engaged in RLM to some degree. Authors who use 18% or more of their references for self-citation are significantly more likely to have NSC Gini distortions, suggesting a potential willingness to coerce others to cite them.

Keywords: Citation analysis; Citation behavior; Scientific ethics.