Detecting anomalous referencing patterns in PubMed papers suggestive of author-centric reference list manipulation

Jonathan D Wren; Constantin Georgescu

doi:10.1007/s11192-022-04503-6

Scientometrics. 2022 Oct;127(10):5753-5771. doi: 10.1007/s11192-022-04503-6. Epub 2022 Sep 8.

Authors

Jonathan D Wren¹ ² ³ ⁴, Constantin Georgescu¹

Affiliations

¹ Genes and Human Disease Research Program, Oklahoma Medical Research Foundation, 825 N.E. 13th Street, Oklahoma City, OK 73104-5005, USA.
² Biochemistry and Molecular Biology Department, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA.
³ Stephenson Cancer Center, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA.
⁴ Department of Geriatric Medicine, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA.

Abstract

Although citations are used as a quantifiable, objective metric of academic influence, references could be added to a paper solely to inflate the perceived influence of a body of research. This reference list manipulation (RLM) could take place during the peer-review process, or prior to it. Surveys have estimated how many people may have been affected by coercive RLM at one time or another, but it is not known how many authors engage in RLM, nor to what degree. By examining a subset of active, highly published authors (n = 20,803) in PubMed, we find the frequency of non-self-citations (NSC) to one author coming from a single paper approximates Zipf's law. Author-centric deviations from it are approximately normally distributed, permitting deviations to be quantified statistically. Framed as an anomaly detection problem, statistical confidence increases when an author is an outlier by multiple metrics. Anomalies are not proof of RLM, but authors engaged in RLM will almost unavoidably create anomalies. We find the NSC Gini Index correlates highly with anomalous patterns across multiple "red flags", each suggestive of RLM. Between 81 (0.4%, FDR < 0.05) and 231 (1.1%, FDR < 0.10) authors are outliers on the curve, suggestive of chronic, repeated RLM. Approximately 16% of all authors may have engaged in RLM to some degree. Authors who use 18% or more of their references for self-citation are significantly more likely to have NSC Gini distortions, suggesting a potential willingness to coerce others to cite them.
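To make the Gini-based "red flag" concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not define how the NSC Gini Index is computed, so the sketch assumes it is the standard Gini coefficient taken over the number of non-self-citations that each citing paper gives to one author. The function names `gini` and `nsc_frequency` and the two sample lists are hypothetical and for illustration only.

```python
from collections import Counter


def gini(values):
    """Standard Gini coefficient of a list of non-negative counts.

    Assumption: each value is the number of non-self-citations (NSC)
    that one citing paper gives to the author under study.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted counts (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def nsc_frequency(per_paper_counts):
    """How many citing papers gave 1, 2, 3, ... citations to the author.

    This is the frequency distribution that the abstract reports as
    approximating Zipf's law across authors.
    """
    return Counter(per_paper_counts)


# Hypothetical data: eight citing papers per author.
even_author = [1, 1, 1, 1, 1, 1, 1, 1]     # every paper cites the author once
skewed_author = [1, 1, 1, 1, 1, 1, 1, 25]  # one paper supplies 25 citations

print(gini(even_author))
print(gini(skewed_author))
print(nsc_frequency(skewed_author))
```

Under this assumption, an author whose citations are spread evenly across citing papers scores near 0, while an author who receives a large block of citations from a single paper scores much higher. As the abstract notes, such an anomaly is not proof of RLM.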

Keywords: Citation analysis; Citation behavior; Scientific ethics.

Grants and funding

P30 AG050911/AG/NIA NIH HHS/United States