Using machine learning to disentangle homonyms in large text corpora

Uri Roll; Ricardo A Correia; Oded Berger-Tal

doi:10.1111/cobi.13044

Conserv Biol. 2018 Jun;32(3):716-724. doi: 10.1111/cobi.13044. Epub 2018 Mar 10.

Authors

Uri Roll¹ ², Ricardo A Correia² ³ ⁴, Oded Berger-Tal¹

Affiliations

¹ Mitrani Department of Desert Ecology, The Jacob Blaustein Institutes for Desert Research, Ben-Gurion University of the Negev, Midreshet Ben-Gurion, 8499000, Israel.
² School of Geography and the Environment, University of Oxford, OX13QY, Oxford, U.K.
³ Institute of Biological Sciences and Health, Federal University of Alagoas, Campus A. C. Simões, Av. Lourival Melo Mota, s/n Tabuleiro dos Martins, AL, Maceió, Brazil.
⁴ DBIO & CESAM-Centre for Environmental and Marine Studies, University of Aveiro, Aveiro, Portugal.

PMID: 29086438
DOI: 10.1111/cobi.13044

Abstract

Systematic reviews are an increasingly popular decision-making tool that provides an unbiased summary of evidence to support conservation action. These reviews bridge the gap between researchers and managers by presenting a comprehensive overview of all studies relating to a particular topic and by identifying specifically where and under which conditions an effect is present. However, several technical challenges can severely hinder the feasibility and applicability of systematic reviews. One such challenge is homonyms (terms that share spelling but differ in meaning). Homonyms add noise to search results and cannot be easily identified or removed. We developed a semiautomated approach that can aid in the classification of homonyms among narratives. We used a combination of automated content analysis and artificial neural networks to quickly and accurately sift through large corpora of academic texts and classify them into distinct topics. As an example, we explored the use of the word reintroduction in academic texts. Reintroduction is used within the conservation context to indicate the release of organisms to their former native habitat; however, a Web of Science search for this word returned thousands of publications in which the term has other meanings and contexts. Using our method, we automatically classified a sample of 3000 of these publications with over 99% accuracy, relative to a manual classification. Our approach can be used easily with other homonyms and can greatly facilitate systematic reviews or similar work in which homonyms hinder the harnessing of large text corpora. Beyond homonyms, we see great promise in combining automated content analysis and machine-learning methods to handle and screen big data for relevant information in conservation science.

Keywords: automated content analysis; big data; homographs; neural networks; reintroductions; systematic reviews; text mining.
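
The general idea in the abstract, turning texts into word-based features and training a neural network to separate the senses of a homonym, can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not specify their features, network architecture, or software. The sketch below assumes a plain bag-of-words representation in place of automated content analysis and a single logistic neuron as the simplest possible neural network. The six training texts, their labels, and all function names are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus (not from the study).
# Label 1 = conservation sense of "reintroduction", 0 = any other sense.
TRAIN = [
    ("reintroduction of captive-bred wolves to their native habitat", 1),
    ("survival of released birds after species reintroduction in the reserve", 1),
    ("habitat quality and population growth following reintroduction of beavers", 1),
    ("reintroduction of the drug after withdrawal in patients", 0),
    ("allergic reaction on reintroduction of food in patients with eczema", 0),
    ("gradual reintroduction of therapy after treatment interruption", 0),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocab(texts):
    """Map every distinct word in the texts to a feature index."""
    words = sorted({w for t in texts for w in tokenize(t)})
    return {w: i for i, w in enumerate(words)}


def vectorize(text, vocab):
    """Bag-of-words vector: word counts, ignoring words outside the vocabulary."""
    vec = [0.0] * len(vocab)
    for word, count in Counter(tokenize(text)).items():
        if word in vocab:
            vec[vocab[word]] = float(count)
    return vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, vocab, epochs=200, lr=0.1):
    """Fit one logistic neuron with stochastic gradient descent on log loss."""
    weights = [0.0] * len(vocab)
    bias = 0.0
    data = [(vectorize(text, vocab), label) for text, label in samples]
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the pre-activation
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(text, vocab, weights, bias):
    """Return the estimated probability that the text uses the conservation sense."""
    x = vectorize(text, vocab)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


if __name__ == "__main__":
    vocab = build_vocab([text for text, _ in TRAIN])
    weights, bias = train(TRAIN, vocab)
    for text in ("reintroduction of wolves to their native habitat",
                 "reintroduction of therapy in patients with eczema"):
        print(f"{predict(text, vocab, weights, bias):.3f}  {text}")
```

A score above 0.5 is read here as the conservation sense and a score below 0.5 as another sense. With a real corpus, the manually classified sample would play the role of `TRAIN`, and accuracy would be measured on held-out texts rather than on the training set.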

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Conservation of Natural Resources*
Humans
Machine Learning*
Research Personnel