Motivation: Techniques that are capable of automatically analyzing the information structure of scientific articles could be highly useful for improving information access to biomedical literature. However, most existing approaches rely on supervised machine learning (ML) and substantial labeled data that are expensive to develop and apply to different sub-fields of biomedicine. Recent research shows that minimal supervision is sufficient for fairly accurate information structure analysis of biomedical abstracts. However, is it realistic for full articles given their high linguistic and informational complexity? We introduce and release a novel corpus of 50 biomedical articles annotated according to the Argumentative Zoning (AZ) scheme, and investigate active learning with one of the most widely used ML models-Support Vector Machines (SVM)-on this corpus. Additionally, we introduce two novel applications that use AZ to support real-life literature review in biomedicine via question answering and summarization.
Results: We show that active learning with SVM trained on 500 labeled sentences (6% of the corpus) performs surprisingly well with the accuracy of 82%, just 2% lower than fully supervised learning. In our question answering task, biomedical researchers find relevant information significantly faster from AZ-annotated than unannotated articles. In the summarization task, sentences extracted from particular zones are significantly more similar to gold standard summaries than those extracted from particular sections of full articles. These results demonstrate that active learning of full articles' information structure is indeed realistic and the accuracy is high enough to support real-life literature review in biomedicine.
Availability: The annotated corpus, our AZ classifier and the two novel applications are available at http://www.cl.cam.ac.uk/yg244/12bioinfo.html