, 4, 1453

MSL: Facilitating Automatic and Physical Analysis of Published Scientific Literature in PDF Format



Zeeshan Ahmed et al. F1000Res.


Published scientific literature contains millions of figures that report the results of different scientific experiments, e.g. PCR-ELISA data, microarray analysis, gel electrophoresis, mass spectrometry data, DNA/RNA sequencing, diagnostic imaging (CT/MRI and ultrasound scans), and medical imaging such as electroencephalography (EEG), magnetoencephalography (MEG), electrocardiography (ECG) and positron-emission tomography (PET) images. The importance of biomedical figures is widely recognized in the scientific and medical communities, as they present major original data and experimental and computational results in concise form. One major challenge in implementing a system for scientific literature analysis is extracting and analyzing text and figures from published PDF files by physical and logical document analysis. Here we present a bioinformatics tool based on a product line architecture, 'Mining Scientific Literature (MSL)', which supports the extraction of text and images by interpreting all kinds of published PDF files using advanced data mining and image processing techniques. It provides modules for the marginalization of extracted text based on different coordinates and keywords, the visualization of extracted figures, and the extraction of embedded text from all kinds of biological and biomedical figures using Optical Character Recognition (OCR). Moreover, for further analysis and usage, it generates the system's output in different formats, including text, PDF, XML and image files. Hence, MSL is an easy-to-install and easy-to-use analysis tool for interpreting published scientific literature in PDF format.
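
The coordinate- and keyword-based selection of extracted text described above can be illustrated with a minimal sketch. It is not MSL's implementation (MSL relies on .NET PDF libraries); the `TextBlock` type, the function names, the region tuple and the sample coordinates are all hypothetical, and the sketch assumes text blocks with bounding boxes have already been extracted from a page.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Hypothetical extracted text block with its bounding box in page units."""
    text: str
    x0: float  # left edge
    y0: float  # bottom edge
    x1: float  # right edge
    y1: float  # top edge


def within_margins(block, left, bottom, right, top):
    """Return True if the block lies entirely inside the given region."""
    return (block.x0 >= left and block.y0 >= bottom
            and block.x1 <= right and block.y1 <= top)


def filter_blocks(blocks, region, keywords=()):
    """Keep blocks inside `region`; if keywords are given, also require a match."""
    left, bottom, right, top = region
    lowered = [k.lower() for k in keywords]
    selected = []
    for block in blocks:
        if not within_margins(block, left, bottom, right, top):
            continue  # outside the chosen margins, e.g. header or footer
        if lowered and not any(k in block.text.lower() for k in lowered):
            continue  # inside the margins but no keyword hit
        selected.append(block)
    return selected


# Illustrative data only: one header, two body lines, one page-number footer.
blocks = [
    TextBlock("Running header", 50, 760, 550, 780),
    TextBlock("OCR was applied to each figure.", 50, 400, 550, 420),
    TextBlock("Samples were sequenced twice.", 50, 300, 550, 320),
    TextBlock("Page 3", 280, 20, 320, 35),
]
body_region = (40, 50, 560, 750)  # assumed left, bottom, right, top margins
hits = filter_blocks(blocks, body_region, keywords=["ocr"])
```

With these sample values, the margins alone drop the header and the page number, and the keyword then narrows the remaining body text to the single line that mentions OCR.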

Keywords: Bioinformatics; Biomedical; Data mining; Images; OCR; PDF; Scientific literature; Text.

Conflict of interest statement

Competing interests: No competing interests were disclosed.


Figure 1. Graphical user interfaces of MSL and modular workflow.
This figure shows the graphical user interface and modular workflow of the three main components: Text, Image and OCR. A PDF document is input and processed by MSL. The Text component provides the extracted, searched and marginalized text in reading order, together with the file attributes. The Image component provides a preview of the images extracted from the document. The OCR component provides the text extracted from a selected and processed image.
Figure 2. Conceptual architecture of MSL and workflow of its components.
This figure shows the conceptual architecture of the MSL application, which consists of three main components (Text, Image and OCR) and nine sub-components: Text File, Image File, Visualize Image, PDF File, LEADTOOLS, XML File, iTEXTSharp, Bytescout and Spire. As the figure shows, the Text component applies iTEXTSharp, Bytescout and Spire to extract the text from the PDF document and writes the output to an XML file. The Image component applies Spire to extract images from the PDF document and displays them using Visualize Image. The OCR component applies LEADTOOLS to extract text from images and exports it to PDF format.
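
As a rough illustration of the last step of the Text component (writing extracted text to a structured XML file), the following sketch uses Python's standard library. The element and attribute names (`document`, `page`, `line`, `source`, `number`) and the function `text_to_xml` are assumptions made for this example; the actual XML schema produced by MSL is not described here.

```python
import xml.etree.ElementTree as ET


def text_to_xml(source_name, pages):
    """Serialize extracted text (a list of pages, each a list of lines) as XML.

    The tag names are hypothetical and do not reflect MSL's real output schema.
    """
    root = ET.Element("document", {"source": source_name})
    for number, lines in enumerate(pages, start=1):
        page = ET.SubElement(root, "page", {"number": str(number)})
        for line in lines:
            ET.SubElement(page, "line").text = line
    return ET.tostring(root, encoding="unicode")


# Illustrative input: text that is assumed to be already extracted from a PDF.
xml_output = text_to_xml("paper.pdf", [["Title", "Abstract text"], ["Methods"]])
```

Keeping one element per page and per line preserves the reading order, so a downstream tool can parse the file and search or re-filter the text without reopening the PDF.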
Figure 3. Example: Publication, Figure 1 of (YY et al., 2015).
This figure shows document image analysis, text extraction and PDF conversion. A figure (consisting of three panels, including two charts, one image and a table) is taken from one of the randomly selected papers (ref. 26). OCR (LEADTOOLS) is applied to extract the text from the figure and report it in plain text form (section: Extracted Text from Figure) and in a PDF file with margins similar to those of the original figure (section: Exported text in PDF format).
Figure 4. Example: Publication, Page 1 (Ahmed et al., 2015).
This figure shows document image analysis, text extraction and PDF conversion. First, a scanned, image-based page from one of the randomly selected papers is processed using OCR (LEADTOOLS). The text is extracted from the image and a new PDF is generated from it, with the text placed at margins similar to those of the image file.
Figure 5. Screenshot of all the extracted images and generated files (XML and PDF).
This figure shows the different files generated during the analysis of a PDF document. The PDF file (top, left) is the original document, the XML file is the structured (tagged) form of the extracted text, the second PDF file (top, right) contains the text extracted from an image (see Figure 3), and all other files are images extracted from the PDF document.
Figure 6. MSL six-step installation process.




Grant support

This work was supported by a German Research Foundation grant (DFG-TR34/Z1) to TD.
