Front Neuroinform. 2015 Mar 16;9:4.
doi: 10.3389/fninf.2015.00004. eCollection 2015.

A Scalable Neuroinformatics Data Flow for Electrophysiological Signals Using MapReduce

Free PMC article

Catherine Jayapandian et al. Front Neuroinform.

Abstract

Data-driven neuroscience research is providing new insights into the progression of neurological disorders and supporting the development of improved treatment approaches. However, the volume, velocity, and variety of neuroscience data generated from sophisticated recording instruments and acquisition methods have exacerbated the limited scalability of existing neuroinformatics tools. This makes it difficult for neuroscience researchers to effectively leverage the growing multi-modal neuroscience data to advance research in serious neurological disorders, such as epilepsy. We describe the development of the Cloudwave data flow, which uses new data partitioning techniques to store and analyze electrophysiological signal data in a distributed computing infrastructure. The Cloudwave data flow uses the MapReduce parallel programming algorithm to implement an integrated signal data processing pipeline that scales with large volumes of data generated at high velocity. Using an epilepsy domain ontology together with an epilepsy-focused extensible data representation format called Cloudwave Signal Format (CSF), the data flow addresses the challenge of data heterogeneity and is interoperable with existing neuroinformatics data representation formats, such as HDF5. The scalability of the Cloudwave data flow is evaluated using a 30-node cluster installed with the open source Hadoop software stack. The results demonstrate that the Cloudwave data flow can process increasing volumes of signal data by leveraging Hadoop Data Nodes to reduce the total data processing time. The Cloudwave data flow is a template for developing highly scalable neuroscience data processing pipelines using MapReduce algorithms to support a variety of user applications.
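
The map-then-aggregate shape of such a pipeline can be illustrated with a minimal in-memory sketch. This is not the Cloudwave implementation and does not use Hadoop; the dictionary keys (file_id, index, gain, samples) and the function names are assumptions made for illustration only. The map step processes one segment and emits it under its source file identifier, and the reduce step reassembles the processed segments of each file in order.

```python
from collections import defaultdict


def map_segment(segment):
    """Map step: process one segment and emit it keyed by its source file.

    The per-sample scaling stands in for whatever processing a real
    pipeline would apply (hypothetical placeholder).
    """
    processed = [sample * segment["gain"] for sample in segment["samples"]]
    yield segment["file_id"], (segment["index"], processed)


def reduce_by_file(file_id, values):
    """Reduce step: order the processed segments of one file by index."""
    ordered = [payload for _, payload in sorted(values, key=lambda v: v[0])]
    return {"file_id": file_id, "fragments": ordered}


def run_data_flow(segments):
    """Simulate the shuffle phase in memory: group map output by key."""
    grouped = defaultdict(list)
    for segment in segments:
        for key, value in map_segment(segment):
            grouped[key].append(value)
    return [reduce_by_file(key, values) for key, values in sorted(grouped.items())]
```

In a real cluster the grouping in run_data_flow is performed by the framework across Data Nodes; here it is a plain dictionary so the control flow is easy to follow.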

Keywords: MapReduce; cloudwave signal format; electrophysiological signal data; epilepsy and seizure ontology; epilepsy research.

Figures

Figure 1
Data acquisition and management in an Epilepsy Monitoring Unit (EMU). Multiple modalities of data are generated during a patient's stay in an EMU, including electrophysiological signal data. Three neuroinformatics tools have been developed as part of the PRISM project: (a) OPIC for patient information collection, (b) EpiDEA for clinical text processing of discharge summaries and related documents, and (c) Cloudwave for managing signal data. The Cloudwave data flow uses MapReduce and a distributed file system to store and process signal data for scalability. The data processed and generated by the Cloudwave data flow are consumed by a Web browser-based signal visualization interface.
Figure 2
Cloudwave data flow. EDF files generated by signal recording instruments are deposited in a pre-specified folder, which is regularly polled by a daemon process. If one or more EDF files are detected, the Cloudwave data flow follows multiple steps: (1) in the pre-processing phase, signal data are partitioned into fragments of a specific time duration (epoch) and stored in a new self-descriptive structure called EDFSegment, (2) in the second phase, an EDFSegment method is invoked to store signal data in channel-oriented order for easier composition into signal montages, (3) in the third phase, the signal data are converted from binary to short integer format and from digital to physical values for use by the Cloudwave signal visualization interface, (4) in the fourth phase, the EDFSegments are transformed into Cloudwave Signal Format (CSF) data objects, which are aggregated based on the original EDF file identifier in the last phase. The CSF data objects can be transferred over the network to the Cloudwave signal visualization module more efficiently than the original EDF files.
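
The partitioning and conversion steps in this caption can be sketched as follows. This is a hedged illustration, not the Cloudwave code: the function names are hypothetical, and the sketch assumes the usual EDF conventions of 16-bit little-endian signed samples and a linear digital-to-physical scaling derived from the per-channel header minima and maxima.

```python
import struct


def decode_samples(raw):
    """Decode raw bytes into 16-bit little-endian signed integers."""
    count = len(raw) // 2
    return list(struct.unpack(f"<{count}h", raw[: count * 2]))


def digital_to_physical(samples, dig_min, dig_max, phys_min, phys_max):
    """Scale digital values linearly into physical units for one channel."""
    gain = (phys_max - phys_min) / (dig_max - dig_min)
    return [(s - dig_min) * gain + phys_min for s in samples]


def partition_into_epochs(samples, sampling_rate_hz, epoch_seconds=30):
    """Split one channel into fixed-duration fragments (last may be shorter)."""
    size = sampling_rate_hz * epoch_seconds
    return [samples[i:i + size] for i in range(0, len(samples), size)]
```

For example, a channel sampled at 200 Hz yields fragments of 6000 samples when epoch_seconds is left at 30; the sampling rate here is an arbitrary illustrative value, not one taken from the study.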
Figure 3
EDFSegment and CSF object. During the pre-processing phase, the study metadata and channel-specific metadata from the EDF file are integrated with clinical event annotations and stored with the partitioned signal data (fragments corresponding to 30 s epochs). The total number of fragments per EDFSegment is a configurable parameter in the Cloudwave data flow that can be adjusted according to the available memory in the Hadoop Data Nodes. In the first phase of the MapReduce algorithm, the signal data stored as EDF Data Records are transformed into channel-oriented data. After additional data processing steps to support the Cloudwave signal visualization module, the CSF data objects are created using the signal data partitioning scheme of the EDFSegments.
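
The transformation from record-oriented to channel-oriented data can be sketched as below. The EDFSegment fields shown are hypothetical names chosen to mirror the caption, not the actual Cloudwave structure. The sketch assumes the EDF layout in which each data record holds a fixed number of consecutive samples for every channel, one channel after another.

```python
from dataclasses import dataclass, field


@dataclass
class EDFSegment:
    """Illustrative container; all field names are assumptions."""
    source_file_id: str
    study_metadata: dict
    channel_metadata: dict
    annotations: list
    fragments: list = field(default_factory=list)  # one entry per epoch


def records_to_channels(records, samples_per_record):
    """Regroup flat data records into one continuous sample list per channel.

    records: list of flat sample lists, one per data record.
    samples_per_record: dict mapping channel label to its sample count
    per record, in the order the channels appear inside a record.
    """
    channels = {label: [] for label in samples_per_record}
    for record in records:
        offset = 0
        for label, count in samples_per_record.items():
            channels[label].extend(record[offset:offset + count])
            offset += count
    return channels
```

Once the samples are grouped per channel, selecting the channels needed for a montage becomes a dictionary lookup rather than a scan through every data record.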
Figure 4
Epilepsy and Seizure Ontology (EpSO) class hierarchy. EpSO models 1350 classes related to epilepsy neurological disorders, including the clinical event terms used to annotate signal data. The class hierarchy of EpSO allows software applications to use reasoning to improve the quality of query results and is used in a cohort query user interface called the Multi-Modality Epilepsy Data Capture and Integration System (MEDCIS). The EpSO classes are used as reference terminology for signal data annotation in the Cloudwave data flow, which reduces terminological heterogeneity and facilitates data sharing and integration across epilepsy informatics tools.
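
One way a class hierarchy can improve query results is subsumption: a query for a general term also returns annotations tagged with its more specific subclasses. The sketch below is a minimal illustration of that idea. The class names and the parent_of mapping are placeholders and are not taken from EpSO, and the traversal is a plain graph walk rather than an ontology reasoner.

```python
def descendants(parent_of, root):
    """Return root plus every class reachable below it in the hierarchy.

    parent_of maps each class name to its direct parent class name.
    """
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)
    found, stack = {root}, [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def query_annotations(annotations, parent_of, term):
    """Select annotations whose term is the query term or one of its subclasses."""
    wanted = descendants(parent_of, term)
    return [a for a in annotations if a["term"] in wanted]


# Placeholder hierarchy for illustration only (not actual EpSO classes).
EXAMPLE_HIERARCHY = {
    "EventTypeA": "ClinicalEvent",
    "EventTypeA1": "EventTypeA",
    "EventTypeB": "ClinicalEvent",
}
```

With this hierarchy, a query for EventTypeA also matches annotations tagged EventTypeA1, which a plain string comparison would miss.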
Figure 5
Cloudwave data flow evaluation results with variable-sized signal data fragments. The number of signal data fragments in an EDFSegment object can be modified according to the available memory in the Hadoop Data Nodes. The results of this experiment demonstrate that for 25 GB of EDF files processed on 15 and 30 Data Nodes, changing the total number of fragments per EDFSegment does not lead to significant variations in the performance of the Cloudwave data flow. This parameter can still be tuned to get the maximum improvement in performance; for example, 12 and 14 signal fragments per EDFSegment object are the optimal values for 15 and 30 Hadoop Data Nodes, respectively.
Figure 6
Scalability of the Cloudwave data flow with increasing size of data. The Cloudwave data flow effectively uses multiple Hadoop Data Nodes to scale with increasing amounts of data and consistently reduces the total time taken to process the data. The results also demonstrate that the data partitioning approach allows the Cloudwave data flow to flexibly modify the volume of signal data per EDFSegment (total number of signal fragments) without adversely affecting time performance (EDFSegments with 8 (A) and 16 (B) fragments have comparable performance).
