Efficient Execution of Microscopy Image Analysis on CPU, GPU, and MIC Equipped Cluster Systems

G Andrade; R Ferreira; George Teodoro; Leonardo Rocha; Joel H Saltz; Tahsin Kurc

doi:10.1109/SBAC-PAD.2014.15

Proc Symp Comput Archit High Perform Comput. 2014 Oct:2014:89-96. doi: 10.1109/SBAC-PAD.2014.15.

Authors

G Andrade¹, R Ferreira¹, George Teodoro², Leonardo Rocha³, Joel H Saltz⁴, Tahsin Kurc⁴

Affiliations

¹ Federal University of Minas Gerais.
² University of Brasília.
³ Federal University of São João del Rei.
⁴ Stony Brook University.

Abstract

High performance computing is undergoing a major paradigm shift with the introduction of accelerators such as graphics processing units (GPUs) and the Intel Xeon Phi (MIC). These processors make tremendous computing power available at low cost and are transforming machines into hybrid systems equipped with CPUs and accelerators. Although these systems can deliver very high peak performance, making full use of their resources in real-world applications is a complex problem. Most applications currently deployed on these machines still execute on a single processor, leaving the other devices underutilized. In this paper we explore a scenario in which applications are composed of hierarchical data flow tasks that are allocated to the nodes of a distributed memory machine at coarse grain, while each of them may be composed of several finer-grain tasks that can be allocated to different devices within the node. We propose and implement novel performance-aware scheduling techniques that can be used to allocate tasks to devices. We evaluate our techniques using a pathology image analysis application used to investigate brain cancer morphology. Our experimental evaluation shows that the proposed scheduling strategies significantly outperform other efficient scheduling techniques, such as Heterogeneous Earliest Finish Time (HEFT), in cooperative executions using CPUs, GPUs, and MICs. We also show experimentally that our strategies are less sensitive to inaccuracy in the scheduling input data and that the performance gains are maintained as the application scales.
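
To make the baseline mentioned in the abstract concrete, the following is a minimal sketch of an earliest-finish-time assignment in the spirit of HEFT. It is not the scheduling technique proposed in the paper. It assumes independent fine-grain tasks with no dependencies or data-transfer costs, and the function name `earliest_finish_assign`, the task names, and all per-device time estimates are hypothetical values chosen for illustration.

```python
from typing import Dict, List, Tuple


def earliest_finish_assign(
    est_times: Dict[str, Dict[str, float]],
    devices: List[str],
) -> List[Tuple[str, str, float]]:
    """Greedy earliest-finish-time assignment (simplified, HEFT-like).

    est_times maps task name -> {device name: estimated run time}.
    Tasks are assumed independent, so the HEFT priority reduces to the
    mean estimated run time across devices (largest first).
    """
    # Time at which each device becomes free.
    ready = {d: 0.0 for d in devices}

    # Order tasks by mean estimated cost, most expensive first.
    order = sorted(
        est_times,
        key=lambda t: sum(est_times[t][d] for d in devices) / len(devices),
        reverse=True,
    )

    schedule = []
    for task in order:
        # Pick the device on which this task would finish earliest.
        best = min(devices, key=lambda d: ready[d] + est_times[task][d])
        finish = ready[best] + est_times[task][best]
        ready[best] = finish
        schedule.append((task, best, finish))
    return schedule


# Hypothetical estimates (arbitrary time units) for three fine-grain tasks.
example_times = {
    "segment": {"cpu": 8.0, "gpu": 2.0, "mic": 4.0},
    "features": {"cpu": 6.0, "gpu": 3.0, "mic": 3.0},
    "threshold": {"cpu": 2.0, "gpu": 1.5, "mic": 2.5},
}
example_schedule = earliest_finish_assign(example_times, ["cpu", "gpu", "mic"])
makespan = max(finish for _, _, finish in example_schedule)
```

A scheduler of this kind depends directly on the accuracy of the per-device time estimates, which is the kind of sensitivity to scheduling input data that the abstract reports evaluating.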
