Surgical gesture classification from video and kinematic data

Luca Zappella; Benjamín Béjar; Gregory Hager; René Vidal

doi:10.1016/j.media.2013.04.007

Med Image Anal. 2013 Oct;17(7):732-45. doi: 10.1016/j.media.2013.04.007. Epub 2013 Apr 28.

Authors

Luca Zappella¹, Benjamín Béjar, Gregory Hager, René Vidal

Affiliation

¹ Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218, USA. zappella@cis.jhu.edu

PMID: 23706754

Abstract

Much of the existing work on automatic classification of gestures and skill in robotic surgery is based on dynamic cues (e.g., time to completion, speed, forces, torque) or kinematic data (e.g., robot trajectories and velocities). While videos could be equally or more discriminative (e.g., videos contain semantic information not present in kinematic data), they are typically not used because of the difficulties associated with automatic video interpretation. In this paper, we propose several methods for automatic surgical gesture classification from video data. We assume that the video of a surgical task (e.g., suturing) has been segmented into video clips, each corresponding to a single gesture (e.g., grabbing the needle, passing the needle), and propose three methods to classify the gesture of each video clip. In the first method, we model each video clip as the output of a linear dynamical system (LDS) and use metrics in the space of LDSs to classify new video clips. In the second, we use spatio-temporal features extracted from each video clip to learn a dictionary of spatio-temporal words, and use a bag-of-features (BoF) approach to classify new video clips. In the third, we use multiple kernel learning (MKL) to combine the LDS and BoF approaches. Because the LDS approach is also applicable to kinematic data, we additionally use MKL to combine both types of data in order to exploit their complementarity. Our experiments on a typical surgical training setup show that methods based on video data perform as well as, if not better than, state-of-the-art approaches based on kinematic data. In turn, the combination of kinematic and video data outperforms every algorithm based on one type of data alone.
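
To make the bag-of-features idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each clip has already been reduced to a set of local descriptors; here these are synthetic 3-D vectors standing in for spatio-temporal features. The dictionary is learned with a plain k-means, and a new clip is labeled by nearest neighbor under a chi-square histogram distance. The function names, the gesture labels, the dictionary size, and the choice of classifier are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def assign_words(features, centers):
    """Index of the nearest dictionary word for every descriptor."""
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def learn_dictionary(features, n_words, n_iter=20, seed=0):
    """Plain k-means (Lloyd's algorithm); each row of the result is one 'word'."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(features), size=n_words, replace=False)
    centers = features[idx].copy()
    for _ in range(n_iter):
        labels = assign_words(features, centers)
        for k in range(n_words):
            members = features[labels == k]
            if len(members) > 0:  # leave a word unchanged if no descriptor maps to it
                centers[k] = members.mean(axis=0)
    return centers

def bof_histogram(features, centers):
    """Normalized histogram of word occurrences for one clip."""
    labels = assign_words(features, centers)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    return hist / hist.sum()

def chi2_distance(h1, h2, eps=1e-12):
    """Chi-square distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def classify_clip(features, centers, train_hists, train_labels):
    """Label a clip with the label of its nearest training histogram."""
    h = bof_histogram(features, centers)
    dists = [chi2_distance(h, t) for t in train_hists]
    return train_labels[int(np.argmin(dists))]

def make_clip(rng, frac_far, n=200):
    """Synthetic stand-in for one clip's descriptors (hypothetical data)."""
    n_far = int(round(frac_far * n))
    near = rng.normal(0.0, 1.0, size=(n - n_far, 3))
    far = rng.normal(10.0, 1.0, size=(n_far, 3))
    return np.vstack([near, far])

# Toy experiment: the two hypothetical gestures differ in their descriptor mix.
rng = np.random.default_rng(0)
train_clips = [make_clip(rng, 0.2), make_clip(rng, 0.2),
               make_clip(rng, 0.8), make_clip(rng, 0.8)]
train_labels = ["grab_needle", "grab_needle", "pass_needle", "pass_needle"]

centers = learn_dictionary(np.vstack(train_clips), n_words=4)
train_hists = [bof_histogram(c, centers) for c in train_clips]

predicted = classify_clip(make_clip(rng, 0.8), centers, train_hists, train_labels)
print(predicted)
```

In this sketch the histogram discards the temporal order of the descriptors, which is the usual trade-off of a bag-of-features representation. A dynamical model such as the LDS approach mentioned in the abstract retains that ordering, which is one plausible reason the two are complementary.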

Keywords: Bag of features; Dynamical system classification; Multiple kernel learning; Surgical gesture classification; Time series classification.

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms
Gestures*
Image Enhancement / methods
Image Interpretation, Computer-Assisted / methods*
Imaging, Three-Dimensional / methods
Motion
Pattern Recognition, Automated / methods*
Photography / methods*
Reproducibility of Results
Robotics / methods*
Sensitivity and Specificity
Surgery, Computer-Assisted / methods*
Suture Techniques
Video Recording / methods*