A vision transformer for decoding surgeon activity from surgical videos

Nat Biomed Eng. 2023 Jun;7(6):780-796. doi: 10.1038/s41551-023-01010-8. Epub 2023 Mar 30.

Authors

Dani Kiyasseh¹, Runzhuo Ma², Taseen F Haque², Brian J Miles³, Christian Wagner⁴, Daniel A Donoho⁵, Animashree Anandkumar⁶, Andrew J Hung⁷

Affiliations

¹ Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA. danikiy@hotmail.com.
² Center for Robotic Simulation and Education, Catherine & Joseph Aresty Department of Urology, University of Southern California, Los Angeles, CA, USA.
³ Department of Urology, Houston Methodist Hospital, Houston, TX, USA.
⁴ Department of Urology, Pediatric Urology and Uro-Oncology, Prostate Center Northwest, St. Antonius-Hospital, Gronau, Germany.
⁵ Division of Neurosurgery, Center for Neuroscience, Children's National Hospital, Washington, DC, USA.
⁶ Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA.
⁷ Center for Robotic Simulation and Education, Catherine & Joseph Aresty Department of Urology, University of Southern California, Los Angeles, CA, USA. ajhung@gmail.com.

Abstract

The intraoperative activity of a surgeon has a substantial impact on postoperative outcomes. However, for most surgical procedures, the details of intraoperative surgical actions, which can vary widely, are not well understood. Here we report a machine learning system leveraging a vision transformer and supervised contrastive learning for the decoding of elements of intraoperative surgical activity from videos commonly collected during robotic surgeries. The system accurately identified surgical steps, actions performed by the surgeon, the quality of these actions and the relative contribution of individual video frames to the decoding of the actions. Through extensive testing on data from three different hospitals located on two different continents, we show that the system generalizes across videos, surgeons, hospitals and surgical procedures, and that it can provide information on surgical gestures and skills from unannotated videos. Decoding intraoperative activity via accurate machine learning systems could be used to provide surgeons with feedback on their operating skills, and may allow for the identification of optimal surgical behaviour and for the study of relationships between intraoperative factors and postoperative outcomes.