Segment-Tube: Spatio-Temporal Action Localization in Untrimmed Videos with Per-Frame Segmentation

Le Wang; Xuhuan Duan; Qilin Zhang; Zhenxing Niu; Gang Hua; Nanning Zheng

doi:10.3390/s18051657

Sensors (Basel). 2018 May 22;18(5):1657. doi: 10.3390/s18051657.

Authors

Le Wang¹, Xuhuan Duan², Qilin Zhang³, Zhenxing Niu⁴, Gang Hua⁵, Nanning Zheng⁶

Affiliations

¹ Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China. lewang@xjtu.edu.cn.
² Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China. duanxuhuan0123@stu.xjtu.edu.cn.
³ HERE Technologies, Chicago, IL 60606, USA. qilin.zhang@here.com.
⁴ Alibaba Group, Hangzhou 311121, China. zhenxing.nzx@alibaba-inc.com.
⁵ Microsoft Research, Redmond, WA 98052, USA. ganghua@microsoft.com.
⁶ Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China. nnzheng@xjtu.edu.cn.

Abstract

Inspired by recent spatio-temporal action localization efforts based on tubelets (sequences of bounding boxes), we present a new spatio-temporal action localization detector, Segment-tube, which consists of sequences of per-frame segmentation masks. The proposed Segment-tube detector can temporally pinpoint the starting and ending frames of each action category in untrimmed videos, even in the presence of preceding or subsequent interfering actions. At the same time, it produces per-frame segmentation masks instead of bounding boxes, offering higher spatial accuracy than tubelets. This is achieved by alternating iteratively between temporal action localization and spatial action segmentation. Experiments on three datasets validate the efficacy of the proposed method: (1) temporal action localization on the THUMOS 2014 dataset; (2) spatial action segmentation on the SegTrack dataset; and (3) joint spatio-temporal action localization on the newly proposed ActSeg dataset. The results show that our method compares favorably with existing state-of-the-art methods.

Keywords: 3D ConvNets; LSTM; action localization; action segmentation.
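
The abstract contrasts tubelets (sequences of bounding boxes) with Segment-tubes (sequences of per-frame segmentation masks). The following is a minimal illustrative sketch of that difference only, not the authors' implementation: the `SegmentTube` class, the helper functions, and the toy mask are all hypothetical names and data introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Mask = List[List[int]]           # binary mask, mask[y][x] in {0, 1}
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


@dataclass
class SegmentTube:
    """Hypothetical container: an action's start frame plus one mask per frame."""
    start_frame: int
    masks: List[Mask]

    @property
    def end_frame(self) -> int:
        # Inclusive index of the last frame covered by the tube.
        return self.start_frame + len(self.masks) - 1


def mask_to_box(mask: Mask) -> Optional[Box]:
    """Tightest box around the foreground pixels, or None if the mask is empty."""
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def to_tubelet(tube: SegmentTube) -> List[Optional[Box]]:
    """Collapse each mask to a box; the pixel-level shape is lost in the process."""
    return [mask_to_box(m) for m in tube.masks]


def box_area(box: Box) -> int:
    """Number of pixels covered by an inclusive box."""
    x0, y0, x1, y1 = box
    return (x1 - x0 + 1) * (y1 - y0 + 1)


def mask_area(mask: Mask) -> int:
    """Number of foreground pixels in a binary mask."""
    return sum(sum(row) for row in mask)


# Toy example (made-up data): an L-shaped region over two frames from frame 10.
frame = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]
tube = SegmentTube(start_frame=10, masks=[frame, frame])
tubelet = to_tubelet(tube)
```

In this toy case the mask marks 5 foreground pixels per frame, while the enclosing box `(0, 0, 2, 2)` covers 9, which illustrates why a mask sequence can describe an actor's extent more tightly than a box sequence.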