
Segment-Tube: Spatio-Temporal Action Localization in Untrimmed Videos With Per-Frame Segmentation


Le Wang et al. Sensors (Basel). 2018;18(5).

Abstract

Inspired by recent spatio-temporal action localization efforts with tubelets (sequences of bounding boxes), we present a new spatio-temporal action localization detector, Segment-tube, which consists of sequences of per-frame segmentation masks. The proposed Segment-tube detector can temporally pinpoint the starting/ending frame of each action category in the presence of preceding/subsequent interference actions in untrimmed videos. Simultaneously, the Segment-tube detector produces per-frame segmentation masks instead of bounding boxes, offering superior spatial accuracy to tubelets. This is achieved by alternating iterative optimization between temporal action localization and spatial action segmentation. Experimental results on three datasets validated the efficacy of the proposed method, including (1) temporal action localization on the THUMOS 2014 dataset; (2) spatial action segmentation on the SegTrack dataset; and (3) joint spatio-temporal action localization on the newly proposed ActSeg dataset. It is shown that our method compares favorably with existing state-of-the-art methods.
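For readers who want a concrete picture of the alternating iterative optimization mentioned above, the following Python sketch outlines one plausible loop structure; it is not the authors' implementation, and the callables localize_temporally and segment_spatially are hypothetical placeholders for the paper's temporal localization and spatial segmentation modules.

    def segment_tube(video, localize_temporally, segment_spatially, max_iters=10):
        """Alternate temporal localization and spatial segmentation until the
        detected temporal span stops changing (or max_iters is reached)."""
        start, end = 0, len(video) - 1   # initialize with the whole untrimmed video
        masks = None
        for _ in range(max_iters):
            # Temporal step: refine the starting/ending frames, optionally using
            # the current masks to suppress background regions.
            new_start, new_end = localize_temporally(video, masks)
            # Spatial step: per-frame segmentation restricted to the detected span.
            masks = segment_spatially(video[new_start:new_end + 1])
            if (new_start, new_end) == (start, end):
                break                    # converged: temporal span is stable
            start, end = new_start, new_end
        return (start, end), masks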

Keywords: 3D ConvNets; LSTM; action localization; action segmentation.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Flowchart of the proposed spatio-temporal action localization detector Segment-tube. As the input, an untrimmed video contains multiple frames of actions (e.g., all actions in a pair figure skating video), with only a portion of these frames belonging to the relevant category (e.g., DeathSpirals). There are usually irrelevant preceding and subsequent actions (background). The Segment-tube detector iteratively alternates between optimizing temporal localization and spatial segmentation. The final output is a sequence of per-frame segmentation masks with precise starting/ending frames, denoted by the red chunk at the bottom, while the background is marked with green chunks at the bottom.
Figure 2
Overview of the proposed coarse-to-fine temporal action localization. (a) Coarse localization. Given an untrimmed video, we first generate saliency-aware video clips via variable-length sliding windows. The proposal network decides whether a video clip contains any action (in which case it is added to the candidate set) or pure background (in which case it is discarded). The subsequent classification network predicts the specific action class for each candidate clip and outputs classification scores and action labels. (b) Fine localization. Using the classification scores and action labels from the coarse localization, the video category is predicted and its starting and ending frames are obtained.
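A minimal sketch of the coarse localization stage described in Figure 2a, under the assumption that the proposal network returns an action-vs-background probability and the classification network returns per-class scores; proposal_net, classification_net, the window sizes, stride, and threshold below are illustrative assumptions, not values from the paper.

    def coarse_localization(frames, proposal_net, classification_net,
                            window_sizes=(16, 32, 64), stride=8, threshold=0.5):
        """Slide variable-length windows over the video, keep windows the
        proposal network judges to contain an action, and score them by class."""
        candidates = []
        for w in window_sizes:                       # variable-length sliding windows
            for t in range(0, len(frames) - w + 1, stride):
                clip = frames[t:t + w]
                if proposal_net(clip) < threshold:   # pure background: discard
                    continue
                scores = classification_net(clip)    # dict: action class -> score
                label = max(scores, key=scores.get)
                candidates.append({"start": t, "end": t + w - 1,
                                   "label": label, "score": scores[label]})
        return candidates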
Figure 3
Diagrammatic sketch of how the video category k is determined from the video clips.
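One common way such a video-level category could be aggregated from clip-level results is to average each class's scores over the candidate clips and take the highest; the snippet below illustrates only this assumption, not necessarily the exact rule depicted in Figure 3. It consumes candidate dicts with "label" and "score" keys, as in the sketch after Figure 2.

    from collections import defaultdict

    def video_category(candidates):
        """Pick the video-level category with the highest mean clip score."""
        totals, counts = defaultdict(float), defaultdict(int)
        for c in candidates:
            totals[c["label"]] += c["score"]
            counts[c["label"]] += 1
        # argmax over per-class average scores
        return max(totals, key=lambda k: totals[k] / counts[k])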
Figure 4
Qualitative temporal action localization results of the proposed Segment-tube detector for two action instances, (a) CliffDiving and (b) LongJump, in the testing split of the THUMOS 2014 dataset, with an intersection-over-union (IoU) threshold of 0.5.
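For reference, the temporal IoU behind the 0.5 threshold is the overlap between the predicted and ground-truth frame intervals divided by their union; the short snippet below computes it, and a detection counts as correct when the ratio is at least 0.5.

    def temporal_iou(pred, gt):
        """pred, gt: (start_frame, end_frame) tuples, inclusive."""
        inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]) + 1)
        union = (pred[1] - pred[0] + 1) + (gt[1] - gt[0] + 1) - inter
        return inter / union if union else 0.0

    # Example: temporal_iou((120, 180), (130, 200)) = 51 / 81 ≈ 0.63 >= 0.5,
    # so the prediction would count as a correct detection.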
Figure 5
Example results of three state-of-the-art video object segmentation methods (VOS [48], FOS [47] and BVS [52]) and our proposed Segment-tube detector on the SegTrack dataset [26,27].
Figure 6
Sample frames and their ground truth annotations in the ActSeg dataset. Action frames are marked by green check marks and the corresponding boundaries are marked by polygons with red edges. The background (irrelevant) frames are marked by red cross marks.
Figure 7
Example results of three video object segmentation methods (VOS [48], FOS [47] and BVS [52]) and our proposed Segment-tube detector on the ActSeg dataset.
Figure 8
Qualitative spatio-temporal action localization results of the proposed Segment-tube detector for two action instances, (a) ArabequeSpin and (b) NoHandWindmill, in the testing split of the ActSeg dataset, with an intersection-over-union (IoU) threshold of 0.5.


References

    1. Wang L., Qiao Y., Tang X. Action recognition and detection by combining motion and appearance features. THUMOS14 Action Recognition Challenge. 2014;1:2.
    2. Simonyan K., Zisserman A. Two-stream convolutional networks for action recognition in videos; Proceedings of the Advances in Neural Information Processing Systems; Montreal, QC, Canada, 8–13 December 2014; pp. 568–576.
    3. Wang L., Qiao Y., Tang X. Action recognition with trajectory-pooled deep-convolutional descriptors; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Boston, MA, USA, 7–12 June 2015; pp. 4305–4314.
    4. Tran D., Bourdev L., Fergus R., Torresani L., Paluri M. Learning spatiotemporal features with 3D convolutional networks; Proceedings of the IEEE International Conference on Computer Vision; Santiago, Chile, 11–18 December 2015; pp. 4489–4497.
    5. Donahue J., Anne Hendricks L., Guadarrama S., Rohrbach M., Venugopalan S., Saenko K., Darrell T. Long-term recurrent convolutional networks for visual recognition and description; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Boston, MA, USA, 7–12 June 2015; pp. 2625–2634.