Sensors (Basel). 2021 Aug 23;21(16):5676. doi: 10.3390/s21165676.

Point Cloud Hand-Object Segmentation Using Multimodal Imaging with Thermal and Color Data for Safe Robotic Object Handover

Yan Zhang et al. Sensors (Basel). 2021.
Free PMC article

Abstract

This paper presents an application of neural networks operating on multimodal 3D data (3D point cloud, RGB, thermal) to segment human hands and hand-held objects effectively and precisely, in order to realize a safe human-robot object handover. We discuss the problems encountered in building a multimodal sensor system, with the focus on the calibration and alignment of a set of cameras comprising RGB, thermal, and NIR cameras. We propose the use of a copper-plastic chessboard calibration target with an internal active light source (near-infrared and visible light). After brief heating, the calibration target can be captured simultaneously and legibly by all cameras. Based on the multimodal dataset captured by our sensor system, PointNet, PointNet++, and RandLA-Net are used to verify the effectiveness of applying multimodal point cloud data to hand-object segmentation. These networks were trained on various data modes (XYZ, XYZ-T, XYZ-RGB, and XYZ-RGB-T). The experimental results show a significant improvement in segmentation performance for XYZ-RGB-T (mean Intersection over Union: 82.8% with RandLA-Net) compared with the other three modes (77.3% for XYZ-RGB, 35.7% for XYZ-T, and 35.7% for XYZ); notably, the Intersection over Union for the hand class alone reaches 92.6%.
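As a rough illustration of the four data modes compared above, the sketch below (not the authors' code; array names and sizes are placeholders) assembles per-point feature matrices from registered XYZ, RGB, and thermal values:

    import numpy as np

    def build_modes(xyz, rgb, thermal):
        """Concatenate registered per-point attributes into the four data modes.

        xyz: (N, 3) coordinates, rgb: (N, 3) colors, thermal: (N, 1) temperatures.
        """
        return {
            "XYZ": xyz,                                                 # (N, 3)
            "XYZ-T": np.concatenate([xyz, thermal], axis=1),            # (N, 4)
            "XYZ-RGB": np.concatenate([xyz, rgb], axis=1),              # (N, 6)
            "XYZ-RGB-T": np.concatenate([xyz, rgb, thermal], axis=1),   # (N, 7)
        }

    # Random data standing in for one registered multimodal frame.
    n = 2048
    modes = build_modes(np.random.rand(n, 3), np.random.rand(n, 3), np.random.rand(n, 1))
    print({k: v.shape for k, v in modes.items()})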

Keywords: deep neural network; hand segmentation; multimodal imaging; point cloud segmentation; thermal.


Conflict of interest statement

The authors declare no conflict of interest with any organization.

Figures

Figure 1
Workflow for a hand–object segmentation approach using a multimodal 3D sensor system containing a 3D sensor, an RGB camera, and a thermal camera.
Figure 2
A multimodal 3D sensor system consisting of an active stereovision 3D sensor based on GOBO projection, an RGB camera (FLIR Grasshopper3), and a thermal camera (FLIR A35).
Figure 3
(a) A copper–plastic chessboard calibration target (upper) and its principle (bottom). (b) Comparison of calibration images with and without active lighting for color image, NIR image, and thermal image.
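One plausible way to exploit such a target is to run the same chessboard detector on all three modalities once each image has been normalized to 8-bit grayscale. The OpenCV sketch below is only an assumption of how this could look; the file names and board size are placeholders, not values from the paper:

    import cv2
    import numpy as np

    BOARD = (9, 6)  # number of inner corners (assumed)

    def find_corners(img):
        """Normalize any modality to 8-bit grayscale and detect chessboard corners."""
        if img.ndim == 3:
            img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        img = cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
        found, corners = cv2.findChessboardCorners(img, BOARD)
        if found:
            criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3)
            corners = cv2.cornerSubPix(img, corners, (5, 5), (-1, -1), criteria)
        return found, corners

    for path in ["rgb.png", "nir.png", "thermal.png"]:  # placeholder file names
        image = cv2.imread(path, cv2.IMREAD_UNCHANGED)
        if image is not None:
            found, _ = find_corners(image)
            print(path, "chessboard found:", found)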
Figure 4
The principle of PointNet. Point positions are transformed into spatial features by two stages of MLPs, before they are pooled into a global feature vector describing the whole object. Afterwards, a combination of local and global features can be used for segmentation purposes [22].
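The following PyTorch sketch illustrates that idea (per-point shared MLPs, a global max-pooled feature, and a segmentation head on the concatenated local and global features). The layer sizes are illustrative assumptions, not the configuration used in the paper:

    import torch
    import torch.nn as nn

    class TinyPointNetSeg(nn.Module):
        def __init__(self, in_dim=7, num_classes=12):  # 7 = XYZ-RGB-T, 12 classes
            super().__init__()
            self.local_mlp = nn.Sequential(             # per-point ("shared") MLP
                nn.Linear(in_dim, 64), nn.ReLU(),
                nn.Linear(64, 128), nn.ReLU())
            self.global_mlp = nn.Sequential(nn.Linear(128, 256), nn.ReLU())
            self.head = nn.Sequential(                  # per-point classifier on [local, global]
                nn.Linear(128 + 256, 128), nn.ReLU(),
                nn.Linear(128, num_classes))

        def forward(self, pts):                         # pts: (B, N, in_dim)
            local = self.local_mlp(pts)                 # (B, N, 128) per-point features
            glob = self.global_mlp(local).max(dim=1).values       # (B, 256) global descriptor
            glob = glob.unsqueeze(1).expand(-1, pts.shape[1], -1)  # broadcast to every point
            return self.head(torch.cat([local, glob], dim=-1))    # (B, N, num_classes)

    logits = TinyPointNetSeg()(torch.rand(2, 1024, 7))
    print(logits.shape)  # torch.Size([2, 1024, 12])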
Figure 5
The multilevel architecture of PointNet++. Explicit neighborhood search in the point cloud is used to extract local features by means of a locally applied PointNet in multiple stages. These strong local features can be used for object classification (lower branch) or for segmentation (upper branch) [23].
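The explicit neighborhood search can be illustrated with a simple radius ("ball") query. The NumPy sketch below uses an assumed radius and sample count, brute-force distances, and random centroids standing in for farthest point sampling, rather than the optimized grouping of an actual PointNet++ implementation:

    import numpy as np

    def ball_query(points, centers, radius=0.1, k=32):
        """For each center, gather up to k indices of points within `radius`."""
        groups = []
        for center in centers:
            dist = np.linalg.norm(points - center, axis=1)
            idx = np.where(dist < radius)[0][:k]
            if idx.size == 0:                 # fall back to the nearest point
                idx = np.array([np.argmin(dist)])
            groups.append(idx)
        return groups

    points = np.random.rand(4096, 3)
    centers = points[np.random.choice(len(points), 128, replace=False)]
    neighborhoods = ball_query(points, centers)
    print(len(neighborhoods), neighborhoods[0].shape)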
Figure 6
The architecture of the local feature aggregation module of RandLA-Net, which consists of multiple Local Spatial Encoding layers (LocSE) and Attentive Pooling layers (AP). In LocSE, the geometric information of a local area in the point cloud is encoded and then concatenated with the point features for local feature extraction. The local features are further aggregated by an AP layer [24].
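A hedged PyTorch sketch of these two blocks is given below: relative-position encoding concatenated with neighbor features (LocSE), followed by learned softmax weights over the neighborhood (Attentive Pooling). The feature dimensions are illustrative, not those of RandLA-Net:

    import torch
    import torch.nn as nn

    class LocSEAttentivePool(nn.Module):
        def __init__(self, feat_dim=8, out_dim=32):
            super().__init__()
            self.pos_enc = nn.Linear(10, out_dim)   # center, neighbor, offset, distance = 10 dims
            self.score = nn.Linear(out_dim + feat_dim, out_dim + feat_dim)
            self.out = nn.Linear(out_dim + feat_dim, out_dim)

        def forward(self, xyz, neigh_xyz, neigh_feat):
            # xyz: (B, N, 3); neigh_xyz: (B, N, K, 3); neigh_feat: (B, N, K, F)
            center = xyz.unsqueeze(2).expand_as(neigh_xyz)
            offset = neigh_xyz - center
            dist = offset.norm(dim=-1, keepdim=True)
            geo = self.pos_enc(torch.cat([center, neigh_xyz, offset, dist], dim=-1))
            feats = torch.cat([geo, neigh_feat], dim=-1)      # (B, N, K, out_dim + F)
            attn = torch.softmax(self.score(feats), dim=2)    # attention over the K neighbors
            return self.out((attn * feats).sum(dim=2))        # (B, N, out_dim)

    module = LocSEAttentivePool()
    out = module(torch.rand(1, 256, 3), torch.rand(1, 256, 16, 3), torch.rand(1, 256, 16, 8))
    print(out.shape)  # torch.Size([1, 256, 32])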
Figure 7
Overview of the GOBO-Dataset with 12 classes (10 objects, background, and hand): (a) all objects; (b) examples of multimodal 3D data (Box, Human figure doctor, and Kitchen board).
Figure 8
Convergence curves for the training phase of RandLA-Net, trained for 400 epochs on the XYZ, XYZ-T, XYZ-RGB, and XYZ-RGB-T data modes. Upper: the training curve; bottom: the validation curve.
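The segmentation quality behind these curves is reported as mean Intersection over Union in the abstract; a generic way to compute it from per-point predictions (not the authors' evaluation code) is:

    import numpy as np

    def mean_iou(pred, gt, num_classes=12):
        """Mean IoU over classes, computed from per-point label arrays."""
        ious = []
        for c in range(num_classes):
            inter = np.sum((pred == c) & (gt == c))
            union = np.sum((pred == c) | (gt == c))
            if union > 0:                  # skip classes absent from both arrays
                ious.append(inter / union)
        return float(np.mean(ious)), ious

    pred = np.random.randint(0, 12, 10000)
    gt = np.random.randint(0, 12, 10000)
    miou, per_class = mean_iou(pred, gt)
    print(f"mIoU: {miou:.3f}")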
Figure 9
Visualization of experimental results for RandLA-Net on individual samples of the test dataset: The first row shows the ground truth and the segmentations obtained with XYZ, XYZ-T, XYZ-RGB, and XYZ-RGB-T, with the hand class labeled in red. The second row shows the color point cloud, the thermal point cloud, and the feature point clouds generated by XYZ, XYZ-T, XYZ-RGB, and XYZ-RGB-T. In the feature point cloud, the Euclidean distances between a reference point (red point) and all other points are calculated and normalized in feature space. The distances are color coded (light yellow: similar points; dark blue: dissimilar points).
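The feature point cloud in the second row can be reproduced in principle as follows: compute Euclidean distances from the reference point to all points in feature space, normalize them to [0, 1], and map them through a light-to-dark colormap. The sketch below uses random features and matplotlib's reversed viridis map as stand-ins for the paper's features and color coding:

    import numpy as np
    import matplotlib.pyplot as plt

    features = np.random.rand(2048, 32)     # per-point feature vectors (placeholder)
    ref_idx = 0                             # index of the reference point
    dist = np.linalg.norm(features - features[ref_idx], axis=1)
    dist = (dist - dist.min()) / (dist.max() - dist.min() + 1e-12)   # normalize to [0, 1]
    # Small distances (similar points) map to light colors, large distances to dark ones.
    colors = plt.cm.viridis_r(dist)[:, :3]
    print(colors.shape)  # (2048, 3): one RGB color per point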

References

    1. Redmon J., Divvala S., Girshick R., Farhadi A. You only look once: Unified, real-time object detection; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA. 27–30 June 2016; pp. 779–788.
    2. He K., Gkioxari G., Dollár P., Girshick R. Mask R-CNN; Proceedings of the IEEE International Conference on Computer Vision; Venice, Italy. 22–29 October 2017; pp. 2961–2969.
    3. Kirillov A., Wu Y., He K., Girshick R. PointRend: Image segmentation as rendering; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA. 14–19 June 2020; pp. 9799–9808.
    4. Palmero C., Clapés A., Bahnsen C., Møgelmose A., Moeslund T.B., Escalera S. Multi-modal RGB–depth–thermal human body segmentation. Int. J. Comput. Vis. 2016;118:217–239.
    5. Zhao S., Yang W., Wang Y. A new hand segmentation method based on fully convolutional network; Proceedings of the 2018 Chinese Control and Decision Conference (CCDC); Shenyang, China. 9–11 June 2018; pp. 5966–5970.
