DeepFly3D, a deep learning-based approach for 3D limb and appendage tracking in tethered, adult Drosophila

Semih Günel et al. eLife. 2019 Oct 4;8:e48571. doi: 10.7554/eLife.48571.

Abstract

Studying how neural circuits orchestrate limbed behaviors requires the precise measurement of the positions of each appendage in three-dimensional (3D) space. Deep neural networks can estimate two-dimensional (2D) pose in freely behaving and tethered animals. However, the unique challenges associated with transforming these 2D measurements into reliable and precise 3D poses have not been addressed for small animals, including the fly, Drosophila melanogaster. Here, we present DeepFly3D, software that infers the 3D pose of tethered, adult Drosophila using multiple camera images. DeepFly3D does not require manual calibration, uses pictorial structures to automatically detect and correct pose estimation errors, and uses active learning to iteratively improve performance. We demonstrate more accurate unsupervised behavioral embedding using 3D joint angles rather than commonly used 2D pose data. Thus, DeepFly3D enables the automated acquisition of Drosophila behavioral measurements at an unprecedented level of detail for a variety of biological applications.

Keywords: 3D pose estimation; D. melanogaster; animal behavior; computer vision; deep learning; neuroscience; unsupervised classification.


Conflict of interest statement

SG, HR, DM, JC, PR, PF: No competing interests declared.

Figures

Figure 1. Deriving 3D pose from multiple camera views.
(A) Raw image inputs to the Stacked Hourglass deep network. (B) Probability maps output from the trained deep network. For visualization purposes, multiple probability maps have been overlaid for each camera view. (C) 2D pose estimates from the Stacked Hourglass deep network after applying pictorial structures and multi-view algorithms. (D) 3D pose derived from combining multiple camera views. For visualization purposes, 3D pose has been projected onto the original 2D camera perspectives. (E) 3D pose rendered in 3D coordinates. Immobile thorax-coxa joints and antennal joints have been removed for clarity.
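To make the step from panel B to panel C concrete, the sketch below shows one minimal way to read 2D keypoints off per-joint probability maps: take each map's peak location and use the peak value as a confidence score. The array shapes, landmark count, and function name are illustrative assumptions for this sketch; it is not the DeepFly3D implementation.

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps):
    """Read one 2D keypoint per joint from a stack of probability maps.

    heatmaps: array of shape (n_joints, H, W), one probability map per tracked
              landmark, as produced by a Stacked Hourglass-style network.
    Returns (n_joints, 2) pixel coordinates (x, y) and a confidence per joint.
    """
    n_joints, H, W = heatmaps.shape
    keypoints = np.zeros((n_joints, 2))
    confidences = np.zeros(n_joints)
    for j in range(n_joints):
        flat_idx = np.argmax(heatmaps[j])            # location of the strongest response
        y, x = np.unravel_index(flat_idx, (H, W))
        keypoints[j] = (x, y)
        confidences[j] = heatmaps[j, y, x]           # peak value used as a confidence score
    return keypoints, confidences

# Illustrative usage with random maps standing in for network output.
rng = np.random.default_rng(0)
fake_maps = rng.random((15, 128, 128))               # 15 landmarks chosen arbitrarily
kps, conf = heatmaps_to_keypoints(fake_maps)
print(kps.shape, conf.shape)                         # (15, 2) (15,)
```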
Figure 2. Mean absolute error distribution.
(A) PCK (percentage of correct keypoints) accuracy as a function of mean absolute error (MAE) threshold. (B) Evaluating network prediction error in a low-data regime. The Stacked Hourglass network (blue circles) shows near-asymptotic prediction error (red dashed line), even when trained with only 400 annotated images. After 800 annotations, there are minimal improvements in MAE. (C) MAE for different limb landmarks. Violin plots are overlaid with raw data points (white circles).
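As a reference for how a curve like panel A can be produced, here is a minimal sketch of PCK as a function of the pixel-error threshold. The synthetic annotations and the function name are assumptions for illustration, not the authors' evaluation code.

```python
import numpy as np

def pck(pred, gt, thresholds):
    """Percentage of correct keypoints (PCK) versus pixel-error threshold.

    pred, gt: arrays of shape (n_keypoints, 2) with predicted and ground-truth
              2D positions in pixels.
    thresholds: iterable of pixel-error thresholds.
    Returns the fraction of keypoints whose Euclidean error falls below each
    threshold (the kind of curve plotted in panel A).
    """
    errors = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)
    return np.array([(errors < t).mean() for t in thresholds])

# Illustrative usage with synthetic annotations and noisy "predictions".
rng = np.random.default_rng(1)
gt = rng.uniform(0, 480, size=(1000, 2))
pred = gt + rng.normal(scale=5.0, size=gt.shape)     # roughly 5 px of noise
print(pck(pred, gt, thresholds=[5, 10, 25, 50]))
```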
Figure 3. Pose estimation accuracy before and after using pictorial structures.
Pixel-wise 2D pose residuals (top) and their distributions (bottom), (A) before and (B) after applying pictorial structures. Residuals larger than 35 pixels (red circles) represent incorrect keypoint detections. Those below this threshold (blue circles) represent correct keypoint detections.
Figure 4. Unsupervised behavioral classification of 2D pose data is sensitive to viewing angle.
(A) Behavioral map derived using 2D pose data from three adjacent cameras (Cameras 1, 2, and 3) but the same animals and experimental time points. Shown are clusters (black outlines) that are enriched (yellow) or sparsely populated (blue) with data. Different clusters are enriched for data from either (B) camera 1, (C) camera 2, or (D) camera 3. Behavioral embeddings were derived from 1 million frames acquired during 4 s of optogenetic stimulation of MDN>CsChrimson (n = 6 flies, n = 29 trials), aDN>CsChrimson (n = 6 flies, n = 30 trials), and wild-type control animals (MDN-GAL4/+: n = 4 flies, n = 20 trials; aDN-GAL4/+: n = 4 flies, n = 23 trials).
Figure 5. Unsupervised behavioral classification of 3D joint angle data.
Behavioral embeddings were calculated using 3D joint angles from the same 1 million frames used in Figure 4A. (A) Behavioral map combining all data during 4 s of optogenetic stimulation of MDN>CsChrimson (n = 6 flies, n = 29 trials), aDN>CsChrimson (n = 6 flies, n = 30 trials), and wild-type control animals (MDN-GAL4/+: n = 4 flies, n = 20 trials; aDN-GAL4/+: n = 4 flies, n = 23 trials). The same behavioral map is shown with only the data from (B) MDN>CsChrimson stimulation, (C) aDN>CsChrimson stimulation, or (D) control animal stimulation. Associated videos reveal that these distinct map regions are enriched for backward walking, antennal grooming, and forward walking, respectively.
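For orientation, the sketch below shows the two ingredients this analysis rests on: computing a joint angle from three 3D landmarks, and embedding per-frame joint-angle vectors into two dimensions. It uses scikit-learn's TSNE as a stand-in for the embedding step and synthetic angles in place of real data, so it illustrates the idea rather than reproducing the paper's pipeline.

```python
import numpy as np
from sklearn.manifold import TSNE

def joint_angle(p_prox, p_mid, p_dist):
    """Angle (radians) at the middle joint defined by three 3D landmarks."""
    v1 = p_prox - p_mid
    v2 = p_dist - p_mid
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Example joint angle from three made-up 3D points.
print(joint_angle(np.array([0., 0., 0.]),
                  np.array([1., 0., 0.]),
                  np.array([1., 1., 0.])))           # ~1.57 rad (a right angle)

# Synthetic stand-in data: 2000 frames, 12 joint angles per frame.
rng = np.random.default_rng(2)
angles = rng.uniform(0, np.pi, size=(2000, 12))

# Two-dimensional embedding of the per-frame joint-angle vectors.
embedding = TSNE(n_components=2, init="pca", random_state=0).fit_transform(angles)
print(embedding.shape)                               # (2000, 2)
```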
Figure 6. The DeepFly3D pose estimation pipeline.
(A) Data acquisition from the multi-camera system. (B) Training and retraining of the 2D pose estimation network. (C) 3D pose estimation.
Figure 7. The DeepFly3D annotation tool.
This GUI allows the user to manually annotate joint positions on images from each of the seven cameras. Because the tool can be accessed from a web browser, annotation can more easily be distributed across multiple users. A full description of the annotation tool can be found in the online documentation: https://github.com/NeLy-EPFL/DeepFly3D. Scale bar: 50 pixels.
Figure 8. Camera calibration.
(A) Correcting erroneous 2D pose estimates using epipolar relationships. Only 2D pose estimates without large epipolar errors are used for calibration. x_2 represents a 2D pose estimate from the middle camera. Epipolar lines are indicated as blue and red lines on the image plane. (B) The triangulated point, X_T, is computed using the initial camera parameters. However, because each camera's extrinsic parameters are only coarsely initialized, the observations from each camera do not agree with one another and do not yield a reasonable 3D position estimate. (C) The camera locations are corrected by optimizing Equation 7 using only the pruned 2D points, yielding an accurate 3D position estimate.
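To connect this description to code, the following sketch shows linear (DLT) triangulation of a landmark from several camera views and the reprojection error that can be used to prune unreliable 2D estimates; the refinement corresponding to Equation 7 would then adjust the camera parameters by minimizing the summed reprojection error (for example with scipy.optimize.least_squares). The matrix shapes and the two toy cameras are illustrative assumptions, not the paper's calibration code.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X with a 3x4 camera matrix P into pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate(proj_mats, points_2d):
    """Linear (DLT) triangulation of one landmark seen by several cameras."""
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]
    return X[:3] / X[3]                         # dehomogenize

def reprojection_error(P, X, x_obs):
    """Pixel distance between an observation and the reprojected 3D point."""
    return np.linalg.norm(project(P, X) - np.asarray(x_obs))

# Toy usage: two synthetic cameras looking at a known point.
P0 = np.hstack([np.eye(3), np.zeros((3, 1))])
P1 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])   # camera shifted along x
X_true = np.array([0.2, -0.1, 4.0])
obs = [project(P, X_true) for P in (P0, P1)]
X_hat = triangulate([P0, P1], obs)
print(np.allclose(X_hat, X_true), reprojection_error(P0, X_hat, obs[0]))   # True, ~0.0
```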
Figure 9. 3D pose correction for one leg using the MAP solution and pictorial structures.
(A) Candidate 3D pose estimates for each keypoint are created by triangulating local maxima from the probability maps generated by the Stacked Hourglass deep network. (B) For a selection of these candidate estimates, we can assign a probability using Equation 8. However, calculating this probability for every combination of candidate points is computationally intractable. (C) By exploiting the chain structure of Equation 8, we can instead pass probability distributions between layers using a belief propagation algorithm. Messages are passed between layers as a function of the parent nodes, describing the belief of the child nodes about each parent node. Grayscale colors represent the calculated belief of each node; darker colors indicate higher belief. (D) Corrected pose estimates are obtained during the second, backward iteration by selecting the nodes with the largest belief. We discard nodes (x's) that have non-maximal belief during backward message passing. Note that beliefs have been adjusted after forward message passing.
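On a single kinematic chain, the belief propagation described here reduces to a max-product (Viterbi-style) dynamic program: a forward pass accumulates the best achievable score for each candidate and a backward pass selects the maximizing candidates. The sketch below shows that reduction with made-up unary scores (e.g., from probability-map peaks) and pairwise compatibilities (e.g., penalizing implausible segment lengths); it is a simplified illustration, not the paper's Equation 8.

```python
import numpy as np

def map_chain(unary, pairwise):
    """MAP candidate selection on a chain via max-product dynamic programming.

    unary:    list of length n_joints; unary[i] holds a score for each 3D
              candidate of joint i (e.g., derived from probability-map peaks).
    pairwise: list of length n_joints - 1; pairwise[i] holds a compatibility
              score for each pair of candidates of joints i and i + 1
              (e.g., penalizing implausible segment lengths).
    Returns the index of the selected candidate for each joint.
    """
    n = len(unary)
    # Forward pass: propagate the best achievable log-score ("belief").
    msgs = [np.log(unary[0])]
    back = []
    for i in range(1, n):
        scores = msgs[-1][:, None] + np.log(pairwise[i - 1]) + np.log(unary[i])[None, :]
        back.append(np.argmax(scores, axis=0))   # best predecessor for each candidate
        msgs.append(np.max(scores, axis=0))
    # Backward pass: keep the best final candidate, trace its predecessors,
    # and implicitly discard the non-maximal candidates.
    idx = [int(np.argmax(msgs[-1]))]
    for i in range(n - 2, -1, -1):
        idx.append(int(back[i][idx[-1]]))
    return idx[::-1]

# Toy example: a three-joint chain with two 3D candidates per joint.
unary = [np.array([0.9, 0.1]), np.array([0.2, 0.8]), np.array([0.6, 0.4])]
pairwise = [np.full((2, 2), 0.5), np.array([[0.9, 0.1], [0.1, 0.9]])]
print(map_chain(unary, pairwise))   # [0, 1, 1]
```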
Figure 10. Pose correction using pictorial structures.
(A) Raw input data from four cameras, focusing on the pretarsus of the middle left leg. (B) Probability maps for the pretarsus output by the Stacked Hourglass deep network. Two maxima (white arrowheads) are present in the probability map for camera 5; the false positive has the larger unary probability. (C) Raw 2D pose estimates without pictorial structures. The pretarsus label is incorrectly placed (white arrowhead) in camera 5, whereas cameras 4, 6, and 7 are correctly labeled. (D) Corrected pose estimates using pictorial structures. The false positive is removed because of the high error measured with Equation 8. The newly corrected pretarsus label for camera 5 is shown (white arrowhead).
Figure 11. DeepFly3D graphical user interface (GUI) applied to the Human3.6M dataset (Ionescu et al., 2014).
To use the DeepFly3D GUI on any new dataset (Drosophila or otherwise), users can provide an initial small set of manual annotations. Using these annotations, the software calculates the epipolar geometry, performs camera calibration, and trains the 2D pose estimation deep network. A description of how to adapt DeepFly3D to new datasets can be found in the Materials and methods section and, in greater detail, online: https://github.com/NeLy-EPFL/DeepFly3D. This figure is licensed for academic use only and is therefore exempt from the CC-BY 4.0 license.
Figure 12. DeepFly3D graphical user interface (GUI).
The top-left buttons enable operations like 2D pose estimation, camera calibration, and saving the final results. The top-right buttons can be used to visualize the data in different ways: as raw images, probability maps, 2D pose, or the corrected pose following pictorial structures. The bottom-left buttons permit frame-by-frame navigation. A full description of the GUI can be found in the online documentation: https://github.com/NeLy-EPFL/DeepFly3D.
Figure 13. A schematic of the seven-camera spherical treadmill and optogenetic stimulation system used in this study.
Author response image 1. Interpreting and choosing the PCK threshold.
(a) An illustration of the length of the PCK (percentage of correct keypoints) error threshold (scale bar, bottom right) for comparison with limb segment lengths. The 50 pixel threshold is approximately one-third the length of the prothoracic femur. (b) PCK accuracy as a function of mean absolute error (MAE) threshold.
Author response image 2. Triangulation error for each joint as a function of the number of cameras used for calibration.
Mean absolute error (MAE) as a function of the number of cameras used for triangulation and calibration, comparing human annotations with network predictions. Calibration using all seven cameras and human annotations serves as the ground truth (red circles). This ground truth is compared with triangulation and calibration from N cameras using either human annotations (blue circles) or network predictions (yellow circles). A comparison across joints (individual panels) shows that certain joints (e.g., tarsus tips) are more susceptible to increased error when fewer cameras are used.
Author response image 3. MAE for limb and antennal landmarks.
MAE for each landmark, showing the differential distribution of errors. Violin plots are overlaid with raw data points (white circles).

