Nat Methods. 2021 May;18(5):564-573.
doi: 10.1038/s41592-021-01106-6. Epub 2021 Apr 19.

Geometric deep learning enables 3D kinematic profiling across species and environments


Timothy W Dunn et al. Nat Methods. 2021 May.

Abstract

Comprehensive descriptions of animal behavior require precise three-dimensional (3D) measurements of whole-body movements. Although two-dimensional (2D) approaches can track visible landmarks in restrictive environments, performance drops in freely moving animals due to occlusions and appearance changes. Therefore, we designed DANNCE to robustly track anatomical landmarks in 3D across species and behaviors. DANNCE uses projective geometry to construct inputs to a convolutional neural network that leverages learned 3D geometric reasoning. We trained and benchmarked DANNCE using a dataset of nearly seven million frames that relates color videos and rodent 3D poses. In rats and mice, DANNCE robustly tracked dozens of landmarks on the head, trunk, and limbs of freely moving animals in naturalistic settings. We extended DANNCE to datasets from rat pups, marmosets, and chickadees, and demonstrated quantitative profiling of behavioral lineage during development.
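The abstract's description of volumetric inputs built with projective geometry can be made concrete with a short sketch: a grid of 3D points around the animal is projected into each camera with its 3x4 projection matrix, and image colors are sampled back into a voxel volume that a 3D network can consume. This is an illustrative sketch under assumed names and parameters (function name, grid size, volume side length), not the DANNCE implementation.

```python
# Illustrative sketch (not the DANNCE API): build a per-camera 3D input volume by
# projecting a voxel grid into an image with a pinhole camera matrix and sampling pixels.
import numpy as np

def unproject_to_volume(image, P, center, grid_size=64, side_mm=240.0):
    """image: (H, W, 3); P: 3x4 camera projection matrix; center: (3,) volume center in mm.
    grid_size and side_mm are assumed values, not the paper's settings."""
    half = side_mm / 2.0
    ax = np.linspace(-half, half, grid_size)
    xs, ys, zs = np.meshgrid(ax, ax, ax, indexing="ij")
    pts = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3) + center   # voxel centers in world coords
    homog = np.hstack([pts, np.ones((pts.shape[0], 1))])            # homogeneous coordinates (N, 4)
    proj = (P @ homog.T).T                                          # project into the camera (N, 3)
    uv = proj[:, :2] / proj[:, 2:3]                                 # pixel coordinates
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, image.shape[1] - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, image.shape[0] - 1)
    return image[v, u].reshape(grid_size, grid_size, grid_size, 3)  # sampled RGB volume
```

Repeating this for every camera yields one volume per view, which is the kind of geometry-aware input the abstract refers to.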


Figures

Figure 1 ∣. Fully 3D deep learning versus 2D-to-3D triangulation for naturalistic 3D pose detection.
A. Schematic of the methodological approach. B. Left, schematic of a post hoc triangulation approach, in which a 2D pose detection network makes independent predictions of 2D landmark positions in each view and then triangulates detected landmarks. Red arrowhead: error in 2D landmark positioning. Right top, projection of a DLC 3D prediction into a frame from a single view (Supplementary Fig. 1). Right bottom, DLC accuracy as the fraction of timepoints in which at least N of 20 landmarks are successfully tracked in 3D (N = 3 animals, N = 894 landmarks, 75 timepoints). C. The DANNCE approach, in which a 3D volume is constructed from the image in each view, and then these volumes are processed by a 3D CNN to directly predict 3D landmark positions. D. Full schematic of DANNCE.
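For reference, the post hoc triangulation step contrasted in Figure 1B, converting independent per-view 2D detections into one 3D landmark, is typically a direct linear transformation (DLT) solve. The sketch below is a generic DLT implementation with an illustrative function name, not the DLC code itself.

```python
# Generic multi-view DLT triangulation sketch: each calibrated view contributes two
# linear constraints on the homogeneous 3D point, solved by SVD.
import numpy as np

def triangulate_dlt(points_2d, cams):
    """points_2d: list of (u, v) detections, one per camera; cams: matching 3x4 projection matrices."""
    A = []
    for (u, v), P in zip(points_2d, cams):
        A.append(u * P[2] - P[0])   # constraint from the u coordinate
        A.append(v * P[2] - P[1])   # constraint from the v coordinate
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]                      # null-space solution in homogeneous coordinates
    return X[:3] / X[3]             # inhomogeneous 3D landmark position
```

A single bad 2D detection (the red arrowhead in Figure 1B) corrupts this solve, which is the failure mode the fully 3D approach in Figure 1C is designed to avoid.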
Figure 2 ∣. Rat 7M, a training and benchmark dataset for 3D pose detection.
A. Schematic of the Rat 7M collection setup. B. Markers detected by motion capture cameras are triangulated across views to reconstruct the animal’s 3D pose and projected into camera images as labels to train 2D pose detection networks. C. Illustration of the process by which tracked landmarks are used to identify individual behaviors. The temporal dynamics of individual markers are projected onto principal axes of pose (eigenpostures) and transformed into wavelet spectrograms that represent the temporal dynamics at multiple scales. D. tSNE representations of eigenposture and wavelet traces, as well as behavioral density maps and isolated clusters obtained via watershed transform over a density representation of the tSNE space. E. Individual examples from each of the high-level clusters outlined in bold in (D). Reprojection of the same 3D pose onto 3 different views (Top) and 3D rendering of the 3D pose in each example (Bottom). The numbers are the total number of example images for each behavioral category. 728,028 frames with motion capture data where animal speed was below the behavioral categorization threshold are excluded.
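The pose-to-behavior pipeline in Figure 2C-D (eigenpostures, wavelet spectrograms, tSNE embedding, watershed clustering) can be outlined in a few lines. This is a rough sketch of the general approach, not the authors' analysis code; all parameter values (number of eigenpostures, wavelet scales, perplexity, histogram bins, smoothing) are assumptions.

```python
# Sketch of an eigenposture -> wavelet -> tSNE -> watershed behavioral map.
# Note: scipy.signal.cwt is deprecated/removed in recent SciPy; a wavelet package
# (e.g., PyWavelets) can stand in for it.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from scipy.signal import cwt, morlet2
from scipy.ndimage import gaussian_filter
from skimage.segmentation import watershed

def behavioral_map(poses, n_eigen=10, widths=np.geomspace(2, 64, 25)):
    """poses: (T, n_markers*3) array of aligned 3D poses over time (assumed preprocessing)."""
    eig = PCA(n_components=n_eigen).fit_transform(poses)            # eigenposture projections
    spec = np.concatenate(
        [np.abs(cwt(eig[:, i], morlet2, widths)).T for i in range(n_eigen)], axis=1
    )                                                               # multiscale wavelet features (T, n_eigen * n_scales)
    emb = TSNE(n_components=2, perplexity=30).fit_transform(spec)   # 2D behavioral embedding
    hist, _, _ = np.histogram2d(emb[:, 0], emb[:, 1], bins=200)
    density = gaussian_filter(hist, sigma=2.0)                      # behavioral density map
    clusters = watershed(-density)                                  # one label per behavioral cluster
    return emb, density, clusters
```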
Figure 3 ∣. DANNCE outperforms DLC on rats with and without markers.
A. Box plots of Euclidean error for the indicated methods in a recording using a validation animal not used for training. “More data” is from a model trained with 5 animals rather than 4. “Fine-tune” is after fine-tuning the “more data” model with an additional 225 3D poses from this validation animal for each recording session (Supplementary Fig. 4). In DLC (DLT), the direct linear transformation method was used to triangulate across all cameras. DLC (median) takes the median of triangulations for all camera pairs. DANNCE 6-camera landmark positions are computed as the median of all 3-camera predictions (Supplementary Fig. 5C). N = 62,680 markers for all 6-camera methods, N = 1,253,600 markers for all 3-camera methods. Inset colors follow the same legend as to the left; (A-D) use the same color legend. The box plots in (A) and (H) show median with inter-quartile range (IQR) and whiskers extending to 1.5x the IQR. The arithmetic mean is shown as a black square. B. Landmark prediction accuracy as a function of error threshold, for the same data and methods as in (A). C. Fraction of timepoints with the indicated number of markers accurately tracked at a threshold of 18 mm, for the same data and methods as in (A). D. Landmark prediction accuracy at a threshold of 18 mm, broken down by landmark types, for the same data and methods as in (A). E. Examples showing the Euclidean error over time for a section of recording in the validation animal. Thick colored lines at the bottom denote the type of behavior engaged in over time: grooming (all grooming and scratching), active (walking, investigation, and wet dog shake), rearing, and idling (prone still and adjust posture). F-G. Mean Euclidean error (F) and accuracy (G) on the validation subject for DANNCE when using a single camera for prediction, vs. DLC when using two cameras. Squares show the mean error for individual camera sets (6 different single-camera possibilities for DANNCE, 15 different pairs for DLC; N = 9 × 10^6 and 2.25 × 10^7 markers for DANNCE and DLC, respectively). H. Box plots of overall Euclidean error in markerless rats relative to triangulated 3D human labels, for each of the indicated methods. N = 3 animals, N = 721 landmarks. I. Plots showing the mean Euclidean error for the same data and methods as in (H), broken down by landmark type. Each square is the mean and error bars are standard deviation. J. Landmark reconstruction accuracy as a function of error threshold for the same animals and frames as in (H), but with all landmarks pooled across labelers for Human (N = 1,868), and all 20 predictions per frame for DANNCE and DLC (N = 1,980 and 39,600 landmarks for each 6-camera and 3-camera condition, respectively). K. Fraction of all frames with the indicated number of landmarks accurately reconstructed at a threshold of 18 mm, for the same data as in (J). The “Human” line is truncated at 19 landmarks because labelers were unable to see the full set of landmarks in at least 2 views. L. Fraction of all frames fully reconstructed (all 20 landmarks with error below threshold) as a function of the error threshold for the same data as in (J). (I)-(L) use the same colors as in (H).
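The evaluation quantities used throughout Figure 3 (per-landmark Euclidean error, accuracy at an 18 mm error threshold, and the fraction of timepoints with at least a given number of tracked landmarks) reduce to a few lines of array arithmetic. The sketch below uses illustrative function names and array shapes; it is not the paper's evaluation code.

```python
# Sketch of the Figure 3 metrics: Euclidean error, threshold accuracy, and
# fraction of frames with at least k landmarks tracked within threshold.
import numpy as np

def euclidean_error(pred, truth):
    """pred, truth: (T, n_landmarks, 3) arrays in mm; returns (T, n_landmarks) errors."""
    return np.linalg.norm(pred - truth, axis=-1)

def accuracy(err, threshold_mm=18.0):
    """Fraction of landmarks with error below the threshold, ignoring missing labels."""
    valid = ~np.isnan(err)
    return np.mean(err[valid] < threshold_mm)

def frames_with_at_least(err, k, threshold_mm=18.0):
    """Fraction of timepoints in which at least k landmarks fall within the threshold."""
    per_frame = np.sum(err < threshold_mm, axis=1)
    return np.mean(per_frame >= k)
```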
Figure 4 ∣. Kinematic profiling of the mouse behavioral repertoire.
A. Schematic of the high-resolution mouse recording arena. B. Example 3D DANNCE predictions (top), and video reprojections of every third frame (bottom), of a rearing sequence in a mouse not bearing markers. C. Density map (Left), and corresponding low- and high-level clusters (light and dark outlines, respectively, Right) of mouse behavioral space isolated from 3 hours of recording in 3 mice. D. 3D renderings of examples from the indicated behavioral categories in (C). E-H. Left, power spectral density (PSD) for individual landmarks at the indicated anatomical positions in a single walking (E), face grooming (F), and left (G) and right (H) forelimb grooming cluster (N = 44, 41, 333, 33 repetitions, respectively). Right, example kinematic traces (x-velocity only) during a single instance of each behavior for the same markers as to the left. All examples in (E-H) are derived from a single mouse.
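The per-landmark power spectral densities in Figure 4E-H can be approximated with a standard Welch estimate on landmark velocity traces. The sketch below is not the authors' analysis code; the frame rate and segment length are assumed values.

```python
# Sketch of a per-landmark PSD on an x-velocity trace using Welch's method.
import numpy as np
from scipy.signal import welch

def landmark_psd(positions, fs=50.0):
    """positions: (T, 3) trajectory of one landmark in mm; fs: video frame rate in Hz (assumed)."""
    vx = np.gradient(positions[:, 0]) * fs        # x-velocity in mm/s
    freqs, psd = welch(vx, fs=fs, nperseg=256)    # power spectral density of the x-velocity
    return freqs, psd
```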
Figure 5 ∣. DANNCE can report the ontogeny of behavioral complexity in rats.
A. Examples of DANNCE landmark predictions projected into a single camera view, for four different developmental stages. B-C. Box plots of landmark Euclidean error (B) and bar plots of DANNCE mean landmark prediction accuracy (C) in validation subjects for both hand-labeled frames and DANNCE predictions, broken down by landmark type. The box plots show median with IQR and whiskers extending to 1.5x the IQR. The arithmetic mean is also shown as a black square. The mean segment length between landmarks of each type is presented for scale in (B). In (C), blue squares show the landmark prediction accuracy for individual validation subjects. For each developmental timepoint, N = 3 animals and N = 396-417 landmarks. D. Clustered and annotated maps of pup behavior for each developmental timepoint. We scaled the size of the behavioral maps to reflect the diversity of behaviors observed. E. Bar plots of behavioral complexity, defined as the range of pairwise distances between behaviors observed in the dataset, normalized to P7 and shown across different developmental timepoints. Error bars reflect the standard deviation of the complexity across 50 bootstrapped samples. F. Grid quantification of behavioral similarity across developmental stages. For each stage we clustered the behavioral map, identified pairs of clusters across stage pairs with highest similarity, and reported the average highest similarity per cluster. G. Fractions of time spent in four major behavioral categories. Mean values (circles) were adjusted to reflect the fraction observed by humans in Supplementary Fig. 13. Error bars reflect the expected standard deviation in observations based on Poisson statistics (N = 484-65,008 per category, when present). H. Ontogeny of rearing behaviors. In the graphs, individual nodes refer to unique behavioral clusters at each stage. Edges connect nodes whose similarity is greater than a set threshold. Wireframe examples show the diversification of rearing behaviors from one initial cluster at P14. This cluster was linked to a P7 behavioral precursor with similarity below threshold (dotted lines). Gray dots and lines for P7 denote that these clusters were not identifiable as rearing movements.
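The behavioral-complexity measure in Figure 5E, described as the range of pairwise distances between behaviors observed in the dataset with a bootstrap estimate of its spread, can be sketched as below. The feature representation of each behavior and the distance metric are assumptions, not the paper's exact definitions.

```python
# Sketch of a "range of pairwise distances" complexity measure with a simple bootstrap.
import numpy as np
from scipy.spatial.distance import pdist

def behavioral_complexity(behavior_features, n_boot=50, rng=np.random.default_rng(0)):
    """behavior_features: (n_behaviors, d) array, one feature vector per behavioral cluster (assumed)."""
    complexity = np.ptp(pdist(behavior_features))        # range (max - min) of pairwise distances
    boot = []
    for _ in range(n_boot):
        idx = rng.choice(len(behavior_features), size=len(behavior_features), replace=True)
        boot.append(np.ptp(pdist(behavior_features[idx])))
    return complexity, np.std(boot)                      # point estimate and bootstrap spread
```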
Figure 6 ∣. 3D tracking across the behavioral repertoire of marmosets and chickadees.
A. Schematic of the naturalistic marmoset behavioral enclosure and video recording configuration (Supplementary Video 9). B. Box plots of marmoset segment length distances for hand-labeled frames (“Human”; head N = 72 segments; spine N = 52; tail N = 48; limbs N = 93) and DANNCE predictions (head N = 11,689 segments; spine N = 10,462; tail N = 10,472; limbs N = 88,236). Box plots in (B-C, I-J) show median with IQR and whiskers extending to 1.5x IQR. Black squares in the box plots are arithmetic means. C. Box plots of marmoset landmark Euclidean error in validation frames for hand-labeled frames and DANNCE predictions, broken down by landmark type. The mean segment length between landmarks of each type, in the human-annotated data, is presented for scale. N = 560 landmarks for each method. Head N = 105, spine N = 105, tail N = 70, limbs N = 280. Colors use the same key as in (B). D. Left, landmark prediction accuracy as a function of error threshold for the same data as in (C). Color code is the same as in (C). Right, landmark prediction accuracy as a function of error threshold for DANNCE only, broken down by landmark type. E. Plots of 3D animal position over time for a 10-minute recording session, projected onto the x-y (left) and y-z (right) planes of the arena. The color map encodes time from the start of the recording, from blue to purple. F. Top, heat map of tSNE behavioral embeddings from 23 minutes of video (40,020 frames) in a single animal. Bottom, annotated behavioral map. G. Individual examples extracted from clusters in (F). Colors of box outlines correspond to cluster colors. H. Schematic of the chickadee behavioral arena and video recording configuration. Only four of the six cameras are shown (Supplementary Video 10). I. Box plots of chickadee segment length distances for hand-labeled frames (head N = 70 segments; trunk N = 70; wings N = 140; legs N = 280) and DANNCE predictions (head N = 396,000 segments; trunk N = 396,000; wings N = 792,000; legs N = 1,584,000). J. Box plots of chickadee landmark Euclidean error in validation frames for both hand-labeled frames and DANNCE predictions, broken down by landmark type. The mean segment length between landmarks of each type, in the human-annotated data, is presented for scale. N = 310 landmarks for each method. Head N = 62, trunk N = 62, wings N = 62, legs N = 124. Colors use the same key as in (I). K. Landmark prediction accuracy as a function of error threshold for DANNCE only, broken down by landmark type, for the same data as in (J). L. Example DANNCE landmark predictions on the chickadee in the arena. M. Left, heat map of tSNE behavioral embeddings from 2 hours of video in a single animal. Right, annotated behavioral map. N. Individual examples extracted from clusters in (M). Colors of box outlines correspond to cluster colors.
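The segment-length comparison in Figure 6B and 6I, the distance between the two landmarks defining each body segment, computed identically for human labels and DANNCE predictions, reduces to a short function. The skeleton edge list and array shapes below are assumptions for illustration.

```python
# Sketch of per-frame body-segment lengths from 3D poses.
import numpy as np

def segment_lengths(poses, edges):
    """poses: (T, n_landmarks, 3) array; edges: list of (i, j) landmark index pairs defining segments."""
    return np.stack(
        [np.linalg.norm(poses[:, i] - poses[:, j], axis=-1) for i, j in edges], axis=1
    )  # (T, n_segments) lengths, comparable between human labels and DANNCE predictions
```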
