Sci Rep. 2015 Sep 23;5:14351. doi: 10.1038/srep14351.

Analyzing animal behavior via classifying each video frame using convolutional neural networks

Ulrich Stern et al.

Abstract

High-throughput analysis of animal behavior requires software to analyze videos. Such software analyzes each frame individually, detecting animals' body parts. But the image analysis rarely attempts to recognize "behavioral states" (e.g., actions or facial expressions) directly from the image rather than from the detected body parts. Here, we show that convolutional neural networks (CNNs), a machine learning approach that recently became the leading technique for object recognition, human pose estimation, and human action recognition, were able to recognize directly from images whether Drosophila were "on" (standing or walking on) or "off" (not in physical contact with) egg-laying substrates for each frame of our videos. We used multiple nets and image transformations to optimize accuracy for our classification task, achieving a surprisingly low error rate of just 0.072%. Classifying one of our 8 h videos took less than 3 h using a fast GPU. The approach enabled uncovering a novel egg-laying-induced behavior modification in Drosophila. Furthermore, it should be readily applicable to other behavior analysis tasks.


Figures

Figure 1
Figure 1. The problem addressed with neural networks and bird’s-eye view of the methodology.
(a) Sample frame from one of our videos, showing two chambers with one fly in each chamber (white arrows). For the left chamber, the top edge of the chamber sidewall is outlined in yellow, and the two egg-laying substrates at the bottom of the chamber are outlined in white. The yellow arrow points to one of the many eggs laid on the plain substrates. (b) Schematic of the cross section of one chamber. We record through the lid with a camera above the chamber. In all three positions (1, 2, 3) shown, the fly would appear over the egg-laying substrate in a video, but it is on it only in position 1. (c) Sample fly images where the flies appear over the egg-laying substrate, with green and red labels indicating whether they are actually “on” or “off” the substrate. Flies on the sidewall often show a characteristic wing shape (white arrows). Flies on the lid are closer to the camera and appear larger and out of focus (yellow arrow). The fly images here have lower resolution than (a) since we reduced resolution for tracking and classification. (d) CNN in training mode. See text for details. (e) CNN in test mode. For each image the CNN is presented with, it calculates P(“on”), the probability the image is “on” substrate. We considered the net to classify an image as “on” if and only if P(“on”) ≥ 0.5. (f) Overview of our video analysis, which employs both positional tracking by Ctrax and classification by a CNN-based classifier. The classifier uses position information in two ways: first, to extract fly images from the full frame and, second (not shown), to classify only when the fly is over the substrate (the fly is guaranteed to be “off” substrate otherwise). In our videos, the flies were over the substrate typically in about half of the frames.
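The per-frame decision rule of panels (e) and (f) can be sketched as follows: threshold P(“on”) at 0.5, and skip classification entirely when the tracker reports the fly is not over the substrate. The function name and argument encoding are illustrative assumptions, not the paper's code.

```python
def classify_frame(p_on, over_substrate):
    """Label one frame "on" or "off" the egg-laying substrate.

    over_substrate comes from positional tracking (Ctrax): a fly that is
    not over the substrate is guaranteed "off", so the CNN probability
    p_on is consulted only for frames where the fly is over it.
    """
    if not over_substrate:
        return "off"
    return "on" if p_on >= 0.5 else "off"
```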
Figure 2
Figure 2. Architecture of the net we used.
(a) Sample two-layer neural net where layer 2 is convolutional, showing the key ideas of convolution. First, the receptive field of each layer 2 neuron (or unit) is limited to a local subset of layer 1 neurons. Second, there are multiple types of layer 2 neurons (here round and square; the connections for square are not shown). Third, all neurons of the same type in layer 2 share weights; e.g., all red connections have the same weight value, and during learning, the weights of the red connections are adjusted identically. (b) Schematic of the architecture we used, with three convolutional and two fully-connected (fc) layers. As in the architecture that won ILSVRC2012, each convolutional layer was followed by a max-pooling layer (not shown) with overlapping pooling, and we used rectified linear units. For full details of the architecture, see the layer definition file in project yanglab-convnet on Google Code. (c) Visualization of the weights learned by one of our CNNs for the conv1 layer. See text for details.
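The exact filter and pooling sizes live in the layer definition file, not this caption; the sizes below (a 64×64 input, 5×5/5×5/3×3 conv filters, 3×3 pooling with stride 2) are assumptions chosen only to illustrate overlapping pooling (pool window wider than its stride) and how the spatial size shrinks across three conv/pool stages before the fully-connected layers.

```python
def out_size(n, field, stride, pad=0):
    # Standard output-size formula for a conv or pooling layer
    # along one spatial dimension.
    return (n + 2 * pad - field) // stride + 1

# Hypothetical sizes (the real ones are in the layer definition file):
# input 64x64; three conv layers (stride 1), each followed by 3x3
# max-pooling with stride 2 -- overlapping, since 3 > 2.
n = 64
for conv_field in (5, 5, 3):
    n = out_size(n, conv_field, 1)   # convolution
    n = out_size(n, 3, 2)            # overlapping max-pooling
# The remaining n x n feature maps feed the two fully-connected layers.
```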
Figure 3
Figure 3. How we trained the nets.
(a–b) Overview of how we created “on”/“off” labeled fly images for training and testing the CNNs. See text for details. (c) Overview of how we augmented the data. See text for details. (d) Sample images for the four image transformations we used to augment the data. (e) Data augmentation reduces the error rate. The control has all four image transformations enabled during training (“full augmentation”); the other bars show cases with only three of the four transformations enabled (i.e., one transformation disabled). All nets were trained for 800 epochs on the 1/5 training set. n = 30 nets per bar, bars show mean with SD, also for the following panels. One-way ANOVA followed by Dunnett’s test, p < 0.0001. (f) Full augmentation vs. no augmentation. All nets were trained on the 1/5 training set. For full augmentation, 800 epochs were used. For no augmentation, 400 epochs were used since the error rate was lower for 400 epochs (mean 1.677%) than for 800 epochs (mean 1.740%), due to the earlier overfitting without augmentation (see text for Fig. 3g). Welch’s t-test, p < 0.0001, two-tailed. (g) Additional training, up to a point, reduces the error rate. See text for why this is generally the case. All nets were trained on the 1/5 training set with full augmentation. One-way ANOVA followed by Šídák’s test, p < 0.0001. (h) Increasing the size of the training set reduces the error rate when the number of epochs is constant. The numbers of images in the 1/5, 2/5, and 3/5 training sets are 5,400, 2*5,400, and 3*5,400, respectively. All nets were trained for 800 epochs with full augmentation. One-way ANOVA followed by Šídák’s test, p < 0.0001. (i) The 3/5 training set reduced the error rate compared to the 1/5 training set when the total number of images seen by the CNN during training was constant. “1/5 2400e” denotes training on the 1/5 training set for 2400 epochs, etc. All nets were trained with full augmentation. Welch’s t-test, p = 0.026, two-tailed.
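Augmentation of the kind shown in panel (d) can be sketched with numpy. The paper names shift and brightness change among its transformations; the horizontal flip and all parameter ranges below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def shift(img, dx, dy):
    # Translate by (dx, dy) pixels, wrapping at the borders for simplicity.
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def change_brightness(img, delta):
    # Add a constant offset, clipped to the valid 8-bit range.
    return np.clip(img.astype(int) + delta, 0, 255).astype(np.uint8)

def hflip(img):
    # Mirror the image left-to-right.
    return img[:, ::-1]

def augment(img):
    # Apply a random combination of transformations to one training image.
    img = shift(img, int(rng.integers(-2, 3)), int(rng.integers(-2, 3)))
    img = change_brightness(img, int(rng.integers(-10, 11)))
    if rng.random() < 0.5:
        img = hflip(img)
    return img
```

Because the transformations are applied afresh each epoch, the net rarely sees the exact same pixels twice, which is what delays the overfitting seen in panel (g).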
Figure 4
Figure 4. Applying the nets to classifying single images.
(a) Overview of model averaging using n models (CNNs). See text for details. (b) Overview of augmentation for test using m images. See text for details. (c) Model averaging (ma) reduces the error rate. “ma n = 5” denotes model averaging using 5 models, etc. Each bar is based on the same 30 nets, and the bootstrap is used to estimate the mean and variance of model averaging by repeatedly (500 times) sampling with replacement the n nets used for model averaging from the 30 nets. All nets were trained for 800 epochs on the 1/5 training set with full augmentation. Bars show mean with SD, also for the following panels. No statistical tests were run since the bootstrap gives only estimates of the error rate distributions. (d) Augmentation for test reduces the error rate. All nets were trained for 3200 epochs on the 1/5 training set with full augmentation. Same n = 30 nets for the five bars, repeated measures ANOVA with Geisser-Greenhouse correction followed by Šídák’s test, p < 0.0001. (e) Validation and test error rates for our “best nets”, both without and with model averaging. The best nets were trained using full augmentation on the 3/5 training set for 800 epochs and used shift and brightness change augmentation during testing. Same 30 nets for all four bars, model averaging estimated using the bootstrap. (f) Validation set images that were difficult for the best nets. Model averaging of 20 nets and 500 bootstrap repeats were used to determine difficult images. The images are shown with the 2 prior and 2 next frames in the video, which can help humans to assess the cases. See text for a discussion of the first two cases. In the last case, it is unclear why the nets tended to make a mistake. It is possible the darker area close to the head of the fly (white arrow) was mistaken for the characteristic “sidewall wing” shape, an error humans would clearly not make.
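Model averaging (panel a) and augmentation for test (panel b) compose naturally: average P(“on”) over every (model, transformed image) pair and threshold the mean. A minimal sketch, with models and transforms passed in as plain callables (an assumption about the interface, not the paper's code):

```python
import numpy as np

def ensemble_p_on(models, transforms, image):
    """Average P("on") over n models and m test-time transformations."""
    probs = [model(t(image)) for model in models for t in transforms]
    return float(np.mean(probs))

def classify(models, transforms, image):
    # Same 0.5 threshold as for a single net, applied to the averaged probability.
    return "on" if ensemble_p_on(models, transforms, image) >= 0.5 else "off"
```

With, say, n = 5 models and m = 3 transformations, 15 probabilities are averaged per image; the cost of test-time augmentation scales linearly in n*m forward passes.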
Figure 5
Figure 5. Applying the nets to videos.
(a) Overview of how we applied the nets to videos. See text for details. (b) Fly image sequence from consecutive frames classified by the nets. The fly walks from the sidewall onto the substrate in this case. Unlike earlier in the paper, the labels no longer represent human classification but now represent the nets’ classification. (c) Fly image sequence where the majority filter fixed a mistake of the nets. The same sequence is shown before (top) and after (bottom) the fix, with the new (correct) label in yellow. See text for details of the majority filter.
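A majority filter over the per-frame labels can be sketched as below. The window width (5 frames, i.e., radius 2) is an assumption for illustration; the paper's actual filter is specified in the text.

```python
def majority_filter(labels, radius=2):
    """Smooth a sequence of "on"/"off" labels by majority vote in a
    sliding window of 2*radius + 1 frames (truncated at the ends)."""
    n = len(labels)
    out = []
    for i in range(n):
        window = labels[max(0, i - radius): i + radius + 1]
        out.append(max(set(window), key=window.count))
    return out
```

A single misclassified frame inside a run of correct labels, as in panel (c), is outvoted by its neighbors and flipped back.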
Figure 6
Figure 6. “On”/“off” classification for one 8 h sucrose vs. plain experiment.
(a) Chamber image with egg-laying substrates outlined. (b) Visualization of the egg-laying events for one 8 h sucrose vs. plain experiment. The fly laid 46 eggs, all on the plain substrate, during the 8 hours. (c) “On”/“off” classification for the same 8 h sucrose vs. plain experiment.
Figure 7
Figure 7. Drosophila females show increased sucrose contact prior to egg-laying.
(a) Chamber image with sample trajectory for 20 s interval before visit to plain substrate. For this trajectory, the fly is “off” substrate for all frames, mostly on the sidewall. (b) Sample 20 s interval before plain visit with “on”/“off” classification. The green line at the end of the interval represents the first frame of the plain visit following the interval. During this plain visit, the female may or may not lay eggs. (c–d) 20 s intervals before plain visits with (c) and without (d) egg-laying for sucrose vs. plain assay. All intervals are from an 8 h video. The egg-laying times were manually annotated and are given next to the intervals in (c). The intervals for (d) were randomly chosen among the plain visits without egg-laying. (e) 20 s intervals before plain visits with egg-laying for plain vs. plain assay. All intervals are from an 8 h video and represent about half of the egg-laying events—those laid on one of the two plain sites. About an equal number of eggs was laid on the other plain (“opposite plain”) site. The egg-laying times were manually annotated and are given next to the intervals. (f) Fractions of 20 s intervals with visit to sucrose, plain, or opposite plain for the three cases from (c–e). The 20 s intervals for the four sucrose vs. plain (S-P) bars are from 10 flies, each recorded for 8 hours, yielding 540 intervals before plain visits with egg-laying and 400 randomly chosen intervals before plain visits without egg-laying (40 per fly). The 20 s intervals for the plain vs. plain bar are from 5 flies, each recorded for 8 hours, yielding 340 intervals before plain visits with egg-laying. Same n = 10 flies for first four bars, n = 5 flies for last bar, repeated measures ANOVA with Geisser-Greenhouse correction followed by Šídák’s test, p < 0.0001, and Welch’s t-test, p < 0.0001, two-tailed. Using the Bonferroni correction to adjust for the additional comparison (t-test) does not change significance.
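The fractions in panel (f) reduce to counting, per condition, how many 20 s intervals contain at least one visit to the site in question. A sketch under the assumption that each interval is encoded as the set of sites visited during it (a hypothetical encoding):

```python
def fraction_with_visit(intervals, site):
    # intervals: one set of visited sites per 20 s interval.
    # Returns the fraction of intervals containing a visit to `site`.
    return sum(1 for visited in intervals if site in visited) / len(intervals)
```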