Curr Biol. 2021 Jul 12;31(13):2785-2795.e4. doi: 10.1016/j.cub.2021.04.014. Epub 2021 May 4.

Explaining face representation in the primate brain using different computational models

Le Chang et al.

Abstract

Understanding how the brain represents the identity of complex objects is a central challenge of visual neuroscience. The principles governing object processing have been extensively studied in the macaque face patch system, a sub-network of inferotemporal (IT) cortex specialized for face processing. A previous study reported that single face patch neurons encode axes of a generative model called the "active appearance" model, which transforms 50D feature vectors separately representing facial shape and facial texture into facial images. However, a systematic investigation comparing this model to other computational models, especially convolutional neural network models that have shown success in explaining neural responses in the ventral visual stream, has been lacking. Here, we recorded responses of cells in the most anterior face patch, anterior medial (AM), to a large set of real face images and compared a large number of models for explaining neural responses. We found that the active appearance model better explained responses than any other model except CORnet-Z, a feedforward deep neural network trained on general (non-face) object classification, whose performance it tied on some face image sets and exceeded on others. Surprisingly, deep neural networks trained specifically on facial identification did not explain neural responses well. A major reason is that units in these networks, unlike AM neurons, are only weakly modulated by face-related factors unrelated to facial identification, such as illumination.

Keywords: computational model; electrophysiology; face processing; inferotemporal cortex; neural coding; primate vision.


Conflict of interest statement

Declaration of interests The authors declare no competing interests.

Figures

Figure 1. Stimulus and analysis paradigm.
A, 2100 facial photos from multiple face databases were used in this experiment. Three examples are shown. B, Images were presented to the animal while recording from the most anterior face patch, AM (anterior medial face patch). The electrode track targeting AM is shown in coronal MRI slices from two animals. C, Each facial image was analyzed using 9 different models. For comparison, the same number of features was extracted from the units of each model using principal component analysis (PCA). D, Different models were compared with respect to how well they could predict neuronal responses to faces. A 10-fold cross-validation paradigm was employed for quantification: the 2100 faces were evenly divided into 10 groups. For each neuron, responses to 9 groups were fit by linear regression using features of a particular face model, and responses to the remaining 210 faces were predicted using the same linear transform. To quantify prediction accuracy, we compared each predicted response, in the space of population responses, with both the actual response to the target face and the response to a distractor face. If the angle between the predicted and target responses was smaller than that between the predicted and distractor responses, the prediction was counted as correct. All pairs of faces served as both target and distractor, and the proportion of correct predictions was computed.
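The cross-validated prediction scheme in panel D can be sketched in a few lines of numpy. This is a minimal illustration with synthetic data, not the authors' code: the array sizes match the paper (2100 faces, 50 features, 148 cells), but the feature and response matrices here are randomly generated stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the paper's data: 2100 faces, 50 model
# features per face (after PCA), and 148 recorded AM neurons.
n_faces, n_feat, n_cells = 2100, 50, 148
features = rng.standard_normal((n_faces, n_feat))
responses = features @ rng.standard_normal((n_feat, n_cells)) \
            + 0.5 * rng.standard_normal((n_faces, n_cells))  # synthetic "neural" data

# 10-fold cross-validation: fit a linear regression on 9 folds,
# predict the held-out fold with the same linear transform.
folds = np.array_split(rng.permutation(n_faces), 10)
predicted = np.empty_like(responses)
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n_faces), test_idx)
    X = np.column_stack([features[train_idx], np.ones(len(train_idx))])  # intercept
    beta, *_ = np.linalg.lstsq(X, responses[train_idx], rcond=None)
    Xt = np.column_stack([features[test_idx], np.ones(len(test_idx))])
    predicted[test_idx] = Xt @ beta

# Pairwise accuracy: a prediction is "correct" when the predicted population
# vector makes a smaller angle (higher cosine similarity) with the target
# face's actual response than with a distractor face's response.
def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

p, r = unit(predicted), unit(responses)
sim = p @ r.T                    # sim[i, j] = cos(predicted_i, actual_j)
target = np.diag(sim)[:, None]   # similarity to the correct face
correct = target > sim           # compare against every distractor
np.fill_diagonal(correct, False)
accuracy = correct.sum() / (n_faces * (n_faces - 1))
print(f"pairwise prediction accuracy: {accuracy:.3f}")
```

With strongly linear synthetic data the accuracy approaches 1; for real neural data it would sit between chance (0.5) and the noise ceiling.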
Figure 2. Comparing how well different models of face coding can explain AM neuronal responses to facial images.
A, For each model, 50 features were extracted using PCA and used to predict responses of AM neurons. Upper: Explained variances are plotted for each model. For each neuron, explained variance was normalized by the noise ceiling of that neuron (see STAR Methods). Error bars represent s.e.m. for 148 cells. CORnet-Z performed significantly better than the other models (p<0.001 in all cases except the 2D Morphable Model; p<0.01 between CORnet-Z and the 2D Morphable Model, Wilcoxon signed-rank test), and the 2D Morphable Model performed significantly better than the remaining models (p<0.01). Lower: Encoding errors are plotted for each model. Error bars represent s.e.m. for 2100 target faces (i.e., error was computed for each target face when comparing to 2099 distractors, and s.e.m. was computed for the 2100 errors). CORnet-Z performed significantly better than the other models (p<0.001 in all cases except the 2D Morphable Model; p<0.01 between CORnet-Z and the 2D Morphable Model, Wilcoxon signed-rank test), and the 2D Morphable Model performed significantly better than the remaining models (p<0.001). B, To remove differences between models arising from differential encoding of image background, face images with uniform background were presented to different models (see STAR Methods). CORnet-Z and the 2D Morphable Model performed significantly better than the other models (p<0.001), with no significant difference between the two models (p=0.30 for explained variance; p=0.79 for encoding error). C, To create facial images without hair, each facial image in the database was fit using a 3D Morphable Model (left). The fits were used as inputs to each model. For example, a new 2D Morphable Model was constructed by morphing the fitted images to an average shape. 50 features were extracted from each of the models using PCA for comparison. D, Same as C, but for 110 features.
For 50 features, the 2D Morphable Model performed significantly better than the other models (p<0.001), while there was no significant difference between the 3D Morphable Model and CORnet-Z (p=0.19 for explained variance; p=0.21 for encoding error) or between the 3D Morphable Model and AlexNet (p=0.56 for explained variance; p=0.06 for encoding error). For 110 features, the 3D Morphable Model outperformed all other models (p<0.01 between the 2D Morphable Model and 3D Morphable Model for explained variance; p<0.001 in all other cases). Also see Figures S1, S2, and S3.
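The caption's normalization of explained variance by each neuron's noise ceiling can be done in several ways; one common approach (not necessarily the exact procedure in the paper's STAR Methods) is a split-half reliability estimate with a Spearman-Brown correction. A minimal sketch with synthetic repeated trials for one hypothetical neuron:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical repeated presentations: 2100 faces x 4 trials for one neuron.
n_faces, n_trials = 2100, 4
true_rate = rng.standard_normal(n_faces)
trials = true_rate[:, None] + 0.7 * rng.standard_normal((n_faces, n_trials))

# Split-half reliability: correlate the mean response over even trials with
# the mean over odd trials, then Spearman-Brown correct to estimate the
# reliability of the full trial-averaged response.
even = trials[:, ::2].mean(axis=1)
odd = trials[:, 1::2].mean(axis=1)
r_half = np.corrcoef(even, odd)[0, 1]
ceiling = 2 * r_half / (1 + r_half)  # reliability of the full average
print(f"estimated noise ceiling (correlation): {ceiling:.3f}")
```

A model's raw explained variance for that neuron would then be divided by the squared ceiling, so a perfect model of the stimulus-driven response scores 1.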
Figure 3. Comparing how well AM neuronal responses to facial images can explain different models of face coding.
A, For each model, 50 features were extracted using PCA, and responses of AM neurons were used to predict the model features. Decoding errors are plotted for each model. Error bars represent s.e.m. for 2100 target faces (i.e., error was computed for each target face when comparing to 2099 distractors, and s.e.m. was computed for the 2100 errors). CORnet-Z performed significantly better than the other models (p<0.001) except the 2D Morphable Model (p=0.08, Wilcoxon signed-rank test), and the 2D Morphable Model performed significantly better than the remaining models (p<0.01). B, To remove differences between models arising from differential encoding of image background, face images with uniform background were presented to different models (see STAR Methods). CORnet-Z and the 2D Morphable Model performed significantly better than the other models (p<0.001), with only a small difference between the two models (p=0.03). C, To create facial images without hair, each facial image in the database was fit using a 3D Morphable Model (left). The fits were used as inputs to each model. For example, a new 2D Morphable Model was constructed by morphing the fitted images to an average shape. 50 features were extracted from each of the models using PCA for comparison. D, Same as C, but for 110 features. For 50 features, the 2D Morphable Model and 3D Morphable Model performed significantly better than the other models (p<0.001), with no significant difference between the two models (p=0.42). For 110 features, the 3D Morphable Model outperformed all other models (p<0.001). Also see Figures S1, S2, and S3.
Figure 4. Measuring neural variance uniquely explained by the 2D Morphable Model and other models.
Fifty model features from the 2D Morphable Model and fifty from a second model were concatenated, and neural responses were fit using all 100 features as regressors. The contribution of each individual model was then subtracted from the jointly explained variance, quantifying how well the non-overlapping components of the two models predict neural responses. A, Percentage of neural variance uniquely explained by various models compared to the 2D Morphable Model, for images after background removal (cf. Figure 2B). B, Percentage of neural variance uniquely explained by the 2D Morphable Model compared to other models. C and D, same as A and B, but for images fit by the 3D Morphable Model (cf. Figure 2C). Error bars represent s.e.m. for 148 cells. See Figure S3G-H for a layer-wise analysis of neural variance uniquely explained by AlexNet and CORnets compared to the 2D Morphable Model and vice versa. Also see Figure S4.
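The unique-variance partition described above can be sketched as follows. This is a simplified, in-sample illustration with synthetic data (the paper uses cross-validated fits); the feature matrices and the single simulated neuron are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 2100 faces, two models with 50 features each, one neuron.
n = 2100
feats_a = rng.standard_normal((n, 50))   # e.g. 2D Morphable Model features
feats_b = rng.standard_normal((n, 50))   # e.g. features from a CNN layer
# Synthetic neuron driven mostly by model A, weakly by model B, plus noise.
y = feats_a @ rng.standard_normal(50) \
    + 0.3 * (feats_b @ rng.standard_normal(50)) \
    + rng.standard_normal(n)

def r_squared(X, y):
    """Fraction of variance in y explained by a linear fit on X."""
    X1 = np.column_stack([X, np.ones(len(X))])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

r2_a = r_squared(feats_a, y)
r2_b = r_squared(feats_b, y)
r2_joint = r_squared(np.hstack([feats_a, feats_b]), y)

# Variance uniquely explained by each model:
# joint fit minus the other model's fit alone.
unique_a = r2_joint - r2_b
unique_b = r2_joint - r2_a
print(f"unique to A: {unique_a:.3f}, unique to B: {unique_b:.3f}")
```

Because the simulated neuron weights model A far more heavily, `unique_a` comes out much larger than `unique_b`, mirroring the logic of panels A versus B.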
Figure 5. Vgg-face features and AlexNet features show a marked difference in coding illumination levels.
A, Similarity matrices were computed for 913 faces from the CAS-PEAL database using AM population responses (A1) and features of two network models, AlexNet (A2) and Vgg-face (A3). Each entry indicates the correlation between the representations of two faces. The difference between the two matrices derived from the network models was computed (A4), and its rows and columns were reordered according to the first principal component of the difference matrix (A5). The red squares outline face pairs drawn from the first and last 100 faces: these pairs showed significantly higher representational similarity under Vgg-face than under AlexNet. B, The first 100 and last 100 faces along the direction of PC1 were divided into 20 groups of 10 faces, and an average face after shape normalization was generated for each group. Also see Figure S5.
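The similarity-matrix comparison in panel A can be sketched with numpy. This is an illustrative reconstruction under synthetic data, not the authors' pipeline: `rep_a` and `rep_b` are hypothetical stand-ins for the two networks' feature representations, and the "first principal component" of the symmetric difference matrix is taken here as its leading eigenvector.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-ins: representations of 913 faces under two models.
n_faces, dim = 913, 50
rep_a = rng.standard_normal((n_faces, dim))                # e.g. AlexNet features
rep_b = rep_a + 0.8 * rng.standard_normal((n_faces, dim))  # e.g. Vgg-face features

def similarity_matrix(rep):
    """Pairwise Pearson correlation between face representations (rows)."""
    return np.corrcoef(rep)

# Difference between the two models' similarity matrices (cf. A4).
diff = similarity_matrix(rep_b) - similarity_matrix(rep_a)

# Reorder faces along the leading component of the (symmetric) difference
# matrix (cf. A5); pairs at the two extremes are where the models disagree most.
eigvals, eigvecs = np.linalg.eigh(diff)
pc1 = eigvecs[:, np.argmax(np.abs(eigvals))]
order = np.argsort(pc1)
diff_sorted = diff[np.ix_(order, order)]

# The corner blocks of diff_sorted correspond to the first/last 100 faces
# highlighted by the red squares in the figure.
print(diff_sorted.shape)
```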

References

    1. Freiwald WA, and Tsao DY (2010). Functional Compartmentalization and Viewpoint Generalization Within the Macaque Face-Processing System. Science 330, 845-851.
    2. Cootes TF, Edwards GJ, and Taylor CJ (2001). Active Appearance Models. IEEE Trans Pattern Anal Mach Intell 23, 681-685.
    3. Edwards GJ, Taylor CJ, and Cootes TF (1998). Interpreting face images using Active Appearance Models. Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, 300-305.
    4. Chang L, and Tsao DY (2017). The Code for Facial Identity in the Primate Brain. Cell 169, 1013-1028.e14.
    5. Parkhi OM, Vedaldi A, and Zisserman A (2015). Deep Face Recognition. British Machine Vision Conference 2015.
