2020 May 8;3(1):221.
doi: 10.1038/s42003-020-0945-x.

Convolutional neural networks explain tuning properties of anterior, but not middle, face-processing areas in macaque inferotemporal cortex

Rajani Raman et al. Commun Biol. 2020.

Abstract

Recent computational studies have emphasized layer-wise quantitative similarity between convolutional neural networks (CNNs) and the primate ventral visual stream. However, whether such similarity holds for the face-selective areas, a subsystem of the higher visual cortex, is not clear. Here, we extensively investigate whether CNNs exhibit tuning properties previously observed in different macaque face areas. Simulating four past experiments on a variety of CNN models, we sought the model layer that quantitatively matches the multiple tuning properties of each face area. Our results show that higher model layers explain the properties of anterior areas reasonably well, while no layer simultaneously explains the properties of middle areas, consistently across model variations. Thus, some similarity may exist between CNNs and the primate face-processing system in the near-goal representation, but much less clearly in the intermediate stages, which calls for alternative modeling such as non-layer-wise correspondence or different computational principles.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Schema of our investigation to compare the macaque face-processing network and a CNN model.
We simulate four previous experiments (left image sets) on a CNN model (bottom middle) to identify tuning properties (bottom right). We quantitatively compare the tuning properties between each macaque face patch (from the past experiments) and each CNN layer (from the present simulations) to find their correspondence.
Fig. 2. View-identity tuning.
Each plot shows the population response similarity matrix for one model layer. The pixel values of the matrix indicate the pairwise correlation coefficients (legend) between the population responses to face images. The elements of the matrix are grouped according to view (indicated by the images along the axes), with the same order of identities within each group.
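As a rough sketch of how such a similarity matrix could be computed from simulated activations, the following Python/NumPy snippet assumes a hypothetical array of layer responses (8 views × 25 identities, one row per image); the shapes, ordering, and random data are placeholders rather than the authors' actual stimuli or pipeline, and the matrix is built from pairwise Pearson correlations.

    import numpy as np

    # Hypothetical layer responses: one row per face image, one column per unit.
    # Images are assumed to be ordered by view and then by identity, as in Fig. 2.
    rng = np.random.default_rng(0)
    responses = rng.standard_normal((8 * 25, 4096))  # e.g., 8 views x 25 identities

    # Population response similarity matrix: pairwise Pearson correlation
    # between the unit-response vectors of every pair of images.
    rsm = np.corrcoef(responses)   # shape: (n_images, n_images)
    print(rsm.shape, rsm[0, 0])    # diagonal entries equal 1.0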
Fig. 3. Size invariance.
a Examples of face and non-face object stimuli of various sizes (sizes in pixels given beneath). b The average responses to the face images (red) and to the object images (blue) for each image size (x-axis) in each model layer. The “bline” label stands for the average baseline response to the blank image.
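A minimal sketch of the kind of summary plotted in panel b, assuming hypothetical per-size response arrays and a blank-image baseline (all names, sizes, and data below are illustrative, not the study's actual stimuli):

    import numpy as np

    # Hypothetical responses of one layer's units (20 images x 256 units per size).
    sizes = [28, 56, 112, 224]               # placeholder pixel sizes
    rng = np.random.default_rng(1)
    face_resp = {s: rng.random((20, 256)) + 0.5 for s in sizes}
    obj_resp = {s: rng.random((20, 256)) for s in sizes}
    baseline = 0.1 * rng.random(256)         # response to the blank image

    # Average population response to faces vs. objects at each size,
    # expressed relative to the blank-image baseline (cf. Fig. 3b).
    for s in sizes:
        face_mean = (face_resp[s] - baseline).mean()
        obj_mean = (obj_resp[s] - baseline).mean()
        print(f"size {s:>3}px: face {face_mean:.3f}, object {obj_mean:.3f}")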
Fig. 4. Shape-appearance tuning.
a Illustration of the first shape and first appearance dimensions for frontal faces, showing how varying these dimensions changes the image. The images shown correspond to feature vectors that are all zero except that the indicated dimension is set to −3, 0, or 3. b The distribution of shape-preference indices for each model layer (blue), in comparison to the corresponding distributions for ML (gray) and AM (pink) estimated using Fig. 1e of the experimental study.
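The caption does not give the shape-preference index formula; one plausible definition, sketched below under that assumption, contrasts the norms of the shape and appearance halves of each unit's spike-triggered average (STA) in the 50-dimensional shape-appearance space (all data are synthetic placeholders):

    import numpy as np

    # Hypothetical STA vectors in the 50-d shape-appearance space:
    # rows = units, first 25 columns = shape dims, last 25 = appearance dims.
    rng = np.random.default_rng(2)
    sta = rng.standard_normal((300, 50))

    shape_norm = np.linalg.norm(sta[:, :25], axis=1)
    app_norm = np.linalg.norm(sta[:, 25:], axis=1)

    # Assumed index form: +1 = pure shape preference, -1 = pure appearance preference.
    spi = (shape_norm - app_norm) / (shape_norm + app_norm)
    print(spi.mean(), spi.min(), spi.max())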
Fig. 5. View tolerance.
a Illustration of the first shape and first appearance dimensions for profile faces. The images correspond to feature vectors that are all zero except that the indicated dimension is set to −3, 0, or 3. b The correlation between the frontal and profile STAs across units for each dimension (x-axis; 1–25: shape, 26–50: appearance). Each plot compares the results from a model layer (red and blue) and AM (black), the latter replotted from Fig. 6d of the experimental study. The shaded region indicates the 99% confidence interval of randomly shuffled data from the model. c The analogous result for the left half-profile view. d The mean STA correlation (averaged over the feature dimensions) between each non-frontal view (x-axis) and the frontal view, for each layer (color; see legend).
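The per-dimension correlation in panel b can be illustrated as follows, assuming hypothetical STA matrices estimated separately from frontal and profile stimuli (synthetic data; the correlation structure is invented for the example):

    import numpy as np

    # Hypothetical STAs (units x 50 shape-appearance dimensions) estimated
    # separately from frontal and from profile face stimuli.
    rng = np.random.default_rng(3)
    sta_frontal = rng.standard_normal((300, 50))
    sta_profile = 0.6 * sta_frontal + 0.8 * rng.standard_normal((300, 50))

    # For each feature dimension, correlate the two STAs across units (cf. Fig. 5b).
    dim_corr = np.array([
        np.corrcoef(sta_frontal[:, d], sta_profile[:, d])[0, 1]
        for d in range(50)
    ])
    print(dim_corr[:25].mean(), dim_corr[25:].mean())  # shape dims vs. appearance dims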
Fig. 6. Facial geometry tuning.
a Examples of cartoon face images varying one feature parameter (inter-eye distance in this case). Each of the 19 feature parameters ranges from −5 to +5, where ±5 corresponds to the extreme features and 0 corresponds to the mean features. b The distribution of the number of features to which each unit is significantly tuned. c The distribution of the number of units significantly tuned to each feature. Each plot compares the result from a model layer (blue) with that from ML (gray), replotted from Fig. 3 of the experimental study.
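The caption does not state which significance test defines "significantly tuned"; as a stand-in only, the sketch below applies a one-way ANOVA across the 11 values of one feature parameter for one hypothetical unit (synthetic responses; both the test and the threshold are assumptions):

    import numpy as np
    from scipy.stats import f_oneway

    # Hypothetical responses of one unit to cartoon faces in which a single
    # feature parameter takes the 11 values -5..+5 (10 images per value).
    rng = np.random.default_rng(4)
    values = np.arange(-5, 6)
    resp_by_value = [rng.random(10) + 0.05 * v for v in values]

    # Stand-in tuning test: one-way ANOVA across parameter values.
    f_stat, p = f_oneway(*resp_by_value)
    print(f"F = {f_stat:.2f}, p = {p:.3g}, tuned: {p < 0.01}")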
Fig. 7. Contrast polarity tuning.
a Examples of mosaic-like cartoon face images with various intensity assignments to the face parts. The first three have a larger intensity on the forehead than on the left eye; the last three have the opposite. b The distribution of contrast polarity preferences in each model layer (blue and red) in comparison to ML (gray), replotted from Fig. 3A of the experimental study. In each plot, the upper half gives the positive polarities (part A > part B), while the lower half gives the negative polarities (part A < part B). The binary table at the bottom indicates the 55 part-pairs; for each pair, the upper black block denotes part A and the lower block denotes part B.
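A simplified sketch of how a contrast polarity preference could be read out for one hypothetical unit: for each of the 55 part-pairs, compare the mean response when part A is brighter than part B with the response under the opposite polarity (the sign-only summary and the random data are illustrative assumptions, not the study's exact procedure):

    import numpy as np

    # Hypothetical mean responses of one unit for all 55 part-pairs,
    # under the two contrast polarities of each pair.
    rng = np.random.default_rng(5)
    resp_A_gt_B = rng.random(55)   # part A brighter than part B
    resp_A_lt_B = rng.random(55)   # part A darker than part B

    # Polarity preference per pair: +1 if the unit prefers A > B, -1 otherwise.
    preference = np.where(resp_A_gt_B > resp_A_lt_B, 1, -1)
    print(preference[:10], int(preference.sum()))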
Fig. 8. Summary of comparison between layers of AlexNet-Face and face-patches.
a The correlation between the RSM from each layer (Fig. 2) and each face patch (AM/AL/ML; Fig. 4d–f of the corresponding experimental study). Each shaded region shows the ±2SD range of correlations from random cases, i.e., correlations between the experimental RSM and repeatedly generated random RSMs (“Methods”). b The size invariance index for each layer (Fig. 3b) and for face patches (equal for AM/AL/ML; Fig. S10C of the corresponding experimental study). c The mean shape-preference index for each layer (Fig. 4b) compared with the mean indices for AM, ML, and their midpoint (estimated using Fig. 1e of the corresponding experimental study). Each shaded region shows the 95% confidence interval constructed by 200 iterations of bootstrapping on the experimental data (“Methods”). Note that the mean SPIs for layers 1 to 4 exceed this interval for the midpoint. d The mean STA correlation for each layer (Fig. 5b) and AM (Fig. 6D of the corresponding experimental study). The shaded region shows the ±2SD range of mean correlations between random STA vectors for the same population size as each layer (“Methods”). e The cosine similarity between the distributions of the number of tuned features per unit (red) or the number of tuned units per feature (blue) for each layer (Fig. 6) and ML (Fig. 3 of the corresponding experimental study). f The cosine similarity between the distributions of contrast polarity preferences for each layer (Fig. 7b) and ML (Fig. 3A of the corresponding experimental study). In e, f, each shaded region shows the ±2SD range of cosine similarities between the experimental distribution and randomly generated distributions (“Methods”).
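For the cosine-similarity comparisons in panels e and f, a minimal sketch is given below: it compares two hypothetical count distributions and derives a ±2SD chance range from randomly generated distributions, loosely mirroring the shaded regions (the bin counts and the null model are placeholders, not the study's exact procedure):

    import numpy as np

    def cosine_similarity(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Hypothetical histograms (e.g., number of tuned units per feature) for a
    # model layer and for ML, as count vectors over the same 19 bins.
    rng = np.random.default_rng(6)
    layer_hist = rng.integers(0, 50, size=19).astype(float)
    ml_hist = rng.integers(0, 50, size=19).astype(float)
    sim = cosine_similarity(layer_hist, ml_hist)

    # Rough chance range (+/- 2SD) from randomly generated distributions.
    null = [cosine_similarity(rng.random(19), ml_hist) for _ in range(1000)]
    lo, hi = np.mean(null) - 2 * np.std(null), np.mean(null) + 2 * np.std(null)
    print(f"similarity {sim:.3f}, chance range [{lo:.3f}, {hi:.3f}]")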
Fig. 9. Summary of layer-patch comparisons for pre-trained and untrained networks.
The format of each plot (a–f) is analogous to Fig. 8 (omitting the ±2SD regions in d because of the variety of architectures). The networks, including AlexNet-Face, are indicated by different line styles (see the legend at the bottom). In the view-identity tuning plot (a), we omit the comparison with AL data for visibility. In the size invariance plot (b), we slightly shift each curve vertically, also for visibility. For VGG-Face, the seven layers shown are those with receptive field sizes closest to the corresponding layers in AlexNet.
Fig. 10. Summary of layer-patch comparisons for different architectures.
The format of each plot (a–f) is similar to Fig. 8, except that the x-axis shows the normalized depth (0 corresponds to the lowest layer and 1 to the highest layer). The architectures differ in depth or in the number of convolution filters, indicated by different colors or line styles: AF-5 to AF-9 vary the depth, while AF-h halves and AF-d doubles the number of filters in each layer (see the legend at the bottom).
