Proc Natl Acad Sci U S A. 2021 Jan 19;118(3):e2014196118. doi: 10.1073/pnas.2014196118.

Unsupervised neural network models of the ventral visual stream


Chengxu Zhuang et al. Proc Natl Acad Sci U S A. 2021.

Abstract

Deep neural networks currently provide the best quantitative models of the response patterns of neurons throughout the primate ventral visual stream. However, such networks have remained implausible as a model of the development of the ventral stream, in part because they are trained with supervised methods requiring many more labels than are accessible to infants during development. Here, we report that recent rapid progress in unsupervised learning has largely closed this gap. We find that neural network models learned with deep unsupervised contrastive embedding methods achieve neural prediction accuracy in multiple ventral visual cortical areas that equals or exceeds that of models derived using today's best supervised methods and that the mapping of these neural network models' hidden layers is neuroanatomically consistent across the ventral stream. Strikingly, we find that these methods produce brain-like representations even when trained solely with real human child developmental data collected from head-mounted cameras, despite the fact that these datasets are noisy and limited. We also find that semisupervised deep contrastive embeddings can leverage small numbers of labeled examples to produce representations with substantially improved error-pattern consistency to human behavior. Taken together, these results illustrate a use of unsupervised learning to provide a quantitative model of a multiarea cortical brain system and present a strong candidate for a biologically plausible computational theory of primate sensory learning.

Keywords: deep neural networks; unsupervised algorithms; ventral visual stream.


Conflict of interest statement

The authors declare no competing interest.

Figures

Fig. 1.
Improved representations from unsupervised neural networks based on deep contrastive embeddings. (A) Schematic of one high-performing deep contrastive embedding method, the local aggregation (LA) algorithm (41). In LA, all images were embedded into a lower-dimensional space by a DCNN, which was optimized, for the current input (red dot), to minimize the distance to close points (blue dots) and to maximize the distance to the farther points (black dots). (B) (Left) Change in the embedding distribution before and after training. For each image, cosine similarities to all other images were computed and ranked; the ranked similarities were then averaged across all images. This metric indicates that the optimization encourages local clustering in the space without aggregating everything. (Right) Average neighbor-embedding “quality” as training progresses, defined as the fraction of the 10 closest neighbors sharing the query image’s ImageNet class label (labels were not used in training). (C) Top four closest images in the embedding space. The Top three rows show images that were successfully classified using a weighted K-nearest-neighbor (KNN) classifier in the embedding space (K = 100), while the Bottom three rows show unsuccessfully classified examples (G, ground truth; P, prediction). Even when proximity in the unsupervised embedding does not align with ImageNet class (which itself can be somewhat arbitrary given the complexity of the natural scenes in each image), nearby images in the embedding are nonetheless related in semantically meaningful ways. (D) Visualizations of the local aggregation embedding space using multidimensional scaling (MDS). Classes with high validation accuracy are shown at Left and low-accuracy classes at Right. Gray boxes show examples of images from a single class (“trombone”) that were embedded in two distinct subclusters. (E) Transfer performance of unsupervised networks on four evaluation tasks: object categorization, object pose estimation, object position estimation, and object size estimation. Networks were first trained by unsupervised methods and then assessed on transfer performance with supervised linear readouts from network hidden layers (Materials and Methods). Red bars are contrastive embedding tasks, blue bars are self-supervised tasks, orange bars are predictive coding methods and the AutoEncoder, the brown bar is the untrained model, and the black bar is the model supervised on ImageNet category labels. Error bars are standard deviations across three networks with different initializations and four train-validation splits. We used unpaired t tests to measure the statistical significance of the difference between each unsupervised method and the supervised model. Methods without any annotation are significantly worse than the supervised model (P < 0.05); n.s., no significant difference; **, significantly better with 0.001 < P < 0.01; ***, significantly better with P < 0.001 (SI Appendix, Fig. S2).
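For intuition about the contrastive objective sketched in (A), the following is a minimal sketch of a local-aggregation-style loss, assuming L2-normalized embeddings and a memory bank of stored embeddings; it is an illustration of the idea, not the authors' implementation, and `memory_bank`, `close_idx`, `background_idx`, and `temperature` are illustrative names.

```python
# Minimal sketch of a local-aggregation-style contrastive loss (illustrative only).
import torch

def local_aggregation_loss(embedding, memory_bank, close_idx, background_idx,
                           temperature=0.07):
    """embedding: (D,) L2-normalized DCNN output for the current image (red dot).
    memory_bank: (N, D) stored L2-normalized embeddings of all images.
    close_idx / background_idx: indices of close (blue) and background (black) neighbors."""
    sims = memory_bank @ embedding / temperature           # (N,) scaled similarities
    close = torch.logsumexp(sims[close_idx], dim=0)        # mass on close neighbors
    background = torch.logsumexp(sims[background_idx], dim=0)
    # Pull the current embedding toward its close neighbors relative to the
    # broader background neighborhood (minimize the negative log-ratio).
    return background - close
```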
Fig. 2.
Quantifying similarity of unsupervised neural networks to visual cortex data. (A) After being trained with unsupervised objectives, networks were run on all stimuli for which neural responses were collected. Network unit activations from each convolutional layer were then used to predict the V1, V4, and IT neural responses with regularized linear regression (51). For each neuron, the Pearson correlation between the predicted responses and the recorded responses was computed on held-out validation images and then corrected by the noise ceiling of that neuron (Materials and Methods). The median of the noise-corrected correlations across neurons for each of several cortical brain areas was then reported. (B) Neural predictivity of the most-predictive neural network layer. Error bars represent bootstrapped standard errors across neurons and model initializations (Materials and Methods). Predictivity of untrained and supervised categorization networks represents negative and positive controls, respectively. Statistical significance of the difference between each unsupervised method and the supervised model was computed through bootstrapping methods. The methods with comparable neural predictivity are labeled with “n.s.,” and other methods without any annotations are significantly worse than the supervised model (P<0.05) (SI Appendix, Fig. S5). (C) Neural predictivity for each brain area from all network layers, for several representative unsupervised networks, including AutoEncoder, colorization, and local aggregation.
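The following is a minimal sketch of the neural-predictivity pipeline in (A). Ridge regression stands in here for the regularized linear regression of ref. 51, and `noise_ceiling` is assumed to be a precomputed per-neuron reliability estimate; all names are illustrative.

```python
# Sketch: predict neural responses from one model layer, score with
# noise-corrected Pearson correlation, and report the median across neurons.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def neural_predictivity(layer_activations, neural_responses, noise_ceiling,
                        alpha=1.0, seed=0):
    """layer_activations: (n_images, n_units) activations from one model layer.
    neural_responses: (n_images, n_neurons) recorded responses to the same images.
    noise_ceiling: (n_neurons,) per-neuron noise ceiling."""
    X_tr, X_val, Y_tr, Y_val = train_test_split(
        layer_activations, neural_responses, test_size=0.25, random_state=seed)
    Y_pred = Ridge(alpha=alpha).fit(X_tr, Y_tr).predict(X_val)
    # Noise-corrected Pearson correlation for each neuron on held-out images.
    corrected = np.array([
        pearsonr(Y_pred[:, n], Y_val[:, n])[0] / noise_ceiling[n]
        for n in range(Y_val.shape[1])
    ])
    # The median across neurons is reported for each cortical area.
    return np.median(corrected)
```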
Fig. 3.
Learning from real-world developmental datastreams. (A) Schematic of the VIE method. Frames were sampled into sequences of varying lengths and temporal densities and then embedded into a lower-dimensional space using static (single-image) or dynamic (multiimage) pathways. These pathways were optimized to aggregate each resulting embedding with its close neighbors (light brown points) and to separate it from its farther neighbors (dark brown points). (B) Examples from the SAYCam dataset (59), which was collected from head-mounted cameras worn by infants for 2 h each week between ages 6 and 36 mo. (C) Neural predictivity for models trained on SAYCam and ImageNet. n.s., the difference is not significant (P > 0.05); *** and *, significant difference (P = 0.0008 for V4 and P = 0.023 for IT). Error bars represent bootstrapped standard errors across neurons and model initializations. Statistical significance of the difference was computed through bootstrapping methods (SI Appendix, Fig. S12).
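As a rough illustration of the frame-sampling step in (A), here is a short sketch (not the VIE implementation) of drawing clips of varying length and temporal density from a video; `static_net`, `dynamic_net`, and the specific clip lengths and strides are hypothetical choices.

```python
# Illustrative sketch of sampling single frames and multiframe clips from a video.
import numpy as np

def sample_clip(frames, clip_len, stride, rng=None):
    """Sample `clip_len` frames spaced `stride` apart from a (T, H, W, C) array."""
    rng = rng or np.random.default_rng()
    max_start = len(frames) - (clip_len - 1) * stride  # assumes the video is long enough
    start = rng.integers(0, max_start)
    return frames[start : start + clip_len * stride : stride]

def embed_video(frames, static_net, dynamic_net):
    # Static pathway sees a single frame; the dynamic pathway sees a multiframe clip.
    single = sample_clip(frames, clip_len=1, stride=1)
    clip = sample_clip(frames, clip_len=8, stride=2)
    return static_net(single[0]), dynamic_net(clip)
```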
Fig. 4.
Behavioral consistency and semisupervised learning. (A) In the LLP method (64), DCNNs generated an embedding and a category prediction for each example. The embedding of an unlabeled input was used to infer its pseudolabel from its labeled neighbors (colored points), with voting weights determined by their distances from that embedding and by their local density (the highlighted areas). The DCNNs were then optimized, with per-example confidence weightings (color brightness), so that the category prediction matched the pseudolabel, while the embedding was attracted toward embeddings sharing the same pseudolabel and repelled from the others. (B) To measure behavioral consistency, we trained linear classifiers from each model’s penultimate layer on a set of images from 24 classes (21, 49). The resulting image-by-category confusion matrix was compared to data from humans performing the same two-alternative forced-choice task, in which each trial started with a 500-ms fixation point, presented the image for 100 ms, and then required the subject to choose between the true category and a distractor category, shown for 1,000 ms (21, 49). We report the Pearson correlation corrected by the noise ceiling. (C) Example confusion matrices of human subjects and a model (the LLP model trained with 36,000 labels). Each category contributed 10 test images for computing the confusion matrices. (D) Behavioral consistency of DCNNs trained with different objectives. Green bars are semisupervised models trained with 36,000 labels. “Few-Label” is a ResNet-18 trained on ImageNet with only 36,000 labeled images, the same number of labels used by the MT and LLP models. Error bars are standard deviations across three networks with different initializations. (E and F) Behavioral consistency (E) and categorization accuracy in percentage (F) of semisupervised models trained with differing numbers of labels.
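The following is a minimal sketch (not the authors' LLP code) of distance-weighted pseudolabel voting among labeled neighbors, as described in (A). The local-density term and the attraction/repulsion loss are omitted, and all names are illustrative.

```python
# Sketch: infer a pseudolabel and a per-example confidence for an unlabeled embedding.
import numpy as np

def infer_pseudolabel(query_emb, labeled_embs, labels, n_classes, temperature=0.1):
    """query_emb: (D,) embedding of an unlabeled example.
    labeled_embs: (M, D) embeddings of its labeled neighbors; labels: (M,) ints."""
    # Cosine-similarity-based voting weights: closer neighbors vote more strongly.
    sims = labeled_embs @ query_emb / (
        np.linalg.norm(labeled_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    weights = np.exp(sims / temperature)
    votes = np.zeros(n_classes)
    np.add.at(votes, labels, weights)              # accumulate weighted votes per class
    pseudolabel = int(votes.argmax())
    confidence = votes[pseudolabel] / votes.sum()  # per-example confidence weighting
    return pseudolabel, confidence
```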


References

1. Carandini M., et al., Do we know what the early visual system does? J. Neurosci. 25, 10577–10597 (2005).
2. Movshon J. A., Thompson I. D., Tolhurst D. J., Spatial summation in the receptive fields of simple cells in the cat’s striate cortex. J. Physiol. 283, 53–77 (1978).
3. Majaj N. J., Hong H., Solomon E. A., DiCarlo J. J., Simple learned weighted sums of inferior temporal neuronal firing rates accurately predict human core object recognition performance. J. Neurosci. 35, 13402–13418 (2015).
4. Yamane Y., Carlson E. T., Bowman K. C., Wang Z., Connor C. E., A neural code for three-dimensional object shape in macaque inferotemporal cortex. Nat. Neurosci. 11, 1352–1360 (2008).
5. Hung C. P., Kreiman G., Poggio T., DiCarlo J. J., Fast readout of object identity from macaque inferior temporal cortex. Science 310, 863–866 (2005).
