Review

Curr Opin Behav Sci. 2019 Dec;30:100-108. doi: 10.1016/j.cobeha.2019.07.004.

Learning to See Stuff


Roland W Fleming et al. Curr Opin Behav Sci. 2019.
Free PMC article

Abstract

Materials with complex appearances, like textiles and foodstuffs, pose challenges for conventional theories of vision. But recent advances in unsupervised deep learning provide a framework for explaining how we learn to see them. We suggest that perception does not involve estimating physical quantities like reflectance or lighting. Instead, representations emerge from learning to encode and predict the visual input as efficiently and accurately as possible. Neural networks can be trained to compress natural images or to predict frames in movies without 'ground truth' data about the outside world. Yet, to succeed, such systems may automatically discover how to disentangle distal causal factors. Such 'statistical appearance models' potentially provide a coherent explanation of both failures and successes in perception.

Figures

Figure 1
Learning to see stuff. (a) Substances such as tweed, leather, and scrambled eggs evoke rich material impressions. (b) Physical parameters (here, azimuth and elevation angle) determine the retinal image (‘forward optics’). Neighbouring physical parameters can give rise to wildly different images (the tangled pink grid), and most possible images look like meaningless noise (cyan dots). Unsupervised learning can discover ‘statistical appearance models’, comprising latent variables that efficiently capture the variation among natural images. (c) Deep neural networks can learn powerful latent codes capturing natural image variations. After training to encode 70 000 real human faces from the FFHQ dataset ([75]; https://github.com/NVlabs/ffhq-dataset; images are public domain as defined under the Creative Commons CC0 1.0 license), a network was able to generate completely novel face images such as the nine shown, which do not correspond to any existing person (generated by Jordan Suchow using the PixelVAE network described in Ref. [29••]).
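As a concrete illustration of how such latent codes can be learned without labels, here is a minimal sketch of a variational autoencoder in PyTorch. It is not the PixelVAE of Ref. [29••] (which uses an autoregressive decoder); the layer sizes, the 64 × 64 input resolution, and the 128-dimensional latent space are illustrative assumptions, not values from the paper.

```python
# Minimal VAE sketch: learns a latent code from images alone, no labels.
# Architecture details are illustrative assumptions.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Encoder: image -> parameters of a Gaussian over latent variables
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
        )
        self.to_mu = nn.Linear(64 * 16 * 16, latent_dim)
        self.to_logvar = nn.Linear(64 * 16 * 16, latent_dim)
        # Decoder: latent vector -> reconstructed image
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 32 -> 64
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterisation trick: sample a latent code differentiably
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # ELBO objective: reconstruction error plus a KL term that keeps
    # the latent code close to a standard Gaussian.
    rec = nn.functional.mse_loss(recon, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

# After training, decoding random latent vectors yields novel images
# that correspond to no training example:
# novel_faces = TinyVAE().decoder(torch.randn(9, 128))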
Figure 2
Unsupervised image compression can discover natural material types. Top right: schematic of an autoencoder network trained on images of natural textures. Images are passed through four convolutional layers with successively fewer units, before being expanded back to the original dimensionality. The learning objective is to minimise the pixelwise difference between original and reconstructed images. Bottom left: by applying the dimensionality reduction method tSNE [76] to 3000 images depicting fur, gravel, or wool, we see that these categories are highly intermixed in image space. The tSNE algorithm embeds high-dimensional data into two dimensions for visualisation, while preserving local distances between nearby points as faithfully as possible. Bottom right: when the same algorithm is applied to the representations of the images within the trained autoencoder’s latent code, strong clusters emerge corresponding to the natural material types.
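The pipeline in this figure can be sketched in a few lines of PyTorch and scikit-learn. The filter counts and the 64 × 64 input resolution below are assumptions; only the overall structure follows the caption: four strided convolutional layers, a pixelwise reconstruction objective, and tSNE applied to the latent code.

```python
# Sketch of the Figure 2 pipeline: convolutional autoencoder trained on
# texture images, then tSNE on its latent codes. Layer widths and the
# 64x64 resolution are illustrative assumptions.
import torch
import torch.nn as nn
from sklearn.manifold import TSNE

class TextureAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Four conv layers with successively fewer spatial locations,
        # bottlenecking the image into a compact latent code.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 8 -> 4
        )
        # Expand back to the original dimensionality.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_step(model, images, optimiser):
    # Learning objective: minimise the pixelwise difference between
    # original and reconstructed images.
    loss = nn.functional.mse_loss(model(images), images)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

def embed_latents(model, images):
    # Embed each image's latent code in 2D for visualisation, as in the
    # bottom-right panel of Figure 2.
    with torch.no_grad():
        codes = model.encoder(images).flatten(start_dim=1).numpy()
    return TSNE(n_components=2).fit_transform(codes)
```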
Figure 3
Unsupervised video prediction can discover physical scene properties. A recurrent network of the PredNet architecture [32••] trained to predict the next frame in a simple simulated world of rotating checkered cubes. Deeper layers attempt to predict activation in preceding layers (green feedback arrows), while lower layers send up prediction errors (red feedforward arrows) and each layer propagates its current state to the next time point using LSTM units (purple recurrent arrows). Top right: Visualised activations of individual units in response to three frames of a video (brighter pixel values indicate stronger activation to a location in the frame). The unit visualised in the first row responds almost exclusively to the shadow cast by the object, but not to other shadows in the environment or to dark regions on the object. The unit visualised in the second row responds almost exclusively to moving reflectance edges on the object, but not to moving shadow edges or to still edges.
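The self-supervised objective behind Figure 3 can be illustrated with a much simpler model than PredNet: the sketch below trains an encoder-LSTM-decoder network whose output at time t is scored against the frame at time t+1, so the training signal is the prediction error and no labels about the world are needed. It omits PredNet's layerwise prediction-error units and feedback pathways; the frame size and layer widths are assumptions.

```python
# Minimal next-frame prediction sketch (not the PredNet architecture of
# Ref. [32..], just the same self-supervised objective). Frame size
# (3x64x64) and layer widths are illustrative assumptions.
import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),  # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(), # 32 -> 16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, hidden),
        )
        # Recurrent state carries information forward in time, analogous
        # to the purple LSTM arrows in Figure 3.
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.decode = nn.Sequential(
            nn.Linear(hidden, 64 * 16 * 16),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, video):
        # video: (batch, time, 3, 64, 64)
        b, t = video.shape[:2]
        feats = self.encode(video.reshape(b * t, *video.shape[2:]))
        states, _ = self.lstm(feats.reshape(b, t, -1))
        preds = self.decode(states.reshape(b * t, -1)).reshape_as(video)
        # The prediction at time i targets the frame at time i+1; the
        # prediction error is the entire training signal.
        return nn.functional.mse_loss(preds[:, :-1], video[:, 1:])
```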


References

    1. Adelson E.H. On seeing stuff: the perception of materials by humans and machines. In: Rogowitz B.E., Pappas T.N., editors. Proceedings SPIE Human Vision and Electronic Imaging VI. 2001;vol 4299:1–12.
    2. Anderson B.L. Visual perception of materials and surfaces. Curr Biol. 2011;21:R978–R983. - PubMed
    3. Zaidi Q. Visual inferences of material changes: color as clue and distraction. Wiley Interdiscip Rev Cogn Sci. 2011;2:686–700. - PMC - PubMed
    4. Fleming R.W. Visual perception of materials and their properties. Vis Res. 2014;94:62–75. - PubMed
       This article argued that rather than estimating the physical properties of objects and materials, the visual system may instead infer ‘statistical appearance models’: perceptual representations that describe the ways the proximal stimulus associated with a given material typically varies.
    5. Fleming R.W. Material perception. Ann Rev Vis Sci. 2017;3:365–388. - PubMed
