Skip to main page content
Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
, 21 (19), 1641-6

Reconstructing Visual Experiences From Brain Activity Evoked by Natural Movies


Reconstructing Visual Experiences From Brain Activity Evoked by Natural Movies

Shinji Nishimoto et al. Curr Biol.


Quantitative modeling of human brain activity can provide crucial insights about cortical representations [1, 2] and can form the basis for brain decoding devices [3-5]. Recent functional magnetic resonance imaging (fMRI) studies have modeled brain activity elicited by static visual patterns and have reconstructed these patterns from brain activity [6-8]. However, blood oxygen level-dependent (BOLD) signals measured via fMRI are very slow [9], so it has been difficult to model brain activity elicited by dynamic stimuli such as natural movies. Here we present a new motion-energy [10, 11] encoding model that largely overcomes this limitation. The model describes fast visual information and slow hemodynamics by separate components. We recorded BOLD signals in occipitotemporal visual cortex of human subjects who watched natural movies and fit the model separately to individual voxels. Visualization of the fit models reveals how early visual areas represent the information in movies. To demonstrate the power of our approach, we also constructed a Bayesian decoder [8] by combining estimated encoding models with a sampled natural movie prior. The decoder provides remarkable reconstructions of the viewed movies. These results demonstrate that dynamic brain activity measured under naturalistic conditions can be decoded using current fMRI technology.

Conflict of interest statement

The authors declare no conflict of interest.


Figure 1
Figure 1. Schematic diagram of the motion-energy encoding model
A, Stimuli first pass through a fixed set of nonlinear spatio-temporal motion-energy filters (shown in detail in panel B), and then through a set of hemodynamic response filters fit separately to each voxel. The summed output of the filter bank provides a prediction of BOLD signals. B, The nonlinear motion-energy filter bank consists of several filtering stages. Stimuli are first transformed into the Commission internationale de l'éclairage (CIE) L*A*B* color space and the color channels are stripped off. Luminance signals then pass through a bank of 6,555 spatio-temporal Gabor filters differing in position, orientation, direction, spatial and temporal frequency (see Supplemental Information for details). Motion energy is calculated by squaring and summing Gabor filters in quadrature. Finally, signals pass through a compressive nonlinearity and are temporally down-sampled to the fMRI sampling rate (1 Hz).
Figure 2
Figure 2. The directional motion-energy model capture motion information
A, (top) The static encoding model includes only Gabor filters that are not sensitive to motion. (bottom) Prediction accuracy of the static model is shown on a flattened map of the cortical surface of one subject (S1). Prediction accuracy is relatively poor. B, The non-directional motion-energy encoding model includes Gabor filters tuned to a range of temporal frequencies, but motion in opponent directions is pooled. Prediction accuracy of this model is better than the static model. C, The directional motion-energy encoding model includes Gabor filters tuned to a range of temporal frequencies and directions. This model provides the most accurate predictions of all models tested. D and E, Voxel-wise comparisons of prediction accuracy between the three models. The directional motion-energy model performs significantly better than the other two models, although the difference between the non-directional and directional motion models is small. See also Figure S1 for subject- and area-wise comparisons. F, The spatial receptive field of one voxel (left), and its spatial and temporal frequency selectivity (right). This receptive field is located near the fovea, and it is high-pass for spatial frequency and low-pass for temporal frequency. This voxel thus prefers static or slow speed motion. G, Receptive field for a second voxel. This receptive field is located lower periphery, and it is band-pass for spatial frequency and high-pass for temporal frequency. This voxel thus prefers higher speed motion than the voxel in F. H, Comparison of retinotopic angle maps estimated using (top) the motion-energy encoding model and (bottom) conventional multi-focal mapping on a flattened cortical map [47]. The angle maps are similar, even though they were estimated using independent data sets and methods. I, Comparison of eccentricity maps estimated as in panel H. The maps are similar except in the far periphery where the multi-focal mapping stimulus was coarse. J, Optimal speed projected on to a flattened map as in panel H. Voxels near the fovea tend to prefer slow speed motion, while those in the periphery tend to prefer high speed motion. See also Figure S1B for subject-wise comparisons.
Figure 3
Figure 3. Identification analysis
A, Identification accuracy for one subject (S1). The test data in our experiment consisted of 486 volumes (seconds) of BOLD signals evoked by the test movies. The estimated model yielded 486 volumes of BOLD signals predicted for the same movies. The brightness of the point in the mth column and nth row represents the log-likelihood (see Supplemental Information) of the BOLD signals evoked at the mth second given the BOLD signal predicted at the nth second. The highest log-likelihood in each column is designated by a red circle and thus indicates the choice of the identification algorithm. B, Temporal offset between the correct timing and the timing identified by the algorithm, for the same subject shown in panel A. The algorithm was correct to within ± one volume (second) 95% of the time (464/486); chance performance is less than 1% (3/486; i.e., three volumes centered at the correct timing). C, Scaling of identification accuracy with set size. To understand how identification accuracy scales with size of stimulus set we enlarged the identification stimulus set to include additional stimuli drawn from a natural movie database (but not actually used in the experiment). For all three subjects identification accuracy (within ± one volume) is greater than 75% even when set of potential movies includes one million clips. This is far above chance (gray dashed line).
Figure 4
Figure 4. Reconstructions of natural movies from BOLD signals
A, First row: Three frames from a natural movie used in the experiment, taken one second apart. Second through sixth rows: frames from the five clips with the highest posterior probability. The maximum a posteriori (MAP) reconstruction is shown in row two. Seventh row: The averaged high posterior (AHP) reconstruction. The MAP provides good reconstruction of the second and third frames, while AHP provide more robust reconstructions across frames. B and C, Additional examples of reconstructions, format same as in panel A. D, Reconstruction accuracy (correlation in motion-energy; see Supplemental Information) for all three subjects. Error bars indicate ± 1 s.e.m. across one-second clips. Both the MAP and AHP reconstructions are significant, though the AHP reconstructions are significantly better than the MAP reconstructions. Dashed lines show chance performance (P=0.01). See also Figure S2.

Similar articles

See all similar articles

Cited by 143 articles

See all "Cited by" articles

Publication types