2018 Jan 17;97(2):450-461.e9. doi: 10.1016/j.neuron.2017.12.007. Epub 2017 Dec 28.

Exploration Disrupts Choice-Predictive Signals and Alters Dynamics in Prefrontal Cortex

R Becket Ebitz et al. Neuron.

Abstract

In uncertain environments, decision-makers must balance two goals: they must "exploit" rewarding options but also "explore" in order to discover rewarding alternatives. Exploring and exploiting necessarily change how the brain responds to identical stimuli, but little is known about how these states, and transitions between them, change how the brain transforms sensory information into action. To address this question, we recorded neural activity in a prefrontal sensorimotor area while monkeys naturally switched between exploring and exploiting rewarding options. We found that exploration profoundly reduced spatially selective, choice-predictive activity in single neurons and delayed choice-predictive population dynamics. At the same time, reward learning was increased in brain and behavior. These results indicate that exploration is related to sudden disruptions in prefrontal sensorimotor control and rapid, reward-dependent reorganization of control dynamics. This may facilitate discovery through trial and error.

Keywords: attention; control dynamics; decision-making; exploration; frontal eye fields; goal states; indeterminacy; learning; prefrontal cortex; sensorimotor control.


Conflict of interest statement

Declaration of Interests:

The authors declare no competing interests.

Figures

Figure 1
Figure 1. Task design and goal state identification
A) The task (top) was to choose between three probabilistically rewarded targets, one of which was placed in the receptive field of an FEF neuron (dotted circle). Bottom: Reward probabilities (lines) and choices (dots) for 200 example trials. Gray bars highlight explore-labeled choices. B) The distribution of times between switch decisions (inter-switch intervals). A single probability of switching, or a continuous range of switch probabilities, would produce exponentially distributed inter-switch intervals. Dotted black line: the maximum likelihood fit for a single discrete exponential distribution. Solid blue line: a mixture of two exponential distributions, with each component distribution in dotted blue. The two components reflect one fast-switching time constant (average interval: 1.6 trials) and one persistent time constant (17.2 trials). Inset: The log likelihood of mixtures of 1 to 4 exponential distributions. See also figure S1. C) A hidden Markov model, based on the different time constants for switching, was used to infer the goal state on each trial from the sequence of choices. The model included one persistent state for each target (“exploit”) and one state in which the subjects were equally likely to choose any of the three targets (“explore”).
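The inter-switch-interval analysis in panel B (one fast and one persistent switching time constant) can be sketched as an EM fit of a two-component exponential mixture. This is a plain-numpy illustration, not the authors' code: the initialization, iteration count, and use of a continuous (rather than discrete) exponential are all assumptions.

```python
import numpy as np

def fit_exponential_mixture(intervals, n_iter=200):
    """EM fit of a two-component exponential mixture to inter-switch
    intervals. Returns mixture weights and component means (time constants).
    A sketch; the paper fit discrete exponentials by maximum likelihood."""
    x = np.asarray(intervals, dtype=float)
    tau = np.array([1.0, 10.0])   # assumed init: one fast, one slow component
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each interval
        dens = (w / tau) * np.exp(-x[:, None] / tau)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and component means
        w = r.mean(axis=0)
        tau = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    return w, tau
```

Comparing the fitted log likelihood for mixtures of 1 to 4 components (as in the inset) would then identify two components as the preferred model.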
Figure 2
Figure 2. Features of explore and exploit-labeled choices
A) The reward history filter preceding transitions into explore states (Wiener kernel analysis). B) Choice as a function of true reward probability for explore and exploit choices. X axis: difference between each target and the mean of the alternatives. C) Difference in reaction time and peak velocity between explore and exploit choices. D) The probability that the monkeys would switch targets on the next trial, given this trial's outcome and goal state. Inset: Exploit choices enlarged to show error bars. E) The effect of past reward outcomes on switch decisions as a function of time since the outcome (x-axis) and state at the time of the outcome (colors). * p < 0.05, paired t-test, n = 28 sessions. Data are normalized for illustration only; statistics were run on non-normalized data.
Figure 3
Figure 3. Target selectivity during exploration in single units
A) A single neuron from monkey O. The cartoon illustrates the relative positions of the RF target (Tin, red) and the two non-RF targets (Tout, blue). Target selective firing rate measured during exploit choices (solid lines) and explore choices (dotted lines). B) Same as A, but for a multi-unit recorded in monkey B. C) Target selectivity across the population of recorded units (n = 574) during exploit choices (top), and explore choices (bottom). Red = Tin, blue = ipsilateral Tout, green = contralateral Tout. D) The target selectivity index averaged over all single neurons (monkey O, n = 83; monkey B, n = 48), plotted across time. Inset: Firing rate was suppressed for Tin choice and increased for Tout choice. Bottom: Difference in the target selectivity index between explore and exploit, averaged over single neurons. Thick lines in both top and bottom: significant difference from 0 in that epoch, p < 0.05, n = 131, shading: ± S.E.M. See also table S3, figure S5.
Figure 4
Figure 4. Dynamics of population target selectivity
A) Targeted dimensionality reduction. Choice-separating hyperplanes (black arrows: linear combinations of neuronal firing rates) were identified with multinomial logistic regression. Single trial neural activity was projected into the subspace defined by these hyperplanes (gray plane). Middle panel: The distribution of whole-trial positions in the subspace from one example session. Each marker indicates the position of one trial, colored according to whether target 1 (green), 2 (blue), or 3 (red) was chosen. d is the Euclidean distance between two trials in this subspace. Left: The scatter index (top) is a measure of clustering in the choice-predictive subspace. The two highlighted trials are example target-1 choices with a high scatter index (left) and a low scatter index (right), respectively. B) Example neural trajectories in the choice-predictive subspace. Top: Because logistic regression was used to calculate the separating hyperplanes, the vectors perpendicular to the axes (colored arrows) reflect increasing confidence that the monkey will make that decision. Bottom left: Average neural trajectories during exploit trials from the example session. Saturated colors = average across all exploit choices. Desaturated = 4 random samples matched to the number of explore choices. Bottom right: Trajectories during explore choices. C) Average scatter index for explore and exploit choices in each session. All sessions are above the unity line. Dark gray = individually significant sessions. D) Evolution of the scatter index during the example session, during explore (purple) and exploit (black) choices. E) Same as D, averaged across sessions. F) The difference in within-choice trajectory distance between explore choices and exploit choices, averaged across sessions. Thick lines indicate significant difference from 0 (corrected p < 0.05, rank sum). G) Between-choice divergence in neural trajectories. Exponential model fits overlaid.
Shading: ± S.E.M., n = 28 sessions throughout. See also table S3, figures S4–5.
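The targeted dimensionality reduction and scatter index of Figure 4 can be sketched in plain numpy: fit a multinomial (softmax) logistic regression on trial firing rates, project trials onto the fitted hyperplanes, and measure same-choice clustering as the mean pairwise distance in that subspace. The solver, regularization, and exact scatter-index normalization here are assumptions, not the paper's implementation.

```python
import numpy as np

def fit_choice_hyperplanes(rates, choices, n_classes=3, lr=0.1, n_iter=500):
    """Multinomial logistic regression by gradient ascent on the mean
    log-likelihood. rates: (trials, units); choices: int labels in [0, 3).
    A sketch of the choice-separating hyperplanes; solver is assumed."""
    n, d = rates.shape
    W = np.zeros((n_classes, d))
    Y = np.eye(n_classes)[choices]              # one-hot choice labels
    for _ in range(n_iter):
        logits = rates @ W.T
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        W += lr * (Y - P).T @ rates / n         # log-likelihood gradient step
    return W

def scatter_index(proj, choices):
    """Mean pairwise Euclidean distance among same-choice trials in the
    choice-predictive subspace (one plausible scatter-index definition)."""
    out = {}
    for c in np.unique(choices):
        p = proj[choices == c]
        d = np.linalg.norm(p[:, None] - p[None, :], axis=-1)
        out[c] = d[np.triu_indices(len(p), 1)].mean()
    return out
```

Usage: `proj = rates @ W.T` gives each trial's position in the subspace; under the paper's result, explore-labeled trials would show a larger scatter index than exploit-labeled trials of the same choice.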
Figure 5
Figure 5. Target selectivity across trials relative to explore transitions
A) Average scatter index on trials before, during, and after exploration from an example session. Lines = GLM fits to the scatter index before and after exploration. Bars ± SEM throughout, * p < 0.05, n = 28 sessions. B) Same as A, across sessions. C) Residual spike count autocorrelation for exploit trials that were (light gray) or were not (dark gray) separated by exploration. Lags < 2 were not possible for explore-separated trials. Lines represent polynomial fit (order = number of lags ÷ 2), shading ± SEM of the fit. Solid lines along the bottom are significant bins, bootstrapped, p < 0.05, corrected, n = 514 units. D) Scatter index during the first 5 exploit trials following exploration, combined across sessions as a function of both trials since exploration and the rewards accumulated since exploration. Trial counts in each bin are overlaid. E) The difference in the scatter index between trials where reward was received on the last trial and when it was not, separated according to time since exploration, n = 28 sessions.
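The residual spike-count autocorrelation in panel C can be sketched as the lagged correlation of mean-subtracted trial-by-trial spike counts. This is a simplified illustration: the paper's residualization (e.g., removing task-related structure before correlating) is assumed away here.

```python
import numpy as np

def residual_autocorrelation(counts, max_lag=10):
    """Autocorrelation of residual (mean-subtracted) spike counts across
    trial lags 1..max_lag. counts: 1-D array of per-trial spike counts.
    A sketch; the full analysis residualizes out task variables first."""
    r = np.asarray(counts, dtype=float)
    r = r - r.mean()                       # residualize against the mean only
    var = (r ** 2).mean()
    return np.array([(r[:-lag] * r[lag:]).mean() / var
                     for lag in range(1, max_lag + 1)])
```

Comparing this curve for exploit-trial pairs separated versus not separated by exploration would test whether exploration resets across-trial spike-count correlations.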
