Neuron, 47 (1), 129-41

Midbrain Dopamine Neurons Encode a Quantitative Reward Prediction Error Signal

Hannah M Bayer et al. Neuron.

Abstract

The midbrain dopamine neurons are hypothesized to provide a physiological correlate of the reward prediction error signal required by current models of reinforcement learning. We examined the activity of single dopamine neurons during a task in which subjects learned by trial and error when to make an eye movement for a juice reward. We found that these neurons encoded the difference between the current reward and a weighted average of previous rewards, a reward prediction error, but only for outcomes that were better than expected. Thus, the firing rate of midbrain dopamine neurons is quantitatively predicted by theoretical descriptions of the reward prediction error signal used in reinforcement learning models for circumstances in which this signal has a positive value. We also found that the dopamine system continued to compute the reward prediction error even when the behavioral policy of the animal was only weakly influenced by this computation.

Figures

Figure 1. Saccade Timing Task
(Left) The events of an individual trial as a function of time. Animals were reinforced for executing a saccade during an unsignaled temporal window. The height of the reinforcement cartoon indicates that there were five intervals during this window, each of which was associated with an increasing reward size. The delay before the rewarded temporal window was shifted between blocks of trials without cuing the animals. (Right) The spatial configuration of the fixation and target light-emitting diodes.
Figure 2. Animals Choose Saccadic Latencies for which They Will Be Reinforced
(A) Saccadic latencies plotted sequentially from a single behavioral session. The gray squares represent the extent of the rewarded temporal window. (B) Average saccadic latency for the last fifty trials of each block, plotted as a function of the time at which the interval associated with the largest reward size began (error bars showing standard deviation fit inside the points). (C) Log of the change in reaction time from the current to the next trial (ΔRT) plotted as a function of the log of the difference between the reaction time on the current trial and the reaction time that would have provided the largest volume of juice (RT error). Includes only trials from blocks in which the best reaction time was the earliest one the monkeys ever experienced (including the data used to compute the average point labeled “C”). Rewarded trials are in dark gray, unrewarded trials are in light gray, and mean and standard error are plotted in black. (D) ΔRT plotted as a function of RT error, including only trials from blocks in which the best reaction time was the latest one the monkeys ever experienced (including the data used to compute the average point labeled “D”). Rewarded trials are in dark gray, unrewarded trials are in light gray, and mean and standard error are plotted in black.
Figure 3. Dopamine Neurons Show a Characteristic Response to Unpredicted Rewards
(A) The average response of a single dopamine neuron to the delivery of an unpredicted reward, aligned to the time of reward delivery. Error bars are standard deviation of the mean, and time bins are 20 ms long (n = 10 trials). (B) The distribution of average firing rates during a 150 ms interval starting 75 ms after the delivery of an unpredicted reward, shown in gray in (A), for all neurons in the population. Mean = 26 Hz; standard deviation = 9 Hz (n = 46). (C) Histological localization of a subset of dopamine neurons from this report. Circles are locations of marking lesions placed at the location of recorded neurons, and dashes are estimated locations for additional neurons (where no lesions were made). Distortions in these drawings accurately reflect a significant distortion of the anatomy observed following the perfusion process. The animal had suffered from a blockage of the lateral ventricle during the period in which the marking lesions were made. As a result, the tissue was not sliced in the vertical plane, and sections differ significantly from canonical images.
Figure 4. Responses of a Dopamine Neuron during the Saccade Timing Task
(Left) Average response of the neuron aligned to the auditory tone that initiated the trial; error bars represent standard error. (Right) Average response of the neuron aligned to the time of reward delivery; error bars represent standard error. Plotted above the averages are a randomly selected subset of 40 trials from each condition as examples of the raw data that were used to compute the averages. (Both graphs) In black are trials in which there was a large difference between the size of the reward delivered during the trial and the size of the reward during the previous trial (n = 300). In gray are trials in which there was a small difference in size between the current and previous rewards (n = 289). For reference, the legend underneath represents the events of the trial as a function of time.
Figure 5. Multiple Linear Regression of Neuronal Firing Rate and Reward History: Single Neuron
(A) Coefficients from multiple linear regression for a single neuron (L041103). (Inset) Last ten coefficients plotted as they would be used to compute a weighted average. Each one is divided by the value of the first coefficient. Error bars represent the 95% confidence intervals. R-squared = 0.50; p < 0.00001; n = 1007 trials. (B) Firing rate plotted as a function of weighted reward history. Weighted reward history computed using the coefficients shown in (A) after they have been normalized by dividing all coefficients by the value of the first. Error bars represent standard error. (C) Coefficients from multiple linear regression for a single neuron (C032504). (Inset) Last ten coefficients plotted as they would be used to compute a weighted average. Error bars represent the 95% confidence intervals. R-squared = 0.42; p < 0.00001; n = 295 trials. (D) Firing rate plotted as a function of weighted reward history. Weighted reward history computed using normalized regression coefficients shown in (C). Error bars represent standard error. (E) Coefficients from multiple linear regression for all neurons combined. (Inset) Last ten coefficients plotted as they would be used to compute a weighted average. Error bars represent the 95% confidence intervals. R-squared = 0.21; p < 0.0001; n = 13919 trials. (F) Firing rate plotted as a function of weighted reward history. Weighted reward history computed using normalized regression coefficients shown in (E). Error bars represent standard error.
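The regression described in this caption can be illustrated with a minimal sketch. It is not the authors' analysis code: the data are synthetic, and the names `lagged_design`, `true_w`, and `response`, the noise level, and the learning rate of 0.5 are all assumptions made for illustration. The sketch regresses a simulated response on the current reward and the ten previous rewards, then divides every coefficient by the first, as the caption describes for the inset.

```python
import numpy as np


def lagged_design(rewards, n_lags=10):
    """Build a design matrix whose columns are: intercept, R_t, R_{t-1}, ..., R_{t-n_lags}."""
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    # Column k holds the reward delivered k trials before the current one.
    cols = [rewards[n_lags - k : n - k] for k in range(n_lags + 1)]
    return np.column_stack([np.ones(n - n_lags)] + cols)


rng = np.random.default_rng(0)
rewards = rng.uniform(0.0, 1.0, size=2000)  # hypothetical reward sizes

# Assumed generating model: current reward minus an exponentially
# weighted average of the ten previous rewards (alpha is hypothetical).
alpha = 0.5
true_w = np.array([1.0] + [-alpha * (1 - alpha) ** (k - 1) for k in range(1, 11)])

X = lagged_design(rewards)
# Toy linear response with Gaussian noise; not a model of real firing rates.
response = 5.0 + 20.0 * (X[:, 1:] @ true_w) + rng.normal(0.0, 1.0, size=len(X))

# Ordinary least squares fit of the response on the reward history.
coef, *_ = np.linalg.lstsq(X, response, rcond=None)
weights = coef[1:]                  # drop the intercept
normalized = weights / weights[0]   # divide by the current-reward coefficient

print(np.round(normalized, 3))
```

Under these assumptions, the normalized coefficients for previous rewards come out negative and shrink with lag, which is the pattern expected if the response tracks the current reward minus a weighted average of earlier rewards.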
Figure 6. Neuronal Firing Rates Are Better Correlated with Reward History When Firing Rates Are above Baseline, and When There Is Low Correlation between Sequential Rewards
(A) Coefficients from multiple linear regression for all cells. Plotted in black are the results of the regression including only trials with firing rate above baseline (R-squared = 0.16; p < 0.00001; n = 10449 trials). Plotted in gray are the results of the regression including only trials below baseline (R-squared = 0.03; p < 0.00001; n = 3966 trials). Error bars represent the 95% confidence intervals. (B) Coefficients from multiple linear regression for all cells. Plotted in black are the results of the regression including only the first 20 trials of each block (R-squared = 0.32; p < 0.00001; n = 3100 trials). Plotted in gray are the results of the regression including only the last 20 trials of each block (R-squared = 0.09; p < 0.00001; n = 3180 trials). Error bars represent the 95% confidence intervals.
Figure 7. Neuronal Responses Do Not Encode Temporal Properties of the Saccade
(A) R-squared values for individual neurons from regressions that included the temporal interval in which the saccade was executed as an additional variable, plotted as a function of the R-squared values from the regression using reward history only. (B) R-squared values for individual neurons from regressions in which the reward history was composed only of saccades executed in the same temporal interval, plotted as a function of the R-squared values from the regression using a sequential reward history.
Figure 8. Monkeys Use Information to Perform the Delayed Saccade Task that Is Not Encoded in Neuronal Firing Rates
(All plots) Rewarded trials are in dark gray, and unrewarded trials are in light gray. Mean and standard error are plotted in black. (A) Reward prediction error plotted as a function of RT error for all trials in which the target reaction time was the earliest one the monkeys had ever experienced. (B) Change in reaction time (ΔRT) plotted as a function of reward prediction error for all trials in which the target reaction time was the earliest one the monkeys had ever experienced. (C) Reward prediction error plotted as a function of RT error for all trials in which the target reaction time was the latest one the monkeys had ever experienced. (D) Change in reaction time (ΔRT) plotted as a function of reward prediction error for all trials in which the target reaction time was the latest one the monkeys had ever experienced. (E) Coefficients from multiple linear regression for all cells. Plotted in dark gray are the results of the regression including only rewarded trials (R-squared = 0.43; p < 0.00001; n = 12016 trials). Plotted in light gray are the results of the regression including only unrewarded trials (R-squared = 0.22; p < 0.00001; n = 2399 trials). Error bars represent the 95% confidence intervals.
Figure 9. Theoretical Reward Prediction Error Computations
Value of the reward prediction error, $\alpha[R_t - V_{t-1}]$, computed using the following equation: $V_t = V_{t-1} + \alpha[R_t - V_{t-1}]$. Plotted in black, α = 0.5; plotted in gray, α = 0.7. A unit-value reward was simulated on trial t, followed by no reward on the next ten trials.
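The simulation described in this caption can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the function name `simulate_rpe` and the starting value estimate of zero are assumptions. It applies the update rule from the caption to one unit reward followed by ten unrewarded trials.

```python
def simulate_rpe(alpha, n_trials=11):
    """Return the scaled prediction error on each trial for a unit reward
    on the first trial followed by no reward on the remaining trials."""
    rewards = [1.0] + [0.0] * (n_trials - 1)
    value = 0.0  # assumed value estimate before the rewarded trial
    errors = []
    for reward in rewards:
        delta = alpha * (reward - value)  # alpha * [R_t - V_(t-1)]
        value = value + delta             # V_t = V_(t-1) + alpha * [R_t - V_(t-1)]
        errors.append(delta)
    return errors


for alpha in (0.5, 0.7):
    print(alpha, [round(e, 4) for e in simulate_rpe(alpha)])
```

With α = 0.5 the first three values are 0.5, −0.25, and −0.125: a positive error on the rewarded trial, then negative errors that decay geometrically as the value estimate relaxes back toward zero. A larger α gives a bigger initial error and a faster decay.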
