Functional requirements for reward-modulated spike-timing-dependent plasticity

Nicolas Frémaux et al.

J Neurosci. 2010 Oct 6;30(40):13326-37. doi: 10.1523/JNEUROSCI.6249-09.2010.
Abstract

Recent experiments have shown that spike-timing-dependent plasticity is influenced by neuromodulation. We derive theoretical conditions for successful learning of reward-related behavior for a large class of learning rules where Hebbian synaptic plasticity is conditioned on a global modulatory factor signaling reward. We show that all learning rules in this class can be separated into a term that captures the covariance of neuronal firing and reward and a second term that represents the influence of unsupervised learning. The unsupervised term, which is, in general, detrimental for reward-based learning, can be suppressed if the neuromodulatory signal encodes the difference between the reward and the expected reward, but only if the expected reward is calculated for each task and stimulus separately. If several tasks are to be learned simultaneously, the nervous system needs an internal critic that is able to predict the expected reward for arbitrary stimuli. We show that, with a critic, reward-modulated spike-timing-dependent plasticity is capable of learning motor trajectories with a temporal resolution of tens of milliseconds. The relation to temporal difference learning, the relevance of block-based learning paradigms, and the limitations of learning with a critic are discussed.
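The decomposition claimed above can be illustrated numerically. The following sketch (a toy setup, not the paper's derivation) draws a Hebbian eligibility trace e with nonzero mean and a correlated reward R, and checks that the expected update E[(R - b)e] equals a covariance term Cov(R, e) plus an unsupervised term (E[R] - b)E[e], which disappears only when the baseline b matches the expected reward.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
e = rng.normal(1.0, 0.5, n)            # Hebbian eligibility trace with nonzero mean
R = 0.3 * e + rng.normal(0.0, 0.1, n)  # reward correlated with the trace

for b in (0.0, R.mean()):              # biased baseline vs. baseline equal to the mean reward
    expected_update = np.mean((R - b) * e)
    covariance = np.cov(R, e, bias=True)[0, 1]
    unsupervised = (R.mean() - b) * e.mean()
    print(f"b={b:.3f}:  E[(R-b)e]={expected_update:.4f}  "
          f"Cov(R,e)+unsup={covariance + unsupervised:.4f}  unsup={unsupervised:.4f}")
```

With b = 0 the unsupervised term dominates the expected update; with b equal to the mean reward the update reduces to the covariance of reward and the eligibility trace.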


Figures

Figure 1.
Learning spike train responses with reward-modulated STDP. A, Reward-modulated STDP. Depending on the relative timing of presynaptic and postsynaptic spikes (blue, pre-before-post; red, post-before-pre), candidate changes, eij, in synaptic weight arise. They decay unless they are permanently imprinted in the synaptic weights, wij, by a success signal, S(R). The signs of both the candidate change eij and the success signal S(R) affect the sign of the actual weight change (i.e., if both are negative, the weight change is positive). B, Learning task. In each trial, the same input spike pattern (left) is presented to the network. The output spike trains (right, black) of five postsynaptic neurons are compared with five target spike trains (right, red), yielding a set of neuron-specific scores Rni, which are averaged over all output neurons to yield a global reward signal Rn. The success signal S(Rn), which triggers synaptic plasticity, is a function of the global reward Rn. C, Learning of the target spike train by one of the output neurons. The target spike times are shown in light red, and the actual spike times of the output neuron are indicated by colored spike trains. Each line corresponds to a different trial at the beginning (magenta), middle (green), and end (yellow) of learning. The individual scores Rni for the neuron are indicated on the right (higher values represent better learning). D, Learning curve. Evolution of the reward Rn (gray dots, only 25% shown for clarity) during a learning episode (R-STDP; a single output pattern was learned). The vertical color bars match the trials shown in C. The black curve shows the averaged score R̄n, which is used to calculate the success signal S(Rn) = Rn - R̄n, shown at the bottom. The dotted line shows performance before learning and the dashed line represents the performance of the reference weights (see Materials and Methods), indicating good performance.
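As a reading aid for panel A, here is a minimal sketch of an R-STDP-style update, assuming exponential STDP windows and an exponentially decaying eligibility trace; the amplitudes, time constants, and learning rate are illustrative placeholders, not the paper's parameters.

```python
import numpy as np

A_plus, A_minus = 1.0, -1.0   # pre-before-post / post-before-pre amplitudes (assumed)
tau_stdp = 20.0               # STDP window time constant, ms (assumed)
tau_e = 500.0                 # eligibility-trace decay time constant, ms (assumed)
eta = 0.05                    # learning rate (assumed)

def stdp_window(dt):
    """Candidate weight change for a spike pair with dt = t_post - t_pre (ms)."""
    return A_plus * np.exp(-dt / tau_stdp) if dt >= 0 else A_minus * np.exp(dt / tau_stdp)

def r_stdp_update(w, pre_spikes, post_spikes, reward, mean_reward, t_reward):
    """One trial of R-STDP: accumulate a decaying eligibility trace e from all
    spike pairings, then imprint it into the weight with S(R) = R - <R>."""
    e = 0.0
    for t_pre in pre_spikes:
        for t_post in post_spikes:
            t_pair = max(t_pre, t_post)   # pairing is complete at the later spike
            e += stdp_window(t_post - t_pre) * np.exp(-(t_reward - t_pair) / tau_e)
    S = reward - mean_reward              # success signal
    return w + eta * S * e

# Toy example: one synapse, a few spike times (ms), reward above its running mean.
w = r_stdp_update(0.5, pre_spikes=[10.0, 40.0], post_spikes=[15.0, 35.0],
                  reward=0.8, mean_reward=0.5, t_reward=100.0)
print(w)
```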
Figure 2.
R-STDP, unlike R-max, is sensitive to an offset in the success signal S(R) = Rn - R̄n + C. A, Effect of a success-signal offset on learning performance. The reward obtained after several thousand trials (vertical axis) is shown as a function of the success offset C, given in units of the SD σR of the reward before learning. Filled circles, R-max (red) is robust to success offsets, whereas R-STDP (blue) fails with even small offsets. The performance of R-STDP for negative offsets C drops below the performance before learning (dotted line). Empty circles, Reducing the learning rate (η = 1 → η = 0.06) and allowing neurons to learn for more trials (Ntrials = 5000 → Ntrials = 80,000) compensates for the effect of the offset for R-max, but does not significantly improve the performance of R-STDP. Averages are over 20 different pattern sets. Error bars show SD. B, C, Nonzero success offsets bias R-STDP toward unsupervised learning. Latency of the first output spike versus latency of the first target spike, pooled over input patterns and output neurons, is shown for R-STDP (B) and R-max (C). If learning succeeds, both values match (gray diagonal line). This is the case for R-max (C) and unbiased R-STDP (B, blue dots), but R-STDP with a nonzero success offset shows the behavior of the unsupervised rule: postsynaptic neurons fire earlier than the target for C > 0 (B, green dots) and later for C < 0 (B, red dots). D, E, R-STDP cannot be rescued by weight dependence (D, α = 1, green dots; red and blue dots redrawn from A), nor by variations in the ratio λ of the pre-before-post and post-before-pre window sizes (E). F, Results are not specific to the reward scheme. Same as A, but with a spike-count score instead of the spike-timing score. In A and D-F, the dotted line shows the performance before learning and the dashed line shows the performance of the reference weights.
Figure 3.
R-STDP, but not R-max, needs a stimulus-specific reward-prediction system to learn multiple input/output patterns. In each trial, pattern A or B is presented at the input, and the output pattern is compared with the corresponding target pattern. A, R-max can learn two patterns, even when the success signal S(R) for each pattern does not average to zero. Top, Rewards as a function of trial number. Magenta, Pattern A; green, pattern B; black, running trial mean of the reward; dotted line, reward before learning; dashed line, reward obtained with the reference weights (see Materials and Methods). Bottom, Success signals S(R) for stimuli A and B. For clarity, only 25% of the trials are shown. B, R-STDP fails to learn two patterns if the success signal is not stimulus-specific. As long as, by chance, the actual rewards obtained for stimuli A and B are similar [top, first 4000 trials; A (magenta) and B (green) reward values overlap], the mean-reward subtraction is correct for both and performance increases. However, as soon as a minor discrepancy in mean reward appears between the two tasks (arrow at ∼4000 trials, magenta above green dots), performance drops to the prelearning level (dotted line) and fails to recover. For visual clarity, the figure shows a run with a relatively late failure. C, R-STDP can be rescued if the success signal is a stimulus-specific reward-prediction error. A critic maintains a stimulus-specific mean-reward predictor (top, dark magenta and dark green lines) and provides the network with unbiased success signals (bottom) for both stimuli. D, Performance as a function of the number of stimuli. A stimulus-specific reward-prediction system makes a significant difference for large numbers of distinct stimulus-response pairs. Filled circles, Success signal based on a simple, stimulus-unspecific trial average; empty circles, stimulus-specific reward-prediction error. R-STDP (blue) fails to learn more than one stimulus/response association without stimulus-specific reward prediction, but performs well in the presence of a critic, staying close to the performance level of the reference weights (dashed line). R-max (red) does not require a stimulus-specific reward prediction, but adding one still increases its performance. Points with/without critic are offset horizontally for visibility; they correspond to the ticks of the abscissa. The performance decreases for large numbers of stimulus/response pairs because the learned weights become less specialized and closer to the reference weights (see inset), so the performance of the reference weights becomes an upper bound. Inset, Normalized scalar product of the learned and reference weights, w⃗·w⃗*/(‖w⃗‖ ‖w⃗*‖) (vertical axis; the horizontal axis shows the same values as the main graph). Only data for R-max with critic are shown. Red dashed line, Exponential fit of the data. Black dashed line and gray area represent the mean and SD for random, uniformly drawn weights w⃗, respectively. In all panels (except the inset of D), the dotted line shows the performance before learning and the dashed line shows the performance of the reference weights.
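A stimulus-specific critic of the kind described in panel C can be as simple as a per-stimulus running average of the reward. The sketch below assumes an exponential running average with rate alpha; the paper's critic may be implemented differently.

```python
class Critic:
    """Per-stimulus reward predictor returning a reward-prediction error."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha       # running-average rate (assumed)
        self.expected = {}       # stimulus id -> predicted mean reward

    def success_signal(self, stimulus, reward):
        r_hat = self.expected.get(stimulus, 0.0)
        self.expected[stimulus] = r_hat + self.alpha * (reward - r_hat)
        return reward - r_hat    # stimulus-specific reward-prediction error

critic = Critic()
print(critic.success_signal("A", 0.7))   # first trial of stimulus A
print(critic.success_signal("B", 0.2))   # prediction for B is independent of A
```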
Figure 4.
Results applied to a more realistic spatiotemporal trajectory-learning task. The learning setup differs from that of Figure 1 in several ways. A, Stochastic input. The firing rates of the inputs are sums of a fixed number of randomly distributed Gaussians. Firing rates (colored areas) are constant over trials, but the spike trains vary from trial to trial (black spikes). Tasks A and B are randomly interleaved. A fraction of the inputs fires on presentation of both tasks A and B; the other neurons fire only for a particular task. The network structure is the same as in Figure 1, but with 350 inputs and 200 neurons. B, Population vector coding. The spike trains of the output neurons are filtered to yield postsynaptic rates ri(t) (upper left). Each output neuron has a preferred direction, υ⃗i (upper right); the actual direction of motion is the population vector, υ⃗(t) = Σi ri(t) υ⃗i / ‖Σi ri(t) υ⃗i‖ (bottom right). The preferred directions of the neurons are randomly distributed on the three-dimensional unit sphere (bottom left). C, Reward-modulated STDP can learn spatiotemporal trajectories. The network has to learn two target trajectories (red traces) in response to two different inputs. Target trajectory A lies in the xy plane and B in the xz plane. The green and blue traces show the output trajectories of the last trials for tasks A and B, respectively. Gray shadows show the deviation of the trajectories from their respective target planes. The network learned for 10,000 trials, using the R-max learning rule with a critic. D, The reward is calculated from the difference between the learned and target trajectories. The plot shows the scalar product of the actual direction of motion υ⃗ and the target direction υ⃗*, averaged over the last 20 trials of the simulation; higher values represent better learning. The reward given at the end of a trial is the positive part of this scalar product, averaged over the whole trial, Rn = (1/T) ∫₀ᵀ [υ⃗(t) · υ⃗*(t)]₊ dt. E, Results from the spike-train learning experiment carry over to trajectory learning. The bars represent the average reward over the last 100 trials (of 10,000 trials for the whole learning sequence). Error bars show SD over 20 different trajectory pairs. Each learning rule was simulated in three settings, as follows: randomly alternating tasks with a reward-prediction system (critic), tasks alternating in blocks of 500 trials without critic (block), and randomly alternating tasks without critic (rand.). The hatched bars represent R-STDP without a post-before-pre window, corresponding to λ = 0 in Figure 2E. The dotted line shows the performance before learning.
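The population-vector readout of panel B and the trial reward of panel D can be written compactly. The sketch below assumes filtered rates stored as an array rates[i, t] and unit-norm preferred directions pref[i]; the array shapes and toy inputs are illustrative, not taken from the paper.

```python
import numpy as np

def population_vector(rates, pref):
    """Direction of motion v(t): rate-weighted sum of preferred directions,
    normalized to unit length at every time step."""
    v = rates.T @ pref                                   # shape (T, 3)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def trial_reward(v, v_target):
    """R_n = (1/T) * sum over time of the positive part of v(t) . v*(t)."""
    dots = np.sum(v * v_target, axis=1)
    return np.mean(np.clip(dots, 0.0, None))

rng = np.random.default_rng(1)
pref = rng.normal(size=(200, 3))
pref /= np.linalg.norm(pref, axis=1, keepdims=True)      # random unit preferred directions
rates = rng.random((200, 100))                           # 200 neurons, 100 time bins (toy)
v = population_vector(rates, pref)
v_target = np.tile([1.0, 0.0, 0.0], (100, 1))            # toy target: constant +x direction
print(trial_reward(v, v_target))
```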
