
Standardized and reproducible measurement of decision-making in mice

International Brain Laboratory et al. eLife. 2021 May 20;10:e63711. doi: 10.7554/eLife.63711.

Erratum in

  • Correction: Standardized and reproducible measurement of decision-making in mice. International Brain Laboratory et al. eLife. 2022 Oct 27;11:e84310. doi: 10.7554/eLife.84310. PMID: 36301084.

Abstract

Progress in science requires standardized assays whose results can be readily shared, compared, and reproduced across laboratories. Reproducibility, however, has been a concern in neuroscience, particularly for measurements of mouse behavior. Here, we show that a standardized task to probe decision-making in mice produces reproducible results across multiple laboratories. We adopted a task for head-fixed mice that assays perceptual and value-based decision-making, and we standardized the training protocol and the experimental hardware, software, and procedures. We trained 140 mice across seven laboratories in three countries, and we collected 5 million mouse choices into a publicly available database. Learning speed was variable across mice and laboratories, but once training was complete there were no significant differences in behavior across laboratories. Mice in different laboratories adopted a similar reliance on visual stimuli, on past successes and failures, and on estimates of stimulus prior probability to guide their choices. These results reveal that a complex mouse behavior can be reproduced across multiple laboratories. They establish a standard for reproducible rodent behavior, and provide an unprecedented dataset and open-access tools to study decision-making in mice. More generally, they indicate a path toward achieving reproducibility in neuroscience through collaborative open-science approaches.

Keywords: behavior; decision making; mouse; neuroscience; reproducibility.

Plain language summary

In science, it is vitally important that multiple studies corroborate the same result. Researchers therefore need to know all the details of previous experiments in order to implement the procedures as exactly as possible. However, this has become a major problem in neuroscience, as animal studies of behavior have proven hard to reproduce, and most experiments are never replicated by other laboratories.

Mice are increasingly used to study the neural mechanisms of decision making, taking advantage of the genetic, imaging, and physiological tools available for the mouse brain. Yet the lack of standardized behavioral assays is leading to inconsistent results between laboratories. This makes it challenging to carry out the kind of large-scale collaborations that have led to massive breakthroughs in other fields, such as physics and genetics.

To help make these studies more reproducible, the International Brain Laboratory (a collaborative research group) developed a standardized approach for investigating decision making in mice that incorporates every step of the process, from the training protocol to the software used to analyze the data. In the experiment, mice were shown an image of varying contrast and had to indicate, by turning a steering wheel, whether it appeared on their right or left. The mice then received a drop of sugar water for every correct decision. When the image contrast was high, mice could rely on their vision. However, when the image contrast was very low or zero, they needed to consider information from previous trials and choose the side that had recently appeared more frequently.

This method was used to train 140 mice in seven laboratories from three different countries. The results showed that learning speed differed across mice and laboratories, but once training was complete the mice behaved consistently, relying on visual stimuli or past experience to guide their choices in a similar way. These results show that complex behaviors in mice can be reproduced across multiple laboratories, providing an unprecedented dataset and open-access tools for studying decision making. This work could serve as a foundation for other groups, paving the way to a more collaborative approach in the field of neuroscience that could help to tackle complex research challenges.


Conflict of interest statement

VA, DA, HB, NB, MC, FC, GC, AC, YD, ED, MF, HF, LH, MH, SH, FH, AK, CK, IL, ZM, GM, NM, TM, MM, JN, AP, CR, KS, RT, AU, HV, MW, CW, IW, LW, AZ: No competing interests declared. JS: Owner of Sanworks LLC, which provides hardware and consulting for the experimental setup described in this work.

Figures

Figure 1. Standardized pipeline and apparatus, and training progression in the basic task.
(a) The pipeline for mouse surgeries and training. The number of animals at each stage of the pipeline is shown in bold. (b) Schematic of the task, showing the steering wheel and the visual stimulus moving to the center of the screen vs. the opposite direction, with resulting reward vs. timeout. (c) CAD model of the behavioral apparatus. Top: the entire apparatus, showing the back of the mouse. The screen is shown as transparent for illustration purposes. Bottom: side view of the detachable mouse holder, showing the steering wheel and water spout. A 3D rendered video of the CAD model can be found here. (d) Performance of an example mouse (KS014, from Lab 1) throughout training. Squares indicate choice performance for a given stimulus on a given day. Color indicates the percentage of right (red) and left (blue) choices. Empty squares indicate stimuli that were not presented. Negative contrasts denote stimuli on the left, positive contrasts denote stimuli on the right. (e) Example sessions from the same mouse. Vertical lines indicate when the mouse reached the session-ending criteria based on trial duration (top) and accuracy on high-contrast (>=50%) trials (bottom) averaged over a rolling window of 10 trials (Figure 1—figure supplement 1). (f) Psychometric curves for those sessions, showing the fraction of trials in which the stimulus on the right was chosen (rightward choices) as a function of stimulus position and contrast (difference between right and left, i.e. positive for right stimuli, negative for left stimuli). Circles show the mean and error bars show ±68% confidence intervals. The training history of this mouse can be explored at this interactive web page.
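For readers who want to reproduce curves like those in (f), below is a minimal sketch of fitting a lapse-augmented error-function psychometric model, a common parameterization for this kind of task. The parameter names and example data are illustrative, not taken from the IBL codebase.

```python
# Minimal sketch: fit a lapse-augmented error-function psychometric curve.
# Parameter names and example data are illustrative, not the IBL codebase.
import numpy as np
from scipy.special import erf
from scipy.optimize import curve_fit

def psychometric(contrast, bias, threshold, lapse_low, lapse_high):
    """P(rightward choice) as a function of signed contrast (-100 to 100)."""
    core = 0.5 * (1 + erf((contrast - bias) / (np.sqrt(2) * threshold)))
    return lapse_low + (1 - lapse_low - lapse_high) * core

# Hypothetical per-contrast rightward-choice fractions for one mouse.
contrasts = np.array([-100, -50, -25, -12.5, -6, 0, 6, 12.5, 25, 50, 100])
p_right = np.array([0.02, 0.05, 0.10, 0.25, 0.40, 0.50,
                    0.62, 0.75, 0.90, 0.95, 0.97])

params, _ = curve_fit(psychometric, contrasts, p_right,
                      p0=[0.0, 20.0, 0.05, 0.05],
                      bounds=([-50, 1, 0, 0], [50, 100, 0.5, 0.5]))
print(dict(zip(["bias", "threshold", "lapse_low", "lapse_high"],
               params.round(3))))
```

In this parameterization, the threshold and bias terms correspond conceptually to the contrast-threshold and choice-bias metrics reported in Figures 2 and 3.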
Figure 1—figure supplement 1. Task trial structure.
Trials began with an enforced quiescent period during which the wheel had to be kept still for at least 200 ms; the visual stimulus then appeared, together with an auditory tone indicating the start of the closed-loop period. The feedback period began when a response was given or when 60 s had elapsed since stimulus onset. On correct trials, a reward was given and the stimulus remained in the center of the screen for 1 s. On incorrect trials, a noise burst was played and a 2 s timeout preceded the next trial.
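This trial timing can be summarized as a simple state sequence. The sketch below encodes only the constants given in this legend; the callables (is_wheel_still, present_stimulus, and so on) are hypothetical placeholders, not the IBL task code.

```python
# Sketch of the trial state sequence described above. Timing constants come
# from the legend; all callables are hypothetical placeholders.
import time

QUIESCENT_S = 0.2        # wheel must be still for at least 200 ms
RESPONSE_WINDOW_S = 60   # feedback starts at response, or 60 s after onset
REWARD_HOLD_S = 1.0      # correct: stimulus stays centered for 1 s
TIMEOUT_S = 2.0          # incorrect: noise burst, then 2 s timeout

def run_trial(is_wheel_still, present_stimulus, get_response,
              give_reward, play_noise):
    # Enforced quiescence: any wheel movement resets the 200 ms clock.
    still_since = time.monotonic()
    while time.monotonic() - still_since < QUIESCENT_S:
        if not is_wheel_still():
            still_since = time.monotonic()
    present_stimulus()  # stimulus onset + tone open the closed-loop period
    correct = get_response(timeout=RESPONSE_WINDOW_S)  # None if 60 s elapse
    if correct:  # a timed-out (None) trial is treated as incorrect here
        give_reward()
        time.sleep(REWARD_HOLD_S)
    else:
        play_noise()
        time.sleep(TIMEOUT_S)
```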
Figure 1—figure supplement 2. Distribution of within-session disengagement criteria.
The session ended when one of three criteria was met: the mouse performed fewer than 400 trials in 45 min ('not enough trials'); the mouse performed over 400 trials and the session length reached 90 min ('session too long'); or the mouse performed over 400 trials and its median reaction time (RT) over the last 20 trials exceeded five times the median RT for the whole session ('slow-down'). The plot shows the proportion of sessions ending in each of the three criteria (green, orange, and blue, respectively) for all mice that learned the task.
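A direct translation of these three criteria into code might look like the following sketch; function and argument names are illustrative, and the actual implementation in the IBL pipeline may differ.

```python
# Sketch of the three disengagement criteria described above.
# `rts` holds per-trial reaction times in seconds; names are illustrative.
import statistics

def session_end_reason(n_trials, elapsed_min, rts):
    if elapsed_min >= 45 and n_trials < 400:
        return "not enough trials"          # <400 trials in 45 min
    if n_trials > 400 and elapsed_min >= 90:
        return "session too long"           # >400 trials, 90 min reached
    if n_trials > 400 and len(rts) >= 20:
        if statistics.median(rts[-20:]) > 5 * statistics.median(rts):
            return "slow-down"              # recent RTs 5x the session median
    return None                             # session continues
```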
Figure 2. Learning rates differed across mice and laboratories.
(a) Performance for each mouse and laboratory throughout training. Performance was measured on easy trials (50% and 100% contrast). Each panel represents a different lab, and each thin curve represents a mouse. The transition from light gray to dark gray indicates when each mouse achieved proficiency in the basic task. Black, performance for the example mouse in Figure 1. Thick colored lines show the lab average. Curves stop at day 40, when the automated training procedure suggests that mice be dropped from the study if they have not learned. (b) Same, for contrast threshold, calculated starting from the first session with a 12% contrast (i.e. the first session with six or more different trial types), to ensure accurate psychometric curve fitting. Thick colored lines show the lab average from the moment there were three or more data points for a given training day. (c) Same, for choice bias. (d-f) Average performance, contrast threshold, and choice bias of each laboratory across training days. Black curve denotes the average across mice and laboratories. (g) Training times for each mouse compared to the distribution across all laboratories (black). Boxplots show median and quartiles. (h) Cumulative proportion of mice to have reached proficiency as a function of training day (Kaplan-Meier estimate). Black curve denotes the average across mice and laboratories. Data in (a-g) are for mice that reached proficiency (n = 140). Data in (h) are for all mice that started training (n = 206).
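The cumulative curve in (h) can be approximated with a simple empirical estimate: the fraction of all mice that started training whose day of reaching proficiency falls at or before each training day. The sketch below ignores the censoring corrections of a full Kaplan-Meier estimator and uses illustrative variable names.

```python
# Simple empirical version of the curve in (h): fraction of mice that have
# reached proficiency by each training day. Mice that never reached
# proficiency are encoded as np.nan and never counted (an approximation
# that ignores Kaplan-Meier censoring corrections).
import numpy as np

def cumulative_proficiency(days_to_proficiency, max_day=40):
    days = np.asarray(days_to_proficiency, dtype=float)  # nan = never learned
    return [(days <= d).mean() for d in range(1, max_day + 1)]

# Example: three mice learned on days 12, 20, and 33; one never learned.
print(cumulative_proficiency([12, 20, 33, np.nan])[-1])  # 0.75 by day 40
```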
Figure 2—figure supplement 1. Learning rates measured by trial numbers.
(a) Performance curves for each mouse, for each laboratory. Performance was measured on easy trials (50% and 100% contrast). Each panel represents a lab, and each thin curve a mouse. The transition from light gray to dark gray indicates when each mouse achieved proficiency in the basic task. Black, performance for example mouse in Figure 1. Thick colored lines show the lab average. Curves stop at day 40, when the automated training procedure suggests that mice be dropped from the study if they have not learned. (b) Average performance curve of each laboratory across consecutive trials. (c) Number of trials to proficiency for each mouse compared to the distribution across all laboratories (black). Boxplots show median and quartiles. (d) Cumulative proportion of mice to have reached proficiency as a function of trials (Kaplan-Meier estimate). Black curve denotes average across mice and laboratories.
Figure 2—figure supplement 2. Performance variability within and across laboratories decreases with training.
(a) Variability in performance (s.d. of % correct) on easy trials (100% and 50% contrast) (left) within and (right) across laboratories during the first 40 training days. Colors indicate laboratory as in Figures 2–5. (b) Same, for the first 30,000 trials of training.
Figure 3. Performance in the basic task was indistinguishable across laboratories.
(a) Psychometric curves across mice and laboratories for the three sessions at which mice achieved proficiency on the basic task. Each gray curve represents a mouse. The black curve represents the example mouse in Figure 1. Thick colored lines show the lab average. (b) Average psychometric curve for each laboratory. Circles show the mean and error bars show ±68% CI. (c) Performance on easy trials (50% and 100% contrasts) for each mouse, plotted per lab and over all labs. Colored dots show individual mice and boxplots show the median and quartiles of the distribution. (d-e) Same, for contrast threshold and bias. (f) Performance of a Naive Bayes classifier trained to predict lab membership from the measures in (c-e). We included the time zone of the laboratory as a positive control and generated a null distribution by shuffling the lab labels. Dashed line represents chance-level classification performance. Violin plots: distribution over 2000 random sub-samples of eight mice per laboratory. White dots: median. Thick lines: interquartile range.
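The classification test in (f) can be reproduced in outline with scikit-learn. The sketch below uses random placeholder features in place of the real per-mouse metrics, and fewer shuffles than the paper's 2000 sub-samples to keep the example fast.

```python
# Outline of the lab-classification test: Gaussian Naive Bayes on per-mouse
# behavioral metrics, compared against a label-shuffled null distribution.
# Features here are random placeholders for the real metrics in (c-e).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(56, 3))       # placeholder: performance, threshold, bias
labs = np.repeat(np.arange(7), 8)  # 7 labs x 8 mice

observed = cross_val_score(GaussianNB(), X, labs, cv=3).mean()
null = [cross_val_score(GaussianNB(), X, rng.permutation(labs), cv=3).mean()
        for _ in range(200)]
p_value = np.mean(np.array(null) >= observed)
print(f"accuracy={observed:.2f}, chance={1/7:.2f}, p={p_value:.2f}")
```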
Figure 3—figure supplement 1. Mouse choices were no more consistent within labs than across labs.
To measure the similarity in choices across mice within a lab, we computed within-lab choice consistency. For each lab and each stimulus, we computed the variance across mice in the fraction of rightward choices. We then computed the inverse (consistency) and averaged the result across stimuli. (a) Within-lab choice consistency for the basic task (same data as in Figure 3) for each lab (dots) and averaged across labs (line). This averaged consistency was not significantly higher (p=0.73) than a null distribution generated by randomly shuffling lab assignments between mice and computing the average within-lab choice consistency 10,000 times (violin plot). Therefore, choices were no more consistent within labs than across labs. (b) Same analysis, for the full task (same data as in Figure 4). Within-lab choice consistency on the full task was not higher than expected by chance (p=0.25). In this analysis we computed consistency separately for each stimulus and prior block before averaging across them. Choice consistency was higher on the full task than on the basic task; this likely reflects both increased training on the task and a stronger constraint on choice behavior through the full task's block structure. (c) As in (a, b), but measuring the within-lab consistency of the 'bias shift' between the 20:80 and 80:20 blocks (as in Figure 4d,e). Within-lab consistency in bias shift was not higher than expected by chance (p=0.31).
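The consistency measure and its shuffle test might be implemented as in the sketch below, assuming p_right is a (mice × stimuli) array of rightward-choice fractions and lab_of_mouse an array of lab labels; both names are hypothetical.

```python
# Sketch of the within-lab consistency measure and its shuffle test.
# `p_right`: hypothetical (n_mice, n_stimuli) array of rightward-choice
# fractions; `lab_of_mouse`: array of lab labels, one per mouse.
import numpy as np

def within_lab_consistency(p_right, lab_of_mouse):
    labs = np.unique(lab_of_mouse)
    # Per lab: inverse of the across-mouse variance per stimulus, averaged
    # over stimuli; then averaged over labs.
    return np.mean([(1.0 / p_right[lab_of_mouse == lab].var(axis=0)).mean()
                    for lab in labs])

def shuffle_p_value(p_right, lab_of_mouse, n_shuffles=10_000, seed=0):
    rng = np.random.default_rng(seed)
    observed = within_lab_consistency(p_right, lab_of_mouse)
    null = [within_lab_consistency(p_right, rng.permutation(lab_of_mouse))
            for _ in range(n_shuffles)]
    return np.mean(np.array(null) >= observed)  # one-sided p-value
```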
Figure 3—figure supplement 2. Behavioral metrics that were not explicitly harmonized showed small variation across labs.
(a) Average trial duration from stimulus onset to feedback, in the three sessions at which a mouse achieved proficiency in the basic task, shown for individual mice (dots) and as a distribution (box plots). (b) Same, for the average number of trials in each of the three sessions. (c) Same, for the number of trials per minute. Each dot represents a mouse; empty dots denote outliers outside the plotted y-axis range.
Figure 3—figure supplement 3. Classifiers could not predict lab membership from behavior.
(a) Classification performance of the Naive Bayes classifier that predicted lab membership based on the behavioral metrics from Figure 3. In the positive control, the classifier had access to the time zone in which a mouse was trained. In the shuffle condition, the lab labels were randomly shuffled. (b) Confusion matrix for the positive control, showing the proportion of occurrences in which a mouse from a given lab (y-axis) was classified as belonging to the predicted lab (x-axis). Labs in the same time zone form clear clusters, and Lab 7 was always correctly predicted because it is the only lab in its time zone. (c) Confusion matrix for the classifiers based on mouse behavior. The classifier was generally at chance and there was no particular structure to its mistakes. (d-f) Same, for the Random Forest classifier. (g-i) Same, for the Logistic Regression classifier.
Figure 3—figure supplement 4. Comparable performance across institutions when using a reduced inclusion criterion (>=80% performance on easy trials).
(a) Performance on easy trials (50% and 100% contrasts) for each mouse, plotted over all labs (n = 150 mice). Colored dots show individual mice and boxplots show the median and quartiles of the distribution. (b-f) Same, for (b) contrast threshold, (c) bias, (d) trial duration, and (e-f) trials completed per session. As was the case with our standard inclusion criteria (Figure 3—figure supplement 2), there was a small but significant difference in the number of trials per session across laboratories. All other measured parameters were similar. (g) Performance of a Naive Bayes classifier trained to predict lab membership from the measures in (a-c). We included the time zone of the laboratory as a positive control and generated a null distribution by shuffling the lab labels. Dashed line represents chance-level classification performance. Violin plots: distribution over the 2000 random sub-samples of eight mice per laboratory. White dots: median. Thick lines: interquartile range. (h) Confusion matrix for the classifiers based on mouse behavior with the reduced inclusion criterion. The classifier was at chance and there was no particular structure to its mistakes.
Figure 3—figure supplement 5. Behavior was indistinguishable across labs in the first 3 sessions of the full task.
Data are from the first three sessions of the full task (entry to which was triggered by achieving proficiency in the basic task, defined by a set of criteria; Figure 1—figure supplement 1d). (a) Bias for each block prior did not vary significantly across labs (Kruskal-Wallis test: 20:80 blocks, p=0.96; 50:50 blocks, p=0.96; 80:20 blocks, p=0.89). (b) Contrast thresholds also did not vary systematically across labs (Kruskal-Wallis test: 20:80 blocks, p=0.078; 50:50 blocks, p=0.12; 80:20 blocks, p=0.17). (c) Performance on 100% contrast trials did not differ either (Kruskal-Wallis test, p=0.15). (d) A Naive Bayes classifier trained on the data in (a-c) did not perform above chance level when trying to predict the lab membership of mice. (e) Normalized confusion matrix for the classifier in (d).
Figure 4. Mice successfully integrate priors into their decisions and task strategy.
(a) Block structure in an example session. Each session started with 90 trials of 50:50 prior probability, followed by alternating 20:80 and 80:20 blocks of varying length. Presented stimuli (gray, 10-trial running average) and the mouse's choices (black, 10-trial running average) track the block structure. (b) Psychometric curves shift between blocks for the example mouse. (c) For each mouse that achieved proficiency on the full task (Figure 1—figure supplement 1d) and for each stimulus, we computed a 'bias shift' by reading out the difference in choice fraction between the 20:80 and 80:20 blocks (dashed lines). (d) Average shift in rightward choices between block types, as a function of contrast, for each laboratory (colors as in Figures 2c and 3c; error bars show mean ±68% CI). (e) Shift in rightward choices as a function of contrast, separately for each lab. Each gray line represents an individual mouse, with the example mouse in black. Thick colored lines show the lab average. (f) Contrast threshold, (g) left lapses, (h) right lapses, and (i) bias, separately for the 20:80 and 80:20 block types. Each lab is shown as mean ± s.e.m. (j) Classifier results as in Figure 3f, based on all data points in (f-i).
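The bias-shift readout in (c) reduces to a grouped difference of choice fractions. A minimal sketch, assuming a hypothetical trials table with signed_contrast, block, and choice_right columns:

```python
# Sketch of the 'bias shift' readout in (c): per signed contrast, the
# difference in fraction of rightward choices between the two block types.
# `trials` is a hypothetical DataFrame with columns 'signed_contrast',
# 'block' ('20:80' or '80:20'), and 'choice_right' (0 or 1).
import pandas as pd

def bias_shift(trials: pd.DataFrame) -> pd.Series:
    p_right = (trials.groupby(["block", "signed_contrast"])["choice_right"]
                     .mean()
                     .unstack("block"))
    # The sign convention is arbitrary; the paper reads the shift per stimulus.
    return p_right["20:80"] - p_right["80:20"]
```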
Figure 5. A probabilistic model reveals a common strategy across mice and laboratories.
(a) Schematic diagram of the predictors included in the GLM. Each stimulus contrast (except 0%) was included as a separate predictor. Past choices were included separately for rewarded and unrewarded trials. The block prior predictor was used only to model data obtained in the full task. (b) Psychometric curves from the example mouse across three sessions in the basic task. The shaded region represents the 95% confidence interval of the model's predicted choice fraction. Points and error bars represent the mean and across-session confidence interval of the data. (c-d) Weights for the GLM predictors across labs in the basic task; error bars represent the 95% confidence interval across mice. (e-g) Same as (b-d), but for the full task.
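A minimal version of such a choice GLM can be written as a logistic regression. The sketch below follows the predictor layout in (a) but uses hypothetical column names and scikit-learn rather than the IBL's actual fitting code.

```python
# Minimal sketch of the choice GLM in (a) as a logistic regression: one
# indicator per signed contrast, plus previous choice split by outcome.
# Column names are hypothetical; the paper's model also excludes the 0%
# contrast predictor and adds a block prior predictor for the full task.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_choice_glm(trials: pd.DataFrame) -> pd.Series:
    # One indicator column per signed contrast value.
    X = pd.get_dummies(trials["signed_contrast"], prefix="c").astype(float)
    prev_choice = trials["choice"].shift(1)   # -1 = left, +1 = right
    prev_reward = trials["rewarded"].shift(1)
    # Previous choice enters separately for rewarded and unrewarded trials.
    X["past_rewarded"] = prev_choice.where(prev_reward == 1, 0)
    X["past_unrewarded"] = prev_choice.where(prev_reward == 0, 0)
    y = (trials["choice"] > 0).astype(int)    # 1 = rightward choice
    model = LogisticRegression(max_iter=1000).fit(X, y)
    return pd.Series(model.coef_[0], index=X.columns)
```

Cross-validated accuracy, as in Figure 5—figure supplement 2c, would then count a trial as correctly predicted when the fitted model assigns the actual choice a probability above 50%.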
Figure 5—figure supplement 1. History-dependent choice updating.
(a) Each animal's 'history strategy', defined as the bias shift in its psychometric function as a function of the choice made on the previous trial, shown separately for trials in which that choice was rewarded or unrewarded. Each animal is shown as a dot, with lab averages shown as larger colored dots. Contours indicate a two-dimensional kernel density estimate across all animals. The red arrow shows the group average in the basic task at its origin, and in the full task at its end (replicated between the left and right panels). (b) As (a), but with the strategy space corrected for slow fluctuations in decision bound (Lak et al., 2020a). When taking these slow state changes into account, the majority of animals use a win-stay/lose-switch strategy. (c) History-dependent choice updating, after removing the effect of slow fluctuations in decision bound, as a function of the previous trial's reward and stimulus contrast. After rewarded trials, choice updating is largest when the visual stimulus was highly uncertain (i.e. had low contrast) and strongly diminished after more certain, rewarded trials. This is in line with predictions from Bayesian models, in which an agent continually updates its beliefs about upcoming stimuli with sensory evidence (Lak et al., 2020a; Mendonça et al., 2018).
Figure 5—figure supplement 2. Parameters of the GLM of choice across labs.
(a) Parameters of the GLM for data obtained in the basic task. (b) Same, for the full task; the additional panel shows the extra parameter, that is, the bias shift between the two block types. (c) Cross-validated accuracy of the GLM across mice and laboratories. Each point represents the average accuracy for one mouse. Predictions were considered accurate if the GLM assigned the actual choice a probability greater than 50%.
Figure 6. Contribution diagram.
The following diagram illustrates the contributions of each author, based on the CRediT taxonomy (Brand et al., 2015). For each type of contribution there are three levels, indicated by color in the diagram: 'support' (light), 'equal' (medium), and 'lead' (dark).
