Mol Biol Evol. 2021 Dec 9;38(12):5819-5824.
doi: 10.1093/molbev/msab264.

A Daily-Updated Database and Tools for Comprehensive SARS-CoV-2 Mutation-Annotated Trees

Jakob McBroome et al. Mol Biol Evol. 2021.

Abstract

The vast scale of SARS-CoV-2 sequencing data has made it increasingly challenging to comprehensively analyze all available data using existing tools and file formats. To address this, we present a database of SARS-CoV-2 phylogenetic trees inferred with unrestricted public sequences, which we update daily to incorporate new sequences. Our database uses the recently proposed mutation-annotated tree (MAT) format to efficiently encode the tree with branches labeled with parsimony-inferred mutations, as well as Nextstrain clade and Pango lineage labels at clade roots. As of June 9, 2021, our SARS-CoV-2 MAT consists of 834,521 sequences and provides a comprehensive view of the virus' evolutionary history using public data. We also present matUtils, a command-line utility for rapidly querying, interpreting, and manipulating the MATs. Our daily-updated SARS-CoV-2 MAT database and matUtils software are available at http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ and https://github.com/yatisht/usher, respectively.
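The idea behind the MAT format, a tree whose branches carry mutations and whose clade roots carry labels, can be illustrated with a small sketch. This is a toy in-memory model, not the actual protobuf schema used by UShER: the `Node` class, its field names, the sample names, and the mutations are all hypothetical. It shows how a sample's differences from the reference can be recovered by accumulating mutations along the path from the root.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One node of a toy mutation-annotated tree (hypothetical structure)."""
    name: str
    # Mutations on the branch leading to this node, written as
    # <reference base><1-based position><new base>, e.g. "C2G".
    mutations: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)
    clade: Optional[str] = None  # label stored at a clade root, if any


def genotypes(root: Node) -> Dict[str, Dict[int, str]]:
    """Map each leaf name to {position: base} for sites that differ from the reference."""
    result: Dict[str, Dict[int, str]] = {}

    def walk(node: Node, inherited: Dict[int, str]) -> None:
        state = dict(inherited)  # copy so sibling branches stay independent
        for mutation in node.mutations:
            position, new_base = int(mutation[1:-1]), mutation[-1]
            state[position] = new_base
        if not node.children:
            result[node.name] = state
        for child in node.children:
            walk(child, state)

    walk(root, {})
    return result


# Hypothetical toy tree: three samples, one labeled clade root.
toy_tree = Node("root", children=[
    Node("node_1", ["C2G"], clade="19A", children=[
        Node("sample_A", ["A4U"]),
        Node("sample_B"),
    ]),
    Node("sample_C", ["G5C"]),
])

toy_genotypes = genotypes(toy_tree)
print(toy_genotypes)
```

Because each mutation is stored once, on the branch where it was inferred to occur, samples that share ancestry share storage; full sequences never need to be kept per sample.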

Keywords: COVID-19; SARS-CoV-2 phylogenetics; genomic surveillance.

Figures

Fig. 1
matUtils functions enable fast, user-friendly analysis of MATs. (A) An example MAT with tree topology corresponding to the MAT on the left and the mutation annotations on each node shown on the right. (B) matUtils annotate allows the user to annotate internal nodes with clade names. In this example, nodes 1 and 3 are annotated with clade names 19A and 19B, respectively. This MAT serves as an input to commands shown in panels C–F. (C) matUtils summary outputs sample-, clade-, and tree-level statistics for the input MAT. (D) matUtils extract allows users to convert an MAT to Newick format (left), subset the MAT for a specified clade (center) or mutation (right), among other functions. (E) matUtils uncertainty outputs parsimony scores, equally parsimonious placements and neighborhood sizes for each sample of an input MAT. Sample B has two equally parsimonious placements, as it could also be placed as a descendant of node 5 with terminal mutations C2G, A4U, and G5C. (F) matUtils introduce can take a list of samples of interest as input and output the largest monophyletic clade and regional association index associated with the input population, along with their predicted introduction nodes and paths. In all panels, user input commands are shown in large fonts (e.g., “matUtils annotate”) and output text from these commands is shown in monospaced fonts.
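The Newick conversion mentioned for panel D can be sketched on a toy tree. This is an illustrative serializer, not matUtils code: the tuple layout, node names, and mutations are hypothetical, and using the number of branch mutations as the branch length is an assumption made here for the example.

```python
from typing import List, Tuple

# A toy node is (name, mutations on the branch leading to it, children).
# This layout is hypothetical and chosen only for brevity.
ToyNode = Tuple[str, List[str], list]


def to_newick(node: ToyNode) -> str:
    """Serialize a toy tree to Newick, with mutation counts as branch lengths."""
    name, mutations, children = node
    label = f"{name}:{len(mutations)}"
    if not children:
        return label
    inner = ",".join(to_newick(child) for child in children)
    return f"({inner}){label}"


toy_tree: ToyNode = ("root", [], [
    ("node_1", ["C2G"], [
        ("sample_A", ["A4U"], []),
        ("sample_B", [], []),
    ]),
    ("sample_C", ["G5C"], []),
])

newick = to_newick(toy_tree) + ";"
print(newick)
```

Newick keeps only topology, names, and branch lengths, so the identity of each mutation is lost in the conversion; that is the information the MAT format retains.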
Fig. 2
matUtils can generate informative visuals with Auspice. The above trees represent a clade of related B.1.1.7 samples from the United States that secondarily acquired the potentially important spike protein mutation E484K, which is caused by the nucleotide mutation G23012A. These trees were obtained by running the command “matUtils extract -i public-2021-06-09.all.masked.nextclade.pangolin.pb.gz -c B.1.1.7 -m G23012A -H '(USA.*)' -N 500 -j clade_trees -d clade_out”, which selects all samples from clade B.1.1.7 that acquired this mutation and are from the United States, then identifies the minimum set of 500-sample subtrees that contain all of these samples, creating an Auspice v2 format JSON for each subtree (Hadfield et al. 2018). This results in 35 distinct subtree JSON files of 500 samples each in the output directory. Panel A represents the entirety of subtree six as viewed with Auspice (Hadfield et al. 2018), including blue highlights and a branch label where our mutation of interest occurred. Panel B is zoomed in on this subtree and its sister clade; at this scale, we can read individual sample names and observe that this specific strain was actively spreading in the United States during April 2021.
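For scripted use, the command quoted in this caption can be assembled as an argument list so that the regular expression is quoted safely. The sketch below uses only the flags that appear in the caption; the helper name `build_extract_command` and its parameter names are hypothetical, and the flag meanings in the comments are inferred from the caption's description rather than from the matUtils documentation.

```python
import shlex
from typing import List


def build_extract_command(mat_path: str, clade: str, mutation: str,
                          sample_regex: str, subtree_size: int,
                          json_prefix: str, out_dir: str) -> List[str]:
    """Assemble the matUtils extract call from the Fig. 2 caption as an argv list."""
    return [
        "matUtils", "extract",
        "-i", mat_path,           # input MAT file
        "-c", clade,              # keep samples from this clade
        "-m", mutation,           # keep samples carrying this mutation
        "-H", sample_regex,       # keep samples whose names match this pattern
        "-N", str(subtree_size),  # size of each subtree
        "-j", json_prefix,        # prefix for the Auspice JSON files
        "-d", out_dir,            # output directory
    ]


argv = build_extract_command(
    "public-2021-06-09.all.masked.nextclade.pangolin.pb.gz",
    "B.1.1.7", "G23012A", "(USA.*)", 500, "clade_trees", "clade_out",
)
command_line = shlex.join(argv)  # shell-safe string, for logging or copy-paste
print(command_line)
```

With matUtils installed and the MAT file present, the list could be passed to `subprocess.run(argv, check=True)`; passing a list rather than a string avoids shell interpretation of the parentheses and asterisk in the pattern.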
Fig. 3
matUtils uncertainty statistics reveal low-quality sample placements. This Auspice view of an example subtree is annotated with both equally parsimonious placements (in color) and neighborhood size (branch label integers). Eighteen of the 23 samples in the subtree have a single placement and a neighborhood size of 0, indicating high placement certainty for those samples. Of the five samples with multiple equally parsimonious placements, one has five equally parsimonious placements and a neighborhood size of 19, indicating high placement uncertainty for this sample, spanning a relatively large neighborhood.
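One way to act on these two statistics is to flag samples with more than one equally parsimonious placement and rank them by how uncertain they are. The sketch below assumes a tab-separated table with one row per sample; the column names, sample names, and values are hypothetical and are not taken from the paper or from real matUtils output.

```python
import csv
import io
from typing import List, Tuple

# Hypothetical table; the column names are an assumption made for this example.
TOY_TSV = (
    "sample\tequally_parsimonious_placements\tneighborhood_size\n"
    "sample_1\t1\t0\n"
    "sample_2\t1\t0\n"
    "sample_3\t2\t1\n"
    "sample_4\t5\t19\n"
)


def uncertain_samples(tsv_text: str, max_placements: int = 1) -> List[Tuple[str, int, int]]:
    """Return (sample, placements, neighborhood size) rows above the placement threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    flagged = []
    for row in reader:
        placements = int(row["equally_parsimonious_placements"])
        size = int(row["neighborhood_size"])
        if placements > max_placements:
            flagged.append((row["sample"], placements, size))
    # Most uncertain first: more placements, then larger neighborhood.
    return sorted(flagged, key=lambda item: (item[1], item[2]), reverse=True)


flagged = uncertain_samples(TOY_TSV)
print(flagged)
```

Samples with a single placement and a neighborhood size of 0 pass through unflagged, mirroring the high-certainty samples described in the caption.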

Cited by

  • phastSim: Efficient simulation of sequence evolution for pandemic-scale datasets.
    De Maio N, Boulton W, Weilguny L, Walker CR, Turakhia Y, Corbett-Detig R, Goldman N. PLoS Comput Biol. 2022 Apr 29;18(4):e1010056. doi: 10.1371/journal.pcbi.1010056. eCollection 2022 Apr. PMID: 35486906.
  • Wastewater sequencing reveals early cryptic SARS-CoV-2 variant transmission.
    Karthikeyan S, et al. Nature. 2022 Sep;609(7925):101-108. doi: 10.1038/s41586-022-05049-6. Epub 2022 Jul 7. PMID: 35798029.
  • Positive selection underlies repeated knockout of ORF8 in SARS-CoV-2 evolution.
    Wagner C, Kistler KE, Perchetti GA, Baker N, Frisbie LA, Torres LM, Aragona F, Yun C, Figgins M, Greninger AL, Cox A, Oltean HN, Roychoudhury P, Bedford T. Nat Commun. 2024 Apr 13;15(1):3207. doi: 10.1038/s41467-024-47599-5. PMID: 38615031.
  • Maximum likelihood pandemic-scale phylogenetics.
    De Maio N, Kalaghatgi P, Turakhia Y, Corbett-Detig R, Minh BQ, Goldman N. Nat Genet. 2023 May;55(5):746-752. doi: 10.1038/s41588-023-01368-0. Epub 2023 Apr 10. PMID: 37038003.
  • The ongoing evolution of UShER during the SARS-CoV-2 pandemic.
    Hinrichs A, Ye C, Turakhia Y, Corbett-Detig R. Nat Genet. 2024 Jan;56(1):4-7. doi: 10.1038/s41588-023-01622-5. PMID: 38155331.

References

    1. Ané C, Sanderson MJ. 2005. Missing the forest for the trees: phylogenetic compression and its implications for inferring complex evolutionary histories. Syst Biol. 54(1):146–157.
    2. Chaillon A, Smith DM. 2021. Phylogenetic analyses of SARS-CoV-2 B.1.1.7 lineage suggest a single origin followed by multiple exportation events versus convergent evolution. Clin Infect Dis. Advance Access published March 26, 2021, doi:10.1093/cid/ciab265.
    3. Clark AG, Eisen MB, Smith DR, Bergman CM, Oliver B, Markow TA, Kaufman TC, Kellis M, Gelbart W, Iyer VN, et al.; Drosophila 12 Genomes Consortium. 2007. Evolution of genes and genomes on the Drosophila phylogeny. Nature 450(7167):203–218.
    4. Cyranoski D. 2021. Alarming COVID variants show vital role of genomic surveillance. Nature 589(7842):337–338.
    5. da Silva Filipe A, Shepherd JG, Williams T, Hughes J, Aranday-Cortes E, Asamaphan P, Ashraf S, Balcazar C, Brunker K, Campbell A, et al.; COVID-19 Genomics UK (COG-UK) Consortium. 2021. Genomic epidemiology reveals multiple introductions of SARS-CoV-2 from mainland Europe into Scotland. Nat Microbiol. 6(1):112–122.
