Addressing pandemic-wide systematic errors in the SARS-CoV-2 phylogeny

Martin Hunt; Angie S Hinrichs; Daniel Anderson; Lily Karim; Bethany L Dearlove; Jeff Knaggs; Bede Constantinides; Philip W Fowler; Gillian Rodger; Teresa Street; Sheila Lumley; Hermione Webster; Theo Sanderson; Christopher Ruis; Nicola de Maio; Lucas N Amenga-Etego; Dominic S Y Amuzu; Martin Avaro; Gordon A Awandare; Reuben Ayivor-Djanie; Matthew Bashton; Elizabeth M Batty; Yaw Bediako; Denise De Belder; Estefania Benedetti; Andreas Bergthaler; Stefan A Boers; Josefina Campos; Rosina Afua Ampomah Carr; Facundo Cuba; Maria Elena Dattero; Wanwisa Dejnirattisai; Alexander Dilthey; Kwabena Obeng Duedu; Lukas Endler; Ilka Engelmann; Ngiambudulu M Francisco; Jonas Fuchs; Etienne Z Gnimpieba; Soraya Groc; Jones Gyamfi; Dennis Heemskerk; Torsten Houwaart; Nei-Yuan Hsiao; Matthew Huska; Martin Hölzer; Arash Iranzadeh; Hanna Jarva; Chandima Jeewandara; Bani Jolly; Rageema Joseph; Ravi Kant; Karrie Ko Kwan Ki; Satu Kurkela; Maija Lappalainen; Marie Lataretu; Chang Liu; Gathsaurie Neelika Malavige; Tapfumanei Mashe; Juthathip Mongkolsapaya; Brigitte Montes; Jose Arturo Molina Mora; Collins M Morang'a; Bernard Mvula; Niranjan Nagarajan; Andrew Nelson; Joyce M Ngoi; Joana Paula da Paixão; Marcus Panning; Tomas Poklepovich; Peter K Quashie; Diyanath Ranasinghe; Mara Russo; James Emmanuel San; Nicholas D Sanderson; Vinod Scaria; Gavin Screaton; Tarja Sironen; Abay Sisay; Darren Smith; Teemu Smura; Piyada Supasa; Chayaporn Suphavilai; Jeremy Swann; Houriiyah Tegally; Bryan Tegomoh; Olli Vapalahti; Andreas Walker; Robert J Wilkinson; Carolyn Williamson; IMSSC2 Laboratory Network Consortium; Tulio de Oliveira; Timothy Ea Peto; Derrick Crook; Russell Corbett-Detig; Zamin Iqbal

doi:10.1101/2024.04.29.591666

bioRxiv [Preprint]. 2024 Apr 30:2024.04.29.591666. doi: 10.1101/2024.04.29.591666.

Authors

Martin Hunt^{1,2,3,4}, Angie S Hinrichs⁵, Daniel Anderson¹, Lily Karim^{5,6}, Bethany L Dearlove⁷, Jeff Knaggs^{1,2,3,4}, Bede Constantinides^{2,4}, Philip W Fowler^{2,3,4}, Gillian Rodger^{2,4}, Teresa Street^{2,3}, Sheila Lumley^{2,8}, Hermione Webster², Theo Sanderson⁹, Christopher Ruis^{10,11}, Nicola de Maio¹, Lucas N Amenga-Etego¹², Dominic S Y Amuzu¹², Martin Avaro¹³, Gordon A Awandare¹², Reuben Ayivor-Djanie^{14,15}, Matthew Bashton¹⁶, Elizabeth M Batty^{17,18}, Yaw Bediako¹², Denise De Belder¹⁹, Estefania Benedetti¹³, Andreas Bergthaler⁷, Stefan A Boers²⁰, Josefina Campos¹⁹, Rosina Afua Ampomah Carr^{15,21}, Facundo Cuba¹⁹, Maria Elena Dattero¹³, Wanwisa Dejnirattisai²², Alexander Dilthey²³, Kwabena Obeng Duedu^{15,24}, Lukas Endler⁷, Ilka Engelmann²⁵, Ngiambudulu M Francisco²⁶, Jonas Fuchs²⁷, Etienne Z Gnimpieba²⁸, Soraya Groc²⁹, Jones Gyamfi^{15,30}, Dennis Heemskerk²⁰, Torsten Houwaart²³, Nei-Yuan Hsiao³¹, Matthew Huska³², Martin Hölzer³², Arash Iranzadeh³³, Hanna Jarva³⁴, Chandima Jeewandara³⁵, Bani Jolly^{36,37}, Rageema Joseph³³, Ravi Kant^{38,39,40}, Karrie Ko Kwan Ki⁴¹, Satu Kurkela³⁴, Maija Lappalainen³⁴, Marie Lataretu³², Chang Liu^{42,43}, Gathsaurie Neelika Malavige³⁵, Tapfumanei Mashe⁴⁴, Juthathip Mongkolsapaya^{18,42,43}, Brigitte Montes²⁹, Jose Arturo Molina Mora⁴⁵, Collins M Morang'a¹², Bernard Mvula⁴⁶, Niranjan Nagarajan^{47,48}, Andrew Nelson⁴⁹, Joyce M Ngoi¹², Joana Paula da Paixão²⁶, Marcus Panning²⁷, Tomas Poklepovich¹⁹, Peter K Quashie¹², Diyanath Ranasinghe³⁵, Mara Russo¹³, James Emmanuel San^{50,51}, Nicholas D Sanderson^{2,3}, Vinod Scaria^{37,52}, Gavin Screaton², Tarja Sironen^{38,39}, Abay Sisay⁵³, Darren Smith¹⁶, Teemu Smura^{38,39}, Piyada Supasa^{42,43}, Chayaporn Suphavilai⁴⁷, Jeremy Swann², Houriiyah Tegally⁵⁴, Bryan Tegomoh^{55,56,57}, Olli Vapalahti^{38,39}, Andreas Walker⁵⁸, Robert J Wilkinson^{9,59,60}, Carolyn Williamson³³; IMSSC2 Laboratory Network Consortium; Tulio de Oliveira^{54,61}, Timothy Ea Peto², Derrick Crook², Russell Corbett-Detig^{5,6}, Zamin Iqbal^{1,62}

Affiliations

¹ European Molecular Biology Laboratory - European Bioinformatics Institute, Hinxton, UK.
² Nuffield Department of Medicine, University of Oxford, Oxford, UK.
³ National Institute of Health Research Oxford Biomedical Research Centre, John Radcliffe Hospital, Headley Way, Oxford, UK.
⁴ Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance, University of Oxford, Oxford, UK.
⁵ Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA.
⁶ Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA.
⁷ Institute for Hygiene and Applied Immunology, Center for Pathophysiology, Infectiology and Immunology, Medical University of Vienna, Vienna 1090, Austria.
⁸ Department of Infectious Diseases and Microbiology, John Radcliffe Hospital, Oxford, UK.
⁹ Francis Crick Institute, London, UK.
¹⁰ Victor Phillip Dahdaleh Heart & Lung Research Institute, University of Cambridge, Cambridge, UK.
¹¹ Department of Veterinary Medicine, University of Cambridge, Cambridge, UK.
¹² West African Centre for Cell Biology of Infectious Pathogens (WACCBIP), University of Ghana, Accra, Ghana.
¹³ Servicio de Virus Respiratorios, Instituto Nacional Enfermedades Infecciosas, ANLIS "Dr. Carlos G. Malbrán", Buenos Aires, Argentina.
¹⁴ Laboratory for Medical Biotechnology and Biomanufacturing, International Centre for Genetic Engineering and Biotechnology, Trieste, Italy.
¹⁵ Department of Biomedical Sciences, University of Health and Allied Sciences, Ho, Ghana.
¹⁶ The Hub for Biotechnology in the Built Environment, Department of Applied Sciences, Faculty of Health and Life Sciences, Northumbria University, Newcastle upon Tyne, NE1 8ST, UK.
¹⁷ Centre for Tropical Medicine and Global Health, Nuffield Department of Medicine, University of Oxford, Oxford, UK.
¹⁸ Mahidol-Oxford Tropical Medicine Research Unit, Bangkok, Thailand.
¹⁹ Unidad Operativa Centro Nacional de Genómica y Bioinformática, ANLIS "Dr. Carlos G. Malbrán", Buenos Aires, Argentina.
²⁰ Dept. Medical Microbiology, Leiden University Medical Center, Albinusdreef 2, 2333 ZA, Leiden, The Netherlands.
²¹ Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
²² Division of Emerging Infectious Disease, Research Department, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkoknoi, Bangkok 10700, Thailand.
²³ Institute of Medical Microbiology and Hospital Hygiene, University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
²⁴ College of Life Sciences, Birmingham City University, Birmingham, UK.
²⁵ Pathogenesis and Control of Chronic and Emerging Infections, Univ Montpellier, INSERM, Etablissement Français du Sang, Virology Laboratory, CHU Montpellier, Montpellier, France.
²⁶ Grupo de Investigação Microbiana e Imunológica, Instituto Nacional de Investigação em Saúde (National Institute for Health Research), Luanda, Angola.
²⁷ Institute of Virology, Freiburg University Medical Center, Faculty of Medicine, University of Freiburg, Freiburg, Germany.
²⁸ Biomedical Engineering Department, University of South Dakota, Sioux Falls, SD 57107.
²⁹ Virology Laboratory, CHU Montpellier, Montpellier, France.
³⁰ School of Health and Life Sciences, Teesside University, Middlesbrough, UK.
³¹ Division of Medical Virology, University of Cape Town and National Health Laboratory Service.
³² Genome Competence Center (MF1), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany.
³³ Computational Biology Division, University of Cape Town.
³⁴ HUS Diagnostic Center, Clinical Microbiology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland.
³⁵ Allergy Immunology and Cell Biology Unit, Department of Immunology and Molecular Medicine, University of Sri Jayewardenepura, Nugegoda, Sri Lanka.
³⁶ Karkinos Healthcare Private Limited (KHPL), Aurbis Business Parks, Bellandur, Bengaluru, Karnataka, 560103, India.
³⁷ Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh, India.
³⁸ Department of Veterinary Biosciences, University of Helsinki, 00014 Helsinki, Finland.
³⁹ Department of Virology, University of Helsinki, 00014 Helsinki, Finland.
⁴⁰ Department of Tropical Parasitology, Institute of Maritime and Tropical Medicine, Medical University of Gdansk, 81-519 Gdynia, Poland.
⁴¹ Department of Microbiology, Singapore General Hospital, Singapore.
⁴² Chinese Academy of Medical Science (CAMS) Oxford Institute (COI), University of Oxford, Oxford, UK.
⁴³ Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, UK.
⁴⁴ Health System Strengthening Unit, World Health Organisation, Harare, Zimbabwe.
⁴⁵ Centro de investigación en Enfermedades Tropicales & Facultad de Microbiología, Universidad de Costa Rica, Costa Rica.
⁴⁶ Public Health Institute of Malawi, Ministry of Health, Malawi.
⁴⁷ Genome Institute of Singapore, Agency for Science, Technology and Research (A*STAR), Singapore.
⁴⁸ Yong Loo Lin School of Medicine, National University of Singapore, Singapore.
⁴⁹ Department of Applied Sciences, Faculty of Health and Life Sciences, Northumbria University, Newcastle upon Tyne, NE1 8ST, UK.
⁵⁰ Duke Human Vaccine Institute, Duke University, Durham, NC 27710.
⁵¹ University of KwaZulu Natal, Durban, South Africa, 4001.
⁵² Vishwanath Cancer Care Foundation (VCCF), Neelkanth Business Park Kirol Village, West Mumbai, Maharashtra, 400086, India.
⁵³ Department of Medical Laboratory Sciences, College of Health Sciences, Addis Ababa University, P.O.Box 1176, Addis Ababa, Ethiopia.
⁵⁴ Centre for Epidemic Response and Innovation (CERI), Stellenbosch University, South Africa.
⁵⁵ Centre de Coordination des Opérations d'Urgences de Santé Publique, Ministere de Sante Publique, Cameroun.
⁵⁶ University of California, Berkeley, Berkeley, California, USA.
⁵⁷ Nebraska Department of Health and Human Services, Lincoln, Nebraska, USA.
⁵⁸ Institute of Virology, University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
⁵⁹ Centre for Infectious Diseases Research in Africa, University of Cape Town.
⁶⁰ Imperial College London, UK.
⁶¹ KwaZulu-Natal Research Innovation and Sequencing Platform (KRISP), University of KwaZulu-Natal, South Africa.
⁶² Milner Centre for Evolution, University of Bath, UK.

Abstract

The SARS-CoV-2 genome occupies a unique place in infection biology: it is the most highly sequenced genome on earth (making up over 20% of public sequencing datasets), with fine-scale information on sampling date and geography, and it has been subject to unprecedentedly intense analysis. As a result, these phylogenetic data are an incredibly valuable resource for science and public health. However, the vast majority of the data were sequenced by tiling amplicons across the full genome, with amplicon schemes that changed over the pandemic as mutations in the viral genome interacted with primer binding sites. In combination with the disparate set of genome assembly workflows and the lack of consistent quality control (QC) processes, the current genomes have many systematic errors that have evolved with the virus and amplicon schemes. These errors have significant impacts on the phylogeny, and therefore, over the last few years, many thousands of hours of researchers' time have been spent "eyeballing" trees, looking for artefacts, and then patching the tree. Given the huge value of this dataset, we set out to reprocess the complete set of public raw sequence data in a rigorous amplicon-aware manner and to build a cleaner phylogeny. Here we provide a global tree of 3,960,704 samples, built from a consistently assembled set of high-quality consensus sequences from all available public data as of March 2023, viewable at https://viridian.taxonium.org. Each genome was constructed using a novel assembly tool called Viridian (https://github.com/iqbal-lab-org/viridian), developed specifically to process amplicon sequence data, eliminating artefactual errors and masking the genome at low-quality positions. We provide simulation and empirical validation of the methodology, and quantify the improvement in the phylogeny. Phase 2 of our project will address the fact that the data in the public archives are heavily geographically biased towards the Global North. We have therefore contributed new raw data to ENA/SRA from many countries including Ghana, Thailand, Laos, Sri Lanka, India, Argentina and Singapore. We will incorporate these, along with all public raw data submitted between March 2023 and the current day, into an updated set of assemblies and an updated phylogeny. We hope the tree, consensus sequences and Viridian will be a valuable resource for researchers.

Publication types

Preprint
