Impact of next-generation sequencing error on analysis of barcoded plasmid libraries of known complexity and sequence

Claire T Deakin; Jeffrey J Deakin; Samantha L Ginn; Paul Young; David Humphreys; Catherine M Suter; Ian E Alexander; Claus V Hallwirth

doi:10.1093/nar/gku607

Nucleic Acids Res. 2014;42(16):e129. doi: 10.1093/nar/gku607. Epub 2014 Jul 10.

Authors

Claire T Deakin¹, Jeffrey J Deakin¹, Samantha L Ginn¹, Paul Young², David Humphreys², Catherine M Suter³, Ian E Alexander⁴, Claus V Hallwirth¹

Affiliations

¹ Gene Therapy Research Unit, Children's Medical Research Institute and The Children's Hospital at Westmead, Westmead, New South Wales 2145, Australia.
² Molecular Genetics Division, Victor Chang Cardiac Research Institute, Sydney, Darlinghurst, New South Wales 2010, Australia.
³ Molecular Genetics Division, Victor Chang Cardiac Research Institute, Sydney, Darlinghurst, New South Wales 2010, Australia; Faculty of Medicine, University of New South Wales, Kensington, New South Wales 2052, Australia.
⁴ Gene Therapy Research Unit, Children's Medical Research Institute and The Children's Hospital at Westmead, Westmead, New South Wales 2145, Australia; Discipline of Paediatrics and Child Health, The Children's Hospital at Westmead Clinical School, The University of Sydney, Westmead, New South Wales 2145, Australia. Correspondence: ian.alexander@health.nsw.gov.au

Abstract

Barcoded vectors are promising tools for investigating clonal diversity and dynamics in hematopoietic gene therapy. Analysis of clones marked with barcoded vectors requires accurate identification of potentially large numbers of individually rare barcodes, when the exact number, sequence identity and abundance are unknown. This is an inherently challenging application, and the feasibility of using contemporary next-generation sequencing technologies is unresolved. To explore this potential application empirically, without prior assumptions, we sequenced barcode libraries of known complexity. Libraries containing 1, 10 and 100 Sanger-sequenced barcodes were sequenced using an Illumina platform, with a 100-barcode library also sequenced using a SOLiD platform. Libraries containing 1 and 10 barcodes were distinguished from false barcodes generated by sequencing error by a several log-fold difference in abundance. In 100-barcode libraries, however, expected and false barcodes overlapped and could not be resolved by bioinformatic filtering and clustering strategies. In independent sequencing runs multiple false-positive barcodes appeared to be represented at higher abundance than known barcodes, despite their confirmed absence from the original library. Such errors, which potentially impact barcoding studies in an application-dependent manner, are consistent with the existence of both stochastic and systematic error, the mechanism of which is yet to be fully resolved.
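
To make the abundance comparison concrete, the following is a minimal illustrative sketch and not the authors' pipeline. It assumes reads have already been reduced to barcode strings and that the set of Sanger-verified barcodes is available. The function name `abundance_gap` and the toy read counts are hypothetical. It reports the log10 ratio between the rarest expected barcode and the most abundant false barcode: a positive value means a simple abundance threshold could separate the two groups, and a negative value means they overlap.

```python
import math
from collections import Counter


def abundance_gap(reads, known_barcodes):
    """Return (gap, counts) for a list of barcode reads.

    gap is log10(min expected count / max false count), or None when
    either group is empty. This is an illustrative metric, not one
    defined in the article.
    """
    counts = Counter(reads)
    known = set(known_barcodes)
    expected = [n for bc, n in counts.items() if bc in known]
    false = [n for bc, n in counts.items() if bc not in known]
    if not expected or not false:
        return None, counts
    return math.log10(min(expected) / max(false)), counts


# Toy data (invented for illustration): two known barcodes plus
# low-abundance error-derived sequences.
known = ["ACGT", "TTGA"]
separable_reads = ["ACGT"] * 1000 + ["TTGA"] * 500 + ["ACGA"] * 5 + ["TTGC"]
separable_gap, _ = abundance_gap(separable_reads, known)

# Same library, but one false barcode outnumbers a known barcode.
overlapping_reads = ["ACGT"] * 1000 + ["TTGA"] * 500 + ["ACGA"] * 800
overlapping_gap, _ = abundance_gap(overlapping_reads, known)

print(f"separable gap:   {separable_gap:.2f} log10 units")
print(f"overlapping gap: {overlapping_gap:.2f} log10 units")
```

In the first toy case the groups are separated by two log10 units, loosely mirroring the situation described for the 1- and 10-barcode libraries. In the second, the negative gap mirrors the overlap reported for the 100-barcode libraries, where no single abundance cutoff can recover exactly the known set.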

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Artifacts
Gene Library*
High-Throughput Nucleotide Sequencing / methods*
Plasmids*
Polymerase Chain Reaction
Sequence Analysis, DNA / methods*