Data-driven noise modeling of digital DNA melting analysis enables prediction of sequence discriminating power

Bioinformatics. 2020 Dec 23;36(22-23):5337-5343. doi: 10.1093/bioinformatics/btaa1053. Online ahead of print.

Abstract

Motivation: The need to rapidly screen complex samples for a wide range of nucleic acid targets, such as those associated with infectious diseases, remains unmet. Digital High-Resolution Melt (dHRM) is an emerging technology with the potential to meet this need by accomplishing broad-based, rapid nucleic acid sequence identification. Here, we set out to develop a computational framework for estimating the resolving power of dHRM technology for defined sequence profiling tasks. By deriving noise models from experimentally generated dHRM datasets and applying these to in silico predicted melt curves, we enable the production of synthetic dHRM datasets that faithfully recapitulate real-world variations arising from sample and machine variables. We then use these datasets to identify the most challenging melt curve classification tasks likely to arise for a given application and to test the performance of benchmark classifiers.
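As a rough illustration of the workflow described above (noise applied to predicted melt curves, followed by a classification test), the following Python sketch may help. It is not the authors' noise model or code: the sigmoidal curve shape, the Gaussian temperature-shift and signal noise, all parameter values, and the function names `ideal_melt_curve`, `synthetic_curves` and `nearest_template_accuracy` are assumptions chosen only to keep the example small.

```python
import numpy as np

def ideal_melt_curve(temps, tm, width=1.0):
    """Idealized sigmoidal melt curve (assumed shape): signal falls from 1 to 0 around tm."""
    return 1.0 / (1.0 + np.exp((temps - tm) / width))

def synthetic_curves(temps, tm, n, rng, shift_sd=0.2, signal_sd=0.01):
    """Draw n noisy copies of one clean curve using a hypothetical noise model:
    a per-curve Gaussian temperature shift plus per-point Gaussian signal noise."""
    shifts = rng.normal(0.0, shift_sd, size=n)
    clean = ideal_melt_curve(temps[None, :], tm + shifts[:, None])
    return clean + rng.normal(0.0, signal_sd, size=clean.shape)

def nearest_template_accuracy(temps, tm_a, tm_b, n, rng):
    """Classify noisy curves by squared distance to the two clean templates
    and return the fraction assigned to the correct template."""
    templates = np.stack([ideal_melt_curve(temps, tm_a),
                          ideal_melt_curve(temps, tm_b)])
    curves = np.concatenate([synthetic_curves(temps, tm_a, n, rng),
                             synthetic_curves(temps, tm_b, n, rng)])
    labels = np.repeat([0, 1], n)
    dists = ((curves[:, None, :] - templates[None, :, :]) ** 2).sum(axis=2)
    return float((dists.argmin(axis=1) == labels).mean())

rng = np.random.default_rng(0)
temps = np.linspace(60.0, 95.0, 351)  # assumed temperature grid in degrees C

# Well-separated melting temperatures versus nearly identical ones.
easy = nearest_template_accuracy(temps, 80.0, 85.0, 200, rng)
hard = nearest_template_accuracy(temps, 80.0, 80.1, 200, rng)
print(f"easy pair accuracy: {easy:.3f}, hard pair accuracy: {hard:.3f}")
```

Comparing the two accuracies gives a crude sense of how noise limits discrimination between similar curves. A data-driven noise model, as the abstract describes, would replace the assumed Gaussian terms with distributions estimated from experimental dHRM data.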

Results: This toolbox enables the in silico design and testing of broad-based dHRM screening assays and the selection of optimal classifiers. For an example application of screening common human bacterial pathogens, we show that the pathogens with the most similar sequences and melt curves are still reliably identifiable in the presence of experimental noise. Further, we find that ensemble methods outperform whole-series classifiers for this task and are in some cases able to resolve melt curves with single-nucleotide resolution.

Availability: Data and code are available at https://github.com/lenlan/dHRM-noise-modeling.

Supplementary information: Supplementary data are available at Bioinformatics online.