qc3C: Reference-free quality control for Hi-C sequencing data

PLoS Comput Biol. 2021 Oct 11;17(10):e1008839. doi: 10.1371/journal.pcbi.1008839. eCollection 2021 Oct.

Abstract

Hi-C is a sample preparation method that enables high-throughput sequencing to capture genome-wide spatial interactions between DNA molecules. The technique has been successfully applied to solve challenging problems such as 3D structural analysis of chromatin, scaffolding of large genome assemblies and, more recently, the accurate resolution of metagenome-assembled genomes (MAGs). Despite continued refinements, however, preparing a Hi-C library remains a complex laboratory protocol. To avoid costly failures and maximise the odds of successful outcomes, diligent quality management is recommended. Current wet-lab methods provide only a crude assay of Hi-C library quality, while key post-sequencing quality indicators have thus far relied upon reference-based read-mapping. When a reference is accessible, this reliance introduces a concern for quality, where an incomplete or inexact reference skews the resulting quality indicators. We propose a new, reference-free approach that infers the total fraction of read-pairs that are a product of proximity ligation. This quantification of Hi-C library quality requires only a modest amount of sequencing data and is independent of other application-specific criteria. The algorithm builds upon the observation that proximity ligation events are likely to create k-mers that would not naturally occur in the sample. Our software tool (qc3C) is, to our knowledge, the first to implement reference-free Hi-C QC, and it also provides reference-based QC, enabling Hi-C to be more easily applied to non-model organisms and environmental samples. We characterise the accuracy of the new algorithm on simulated and real datasets and compare it to reference-based methods.
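
As an illustration of the k-mer observation above, the following is a minimal Python sketch and not the qc3C implementation. It assumes a known ligation junction sequence (here the hypothetical choice `GATCGATC`), an in-memory k-mer count table built from the reads themselves, and an arbitrary threshold `max_ratio`; the function names are invented for this example. The idea is that a k-mer spanning a true proximity-ligation junction should be rare compared with the k-mers that flank it, because the flanks occur naturally in the sample while the junction-spanning sequence does not.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer observed across all reads (toy in-memory table)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def looks_like_ligation(read, junction, counts, k, max_ratio=0.5):
    """Heuristically flag a read whose junction seems to be a ligation product.

    Compares the count of the k-mer centred on the junction with the counts
    of the k-mers immediately to its left and right. `max_ratio` is an
    assumed, illustrative threshold.
    """
    pos = read.find(junction)
    # No junction found, or not enough sequence on either side for a flank.
    if pos < k or pos + len(junction) + k > len(read):
        return False
    left = read[pos - k:pos]
    right = read[pos + len(junction):pos + len(junction) + k]
    start = pos + len(junction) // 2 - k // 2
    spanning = read[start:start + k]
    flank = min(counts[left], counts[right])
    return flank > 0 and counts[spanning] / flank <= max_ratio


def ligated_fraction(reads, junction, k, max_ratio=0.5):
    """Fraction of reads flagged as containing a ligation junction."""
    counts = count_kmers(reads, k)
    flagged = sum(looks_like_ligation(r, junction, counts, k, max_ratio)
                  for r in reads)
    return flagged / len(reads) if reads else 0.0
```

Calling `ligated_fraction(reads, "GATCGATC", k=9)` on a list of read strings returns the flagged fraction under these assumptions. A real analysis would also need to handle reverse complements, sequencing errors and junctions that fall outside the observed portion of a read, none of which this sketch attempts.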

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Animals
  • Chromosome Mapping* / methods
  • Chromosome Mapping* / standards
  • DNA / chemistry
  • DNA / genetics
  • Gene Library
  • Genomics* / methods
  • Genomics* / standards
  • High-Throughput Nucleotide Sequencing* / methods
  • High-Throughput Nucleotide Sequencing* / standards
  • Humans
  • Quality Control*
  • Software*
  • Turtles

Substances

  • DNA

Grants and funding

This research was supported by the Australian Government through the Australian Research Council Discovery Projects funding scheme under the project DP180101506, http://purl.org/au-research/grants/arc/DP180101506 (to AED). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.