Hierarchical Clustering Can Identify B Cell Clones with High Confidence in Ig Repertoire Sequencing Data

J Immunol. 2017 Mar 15;198(6):2489-2499. doi: 10.4049/jimmunol.1601850. Epub 2017 Feb 8.


Adaptive immunity is driven by the expansion, somatic hypermutation, and selection of B cell clones. Each clone is the progeny of a single B cell responding to Ag, with diversified Ig receptors. These receptors can now be profiled on a large scale by next-generation sequencing. Such data provide a window into the microevolutionary dynamics that drive successful immune responses and the dysregulation that occurs with aging or disease. Clonal relationships are not directly measured; they must be computationally inferred from these sequencing data. Although several hierarchical clustering-based methods have been proposed, they vary in distance and linkage methods and have not yet been rigorously compared. In this study, we use a combination of human experimental and simulated data to characterize the performance of hierarchical clustering-based methods for partitioning sequences into clones. We find that single linkage clustering has high performance, with specificity, sensitivity, and positive predictive value all >99%, whereas other linkages result in a significant loss of sensitivity. Surprisingly, distance metrics that incorporate the biases of somatic hypermutation do not outperform simple Hamming distance. Although errors were more likely in sequences with short junctions, using the entire dataset to choose a single distance threshold for clustering is near optimal. Our results suggest that hierarchical clustering using single linkage with Hamming distance identifies clones with high confidence and provides a fully automated method for clonal grouping. The performance estimates we develop provide important context for interpreting clonal analysis of repertoire sequencing data and allow for rigorous testing of other clonal grouping algorithms.
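
As an illustration of the approach the abstract describes, the following is a minimal Python sketch of single linkage clustering with Hamming distance. It is not the authors' implementation. It assumes that junction sequences are compared only when they have equal length, that distance is normalized by junction length, and that a single cut threshold is applied to the whole dataset. The function names, the example sequences, and the threshold of 0.1 are hypothetical and chosen only for demonstration. The sketch relies on the fact that cutting a single linkage dendrogram at a threshold yields the connected components of the graph that links every pair of sequences within that threshold.

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def single_linkage_clones(junctions, threshold):
    """Group sequences into clones by single linkage at a fixed threshold.

    Two sequences share a clone if a chain of pairwise links, each with
    normalized Hamming distance <= threshold, connects them.
    """
    parent = list(range(len(junctions)))  # union-find forest

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(junctions)), 2):
        a, b = junctions[i], junctions[j]
        # Assumption: junctions of different length never share a clone.
        if len(a) == len(b) and hamming_fraction(a, b) <= threshold:
            parent[find(i)] = find(j)

    clones = {}
    for i, seq in enumerate(junctions):
        clones.setdefault(find(i), []).append(seq)
    return list(clones.values())


# Hypothetical junction sequences, for demonstration only.
junctions = [
    "ACGTACGTACGT",
    "ACGTACGTACGA",  # 1 mismatch from the first sequence
    "ACGTACGTAAGA",  # 1 mismatch from the second, 2 from the first
    "TTTTGGGGCCCC",  # unrelated
    "ACGTAC",        # different junction length
]
clones = single_linkage_clones(junctions, threshold=0.1)
for clone in clones:
    print(clone)
```

In this example the first and third sequences differ at 2 of 12 positions, which exceeds the threshold, yet they land in the same clone because each is within the threshold of the second sequence. This chaining behavior is what distinguishes single linkage from complete or average linkage, which would require stricter agreement across the whole group.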

MeSH terms

  • Adaptive Immunity / genetics
  • Antibody Diversity*
  • B-Lymphocytes / physiology*
  • Biological Evolution
  • Clone Cells
  • Cluster Analysis
  • Computational Biology
  • Computer Simulation
  • Datasets as Topic
  • Electronic Data Processing / methods*
  • Genetic Linkage
  • High-Throughput Nucleotide Sequencing
  • Humans
  • Immunoglobulins / genetics
  • Somatic Hypermutation, Immunoglobulin


Substances

  • Immunoglobulins