Clustering based approach for population level identification of condition-associated T-cell receptor β-chain CDR3 sequences

BMC Bioinformatics. 2021 Mar 25;22(1):159. doi: 10.1186/s12859-021-04087-7.


Background: Deep immune receptor sequencing, RepSeq, provides unprecedented opportunities for identifying and studying condition-associated T-cell clonotypes, represented by T-cell receptor (TCR) CDR3 sequences. However, due to the immense diversity of the immune repertoire, identification of condition relevant TCR CDR3s from total repertoires has mostly been limited to either "public" CDR3 sequences or to comparisons of CDR3 frequencies observed in a single individual. A methodology for the identification of condition-associated TCR CDR3s by direct population level comparison of RepSeq samples is currently lacking.

Results: We present a method for direct population level comparison of RepSeq samples using immune repertoire sub-units (or sub-repertoires) that are shared across individuals. The method first performs unsupervised clustering of CDR3s within each sample. It then finds matching clusters across samples, called immune sub-repertoires, and performs statistical differential abundance testing at the level of the identified sub-repertoires. It finally ranks CDR3s in differentially abundant sub-repertoires for relevance to the condition. We applied the method on total TCR CDR3β RepSeq datasets of celiac disease patients, as well as on public datasets of yellow fever vaccination. The method successfully identified celiac disease associated CDR3β sequences, as evidenced by considerable agreement of TRBV-gene and positional amino acid usage patterns in the detected CDR3β sequences with previously known CDR3βs specific to gluten in celiac disease. It also successfully recovered significantly high numbers of previously known CDR3β sequences relevant to each condition than would be expected by chance.

Conclusion: We conclude that immune sub-repertoires of similar immuno-genomic features shared across unrelated individuals can serve as viable units of immune repertoire comparison, serving as proxy for identification of condition-associated CDR3s.

Keywords: Antigen-specific TCR identification; Celiac disease associated TCR clonotypes; Computational antigen-specificity identification; Immune repertoire analysis; Immuno-informatics; TCR clustering; TCR differential abudance analysis; TCR repertoire analysis.

MeSH terms

  • Cluster Analysis
  • Humans
  • Receptors, Antigen, T-Cell*
  • Receptors, Antigen, T-Cell, alpha-beta* / genetics
  • T-Lymphocytes


  • Receptors, Antigen, T-Cell
  • Receptors, Antigen, T-Cell, alpha-beta