Modeling the Amplification of Immunoglobulins through Machine Learning on Sequence-Specific Features

Sci Rep. 2019 Jul 24;9(1):10748. doi: 10.1038/s41598-019-47173-w.


Successful primer design for polymerase chain reaction (PCR) hinges on the ability to identify primers that efficiently amplify template sequences. Here, we generated a novel Taq PCR data set that reports the amplification status for pairs of primers and templates from a reference set of 47 immunoglobulin heavy chain variable sequences and 20 primers. Using logistic regression, we developed TMM, a model for predicting whether a primer amplifies a template given their nucleotide sequences. The model suggests that the free energy of annealing, ΔG, is the key driver of amplification (p = 7.35e-12) and that 3' mismatches should be considered in conjunction with ΔG and the mismatch closest to the 3' terminus (p = 1.67e-05). We validated TMM by comparing its estimates with those from the thermodynamic model of DECIPHER (DE) and a model based solely on the free energy of annealing (FE). TMM outperformed the other approaches in terms of the area under the receiver operating characteristic curve (TMM: 0.953, FE: 0.941, DE: 0.896). TMM can improve primer design and is freely available via openPrimeR.
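
As a rough illustration of the approach described above, the following Python sketch shows how a logistic model could map ΔG and a 3' mismatch feature to an amplification probability, and how the area under the ROC curve could be computed for such scores. It is a minimal sketch under stated assumptions, not the TMM implementation from openPrimeR: the function names, the encoding of the mismatch feature, the coefficients, and the toy observations are all hypothetical, and the sketch omits any interaction between the two features.

```python
import math


def amplification_probability(delta_g, mismatch_pos_3p, coef):
    """Logistic model for P(amplified).

    delta_g: free energy of annealing (kcal/mol, more negative = more stable).
    mismatch_pos_3p: hypothetical encoding of the mismatch closest to the
        3' terminus (here: its distance from the 3' end).
    coef: (intercept, weight for delta_g, weight for mismatch_pos_3p);
        illustrative values only, not the fitted TMM coefficients.
    """
    b0, b_dg, b_pos = coef
    z = b0 + b_dg * delta_g + b_pos * mismatch_pos_3p
    return 1.0 / (1.0 + math.exp(-z))


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy observations (made up): (delta_g, mismatch_pos_3p, amplified?)
    toy = [(-12.0, 7, 1), (-10.5, 7, 1), (-9.0, 2, 0),
           (-4.0, 1, 0), (-8.0, 6, 1), (-5.5, 3, 0)]
    coef = (-6.0, -0.6, 0.4)  # hypothetical coefficients
    scores = [amplification_probability(dg, pos, coef) for dg, pos, _ in toy]
    labels = [y for _, _, y in toy]
    print("AUC on toy data:", roc_auc(labels, scores))
```

Comparing models by AUC, as the abstract does for TMM, FE, and DE, amounts to calling `roc_auc` with the same labels and each model's scores in turn.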

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • DNA Primers / genetics
  • DNA Primers / metabolism
  • Humans
  • Immunoglobulins / genetics
  • Immunoglobulins / metabolism*
  • Logistic Models
  • Machine Learning
  • Models, Statistical
  • Nucleic Acid Amplification Techniques / methods
  • Polymerase Chain Reaction / methods*


Substances

  • DNA Primers
  • Immunoglobulins

Associated data

  • figshare/10.6084/m9.figshare.6736175
  • figshare/10.6084/m9.figshare.6736232