Data-driven supervised learning of a viral protease specificity landscape from deep sequencing and molecular simulations

Proc Natl Acad Sci U S A. 2019 Jan 2;116(1):168-176. doi: 10.1073/pnas.1805256116. Epub 2018 Dec 26.


Biophysical interactions between proteins and peptides are key determinants of molecular recognition specificity landscapes. However, an understanding of how molecular structure and residue-level energetics at protein-peptide interfaces shape these landscapes remains elusive. We combine information from yeast-based library screening, next-generation sequencing, and structure-based modeling in a supervised machine learning approach to report the comprehensive sequence-energetics-function mapping of the specificity landscape of the hepatitis C virus (HCV) NS3/4A protease, whose function (site-specific cleavages of the viral polyprotein) is a key determinant of viral fitness. We screened a library of substrates in which five residue positions were randomized and measured cleavability of ∼30,000 substrates (∼1% of the library) using yeast display and fluorescence-activated cell sorting followed by deep sequencing. Structure-based models of a subset of experimentally derived sequences were used in a supervised learning procedure to train a support vector machine to predict the cleavability of 3.2 million substrate variants by the HCV protease. The resulting landscape allows identification of previously unidentified HCV protease substrates, and graph-theoretic analyses reveal extensive clustering of cleavable and uncleavable motifs in sequence space. Specificity landscapes of known drug-resistant variants are similarly clustered. The described approach should enable the elucidation and redesign of specificity landscapes of a wide variety of proteases, including human-origin enzymes. Our results also suggest a possible role for residue-level energetics in shaping plateau-like functional landscapes predicted from viral quasispecies theory.
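The library arithmetic and the idea of clustering in sequence space can be illustrated with a small sketch. It is not the authors' code. It assumes that each of the five randomized positions can hold any of the 20 canonical amino acids, which matches the 3.2 million variants quoted above. It also assumes a simple graph in which two sequences are linked when they differ by a single substitution; the abstract does not specify the graph construction. The example sequences and the helper names (`neighbors`, `connected_components`) are hypothetical.

```python
# Illustrative sketch only; names and toy sequences are hypothetical.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues (assumption)
N_POSITIONS = 5                       # five randomized positions (from the abstract)

# Size of the full combinatorial library: 20^5.
library_size = len(AMINO_ACIDS) ** N_POSITIONS

# Fraction of the library covered by ~30,000 measured substrates.
screened_fraction = 30_000 / library_size


def neighbors(seq):
    """Yield every sequence that differs from seq by one substitution."""
    for i, original in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != original:
                yield seq[:i] + aa + seq[i + 1:]


def connected_components(sequences):
    """Group sequences into clusters linked by single substitutions."""
    remaining = set(sequences)
    components = []
    while remaining:
        seed = remaining.pop()
        component = {seed}
        stack = [seed]
        while stack:
            current = stack.pop()
            for nb in neighbors(current):
                if nb in remaining:
                    remaining.remove(nb)
                    component.add(nb)
                    stack.append(nb)
        components.append(component)
    return components


# Hypothetical set of sequences labeled "cleavable" (made up for illustration).
toy_cleavable = ["DEMEE", "DEMEC", "DEMAC", "AAAAA"]
clusters = connected_components(toy_cleavable)
cluster_sizes = sorted(len(c) for c in clusters)

print(library_size)                 # 3200000
print(round(screened_fraction, 4))  # 0.0094, i.e. roughly 1%
print(cluster_sizes)                # [1, 3]
```

In this toy case three sequences form one cluster because each is one substitution from another member, while the fourth is isolated. Applying the same kind of analysis to predicted labels for the whole library is one plausible reading of "clustering of cleavable and uncleavable motifs in sequence space".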

Keywords: machine learning; molecular modeling; protease; sequence−function mapping; substrate specificity.

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Antiviral Agents / pharmacology
  • Drug Resistance, Viral / genetics
  • Energy Metabolism
  • Hepacivirus / drug effects
  • Hepacivirus / enzymology
  • Hepacivirus / genetics
  • Hepacivirus / metabolism
  • High-Throughput Nucleotide Sequencing*
  • Metabolomics / methods
  • Models, Molecular*
  • Peptide Hydrolases / genetics*
  • Peptide Hydrolases / metabolism
  • Serine Proteases / genetics*
  • Serine Proteases / metabolism
  • Structure-Activity Relationship
  • Substrate Specificity
  • Supervised Machine Learning*
  • Viral Nonstructural Proteins / genetics*
  • Viral Nonstructural Proteins / metabolism

Substances


  • Antiviral Agents
  • Viral Nonstructural Proteins
  • NS3-4A serine protease, Hepatitis C virus
  • Peptide Hydrolases
  • Serine Proteases