Discovering nuclear targeting signal sequence through protein language learning and multivariate analysis

Anal Biochem. 2020 Feb 15:591:113565. doi: 10.1016/j.ab.2019.113565. Epub 2019 Dec 26.

Abstract

Nuclear localization signals (NLSs) are peptides that target proteins to the nucleus by binding to carrier proteins in the cytoplasm that transport their cargo across the nuclear membrane. Accurate identification of NLSs can help elucidate the functions of nuclear protein complexes. The currently known NLS predictors are usually specific to certain species or largely dependent on prior knowledge of NLS basic residues. Thus, a more general predictor is highly desired to reduce the potentially high false positives or false negatives in discovering new NLSs. Here, we report a new method, INSP (Identification Nucleus Signal Peptide), to effectively identify NLS mainly based on statistical knowledge and machine learning algorithms. In our NLS machine learning model, we considered the query protein sequence as text and extracted the sequence context features using a natural language model. These word-vector features encode discriminative knowledge of NLS motif frequency and are thus useful for model recognition. The output of the machine learning model will be fused with statistical knowledge of the query sequence to build a final multivariate regression model for NLS peptide identification. The experimental results demonstrate a promising performance of the new INSP approach. INSP is freely available at: www.csbio.sjtu.edu.cn/bioinf/INSP/for academic use.

Keywords: Machine learning; Natural language processing; Nuclear localization signal; Targeting signal prediction.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acid Sequence
  • Databases, Protein
  • Machine Learning*
  • Natural Language Processing*
  • Nuclear Localization Signals / chemistry*

Substances

  • Nuclear Localization Signals