Classifying Promoters by Interpreting the Hidden Information of DNA Sequences via Deep Learning and Combination of Continuous FastText N-Grams
- PMID: 31750297
- PMCID: PMC6848157
- DOI: 10.3389/fbioe.2019.00305
Abstract
A promoter is a short region of DNA (100-1,000 bp) where transcription of a gene by RNA polymerase begins. It is typically located directly upstream or at the 5' end of the transcription initiation site. The DNA promoter has been proven to be the primary cause of many human diseases, especially diabetes, cancer, and Huntington's disease. Classifying promoters has therefore become an important problem that has attracted many researchers in bioinformatics. A variety of studies have addressed this problem; however, their performance still requires improvement. In this study, we present an innovative approach that interprets DNA sequences as a combination of continuous FastText N-grams, which are then fed into a deep neural network for classification. Our approach attains cross-validation accuracies of 85.41% and 73.1% in the two layers, respectively. These results outperform the state-of-the-art methods on the same dataset, especially in the second layer (strength classification). With this approach, promoter regions can be identified with high accuracy, which provides a basis for further biological research as well as precision medicine. In addition, this study opens new paths for applying natural language processing to omics data in general and DNA sequences in particular.
Keywords: DNA promoter; convolutional neural network; natural language processing; precision medicine; transcription factor; word embedding.
Copyright © 2019 Le, Yapp, Nagasundaram and Yeh.
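The abstract describes treating a DNA sequence as a combination of continuous N-grams before embedding and classification. The sketch below illustrates only that tokenization step, under the assumption that "continuous" means overlapping windows that advance one base at a time; the function names `continuous_ngrams` and `ngram_levels` and the choice of sizes 1 to 3 are hypothetical and do not come from the paper. The FastText embedding and the neural network are not shown.

```python
from typing import Dict, Iterable, List


def continuous_ngrams(sequence: str, n: int) -> List[str]:
    """Split a DNA sequence into overlapping n-grams with a stride of 1.

    Assumption: "continuous" n-grams are read as sliding windows; the
    paper may define them differently.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    seq = sequence.upper()
    # A sequence shorter than n yields an empty list.
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def ngram_levels(sequence: str, sizes: Iterable[int]) -> Dict[int, List[str]]:
    """Collect the n-gram 'words' of a sequence for several values of n."""
    return {n: continuous_ngrams(sequence, n) for n in sizes}


if __name__ == "__main__":
    example = "ATGCGT"  # toy sequence, not taken from the dataset
    for n, words in ngram_levels(example, (1, 2, 3)).items():
        # Each list can be treated as a "sentence" for a word-embedding model.
        print(n, " ".join(words))
```

Each list of N-grams can then be handled like a sentence of words, which is what makes word-embedding tools applicable to DNA sequences.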
