Prediction of Long Non-Coding RNAs Based on PCA and Neural Network
Abstract: Long non-coding RNAs (lncRNAs) are transcripts of more than 200 nucleotides that do not encode proteins. In recent years, lncRNAs have been found to play important roles in many biological processes, and the first step in studying their functions is to identify them accurately. In this paper, we propose a new method for identifying lncRNAs based on principal component analysis (PCA) and a multilayer perceptron (MLP). We take the k-mer features of each transcript as the original feature vector and apply PCA to reduce its dimension. The reduced feature vector is then fed into an MLP with five hidden layers, which predicts whether the transcript is a long non-coding RNA. We evaluated the method on transcript sequences from human, mouse, and zebrafish, and it achieved accuracies of 94.74%, 93.25%, and 93.04%, respectively, on the normal-type test sets of these species.
Cite this article: Cao, B. (2022) Prediction of Long Non-Coding RNAs Based on PCA and Neural Network. Advances in Applied Mathematics, 11(9), 6670-6677. https://doi.org/10.12677/AAM.2022.119706
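
The first step of the pipeline described in the abstract, turning a transcript into a k-mer feature vector, might look roughly like the sketch below. It is a minimal illustration and not the authors' implementation: the function names `kmer_frequencies` and `feature_vector`, the choice of k values, the normalization to frequencies, and the handling of ambiguous bases are all assumptions, since the abstract does not specify them.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(sequence, k=3):
    """Return the normalized k-mer frequency vector of a nucleotide sequence.

    The vector has 4**k entries, ordered lexicographically over ACGT.
    Windows containing any other symbol (for example N) are skipped.
    This normalization is an assumption, not taken from the paper.
    """
    # Treat RNA input (U) the same as DNA input (T).
    sequence = sequence.upper().replace("U", "T")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    # Slide a window of width k over the sequence, one base at a time.
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        if window in index:
            counts[index[window]] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]


def feature_vector(sequence, ks=(1, 2, 3)):
    """Concatenate frequency vectors for several k (the ks values are hypothetical)."""
    features = []
    for k in ks:
        features.extend(kmer_frequencies(sequence, k))
    return features
```

Vectors built this way would then be reduced with PCA and passed to the classifier. One possible choice, again an assumption rather than something the abstract states, is scikit-learn's `PCA` followed by `MLPClassifier` configured with five hidden layers.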
