grDNA-Prot: Prediction of DNA-Binding Proteins Based on Physicochemical Properties of Amino Acids and Support Vector Machine
Abstract: DNA-binding proteins play an important role in various intra- and extracellular activities. This paper proposes grDNA-Prot, a new predictor of DNA-binding proteins. Protein sequence information is described by the composition frequencies of the 20 amino acids together with a cylindrical graphical representation built on the 531 amino acid physicochemical property indices in the AAindex database. Three feature selection methods are then used to select the optimal features, and a support vector machine (SVM) model for identifying DNA-binding proteins is built with 5-fold cross-validation. To test the effectiveness of the method, its accuracy is compared with that of other methods on an independent test dataset. The results indicate that hydrophobicity (H), physicochemical properties (P), and alpha and turn properties (A) are the main amino acid physicochemical properties that distinguish DNA-binding proteins from non-DNA-binding proteins.
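Two of the building blocks named in the abstract, the 20 amino acid composition frequencies and the 5-fold split used for cross-validation, can be illustrated with a minimal sketch. The function names `aa_composition` and `k_fold_indices` and the example sequence are hypothetical and chosen only for illustration; the sketch does not reproduce the paper's cylindrical graphical representation, its feature selection, or its SVM model.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Return the 20 amino acid frequencies of a protein sequence.

    Non-standard characters (e.g. X, gaps) are ignored; this is an
    assumption of the sketch, not a rule stated in the paper.
    """
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard residues")
    counts = Counter(residues)
    return [counts[a] / len(residues) for a in AMINO_ACIDS]


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, test))
        start = stop
    return folds


if __name__ == "__main__":
    features = aa_composition("MKKLAVAK")  # made-up toy sequence
    print(len(features), round(sum(features), 6))
    for train, test in k_fold_indices(10, k=5):
        print("test fold:", test)
```

In a full pipeline, each protein's composition vector would be concatenated with its other descriptors before feature selection, and each (train, test) pair would be used to fit and score the classifier once; in practice the samples would normally be shuffled before splitting.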
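Cite this article: Zhang Yanping, Ni Jianwei, Gao Ya, Chen Pengcheng, Li Xutao. grDNA-Prot: Prediction of DNA-Binding Proteins Based on Physicochemical Properties of Amino Acids and Support Vector Machine [J]. Hans Journal of Computational Biology (计算生物学), 2021, 11(1): 1-11. https://doi.org/10.12677/HJCB.2021.111001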
