Identifying the Binding Residues between Disease-Associated Proteins and Metal-Ion Ligands Based on Machine Learning Algorithm
Abstract: Protein-ligand interactions play an important role in disease pathogenesis. Many proteins must bind specific ligands to carry out their functions, and metal-ion ligands are important among these. Identifying which residues of a protein interact with metal-ion ligands helps researchers understand the molecular mechanism of protein-metal-ion interaction, and it matters for human health and precision medicine. In this paper, we use machine learning algorithms to study the binding of disease-associated proteins to three metal-ion ligands: Zn2+, Mg2+, and Ca2+. We extract three sequence features: the position-specific scoring matrix (PSSM), amino acid composition information, and dipeptide composition. We then use the random forest algorithm and the support vector machine algorithm to build classification models for the binding residues of the three metal-ion ligands. With feature fusion, the highest accuracy (Acc) reached 87% for Zn2+ binding residues; the highest accuracy was 70% for Mg2+ binding residues and 70% for Ca2+ binding residues. These results show that our models have some ability to identify the binding residues of the three metal-ion ligands.
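Two of the sequence features named in the abstract, amino acid composition and dipeptide composition, along with the accuracy measure, can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the use of a short sequence fragment around each residue, the fusion of features by simple concatenation, and the example confusion counts are all assumptions, and the PSSM feature is omitted because it requires an external alignment search.

```python
from itertools import product

# The 20 standard amino acids and all 400 ordered pairs of them.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(fragment):
    """Return the frequency of each standard amino acid (20 values)."""
    residues = [r for r in fragment if r in AMINO_ACIDS]
    total = len(residues) or 1  # avoid division by zero on empty input
    return [residues.count(a) / total for a in AMINO_ACIDS]


def dipeptide_composition(fragment):
    """Return the frequency of each adjacent residue pair (400 values)."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(fragment) - 1):
        pair = fragment[i:i + 2]
        if pair in counts:  # skip pairs with non-standard residues
            counts[pair] += 1
    total = sum(counts.values()) or 1
    return [counts[d] / total for d in DIPEPTIDES]


def fuse_features(fragment):
    """Assumed fusion strategy: concatenate the two feature vectors."""
    return amino_acid_composition(fragment) + dipeptide_composition(fragment)


def accuracy(tp, tn, fp, fn):
    """Standard accuracy: correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


if __name__ == "__main__":
    features = fuse_features("ACCA")  # hypothetical fragment
    print(len(features))
    print(accuracy(tp=8, tn=9, fp=1, fn=2))  # made-up counts
```

A fused vector of this kind, one per candidate residue, is the sort of input that a random forest or support vector machine classifier would take to label a residue as binding or non-binding.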
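Cite this article: Zou Xianghui, Feng Yong'e. Identifying the Binding Residues between Disease-Associated Proteins and Metal-Ion Ligands Based on Machine Learning Algorithm [J]. Hans Journal of Computational Biology, 2022, 12(3): 23-31. https://doi.org/10.12677/HJCB.2022.123004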
