Extreme Learning Machine for Protein Subcellular Localization from Primary Sequence
DOI: 10.12677/HJDM.2013.31002. Supported by the National Natural Science Foundation of China.
Authors: Shi Feng (石峰), Chen Hong (陈洪), Xiong Huijuan (熊慧娟)*: College of Science, Huazhong Agricultural University, Wuhan
Keywords: Protein Subcellular Localization; Extreme Learning Machine; Homologous Protein
Abstract: Predicting protein subcellular localization from primary sequence plays an important role in genome annotation, protein function prediction, and drug discovery. The extreme learning machine (ELM) is a machine learning method that has emerged in recent years. This paper explores the potential of ELM for protein subcellular localization prediction. To this end, we first propose a new feature extraction strategy that represents each protein primary sequence as a 25-dimensional numerical vector. On this basis, we compare three methods on 852 mycobacterial proteins: a support vector machine (SVM) with the new features, ELM with the new features, and an existing SVM method with pseudo amino acid composition features. The 852 proteins were selected from the Swiss-Prot 48 database and belong to four classes. Five-fold cross-validation on these data shows that ELM with the new features achieves the highest accuracy, 97.2%, compared with 96.4% for SVM with the new features and 95.2% for SVM with pseudo amino acid composition features.
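As a rough illustration of the kind of classifier and evaluation protocol the abstract refers to, the sketch below trains a basic single-hidden-layer ELM (random hidden weights, output weights fitted by least squares) and scores it with five-fold cross-validation. It is not the authors' implementation. The abstract does not describe the 25-dimensional feature extraction, so `make_toy_data` generates synthetic four-class vectors as a stand-in, and the names `ELMClassifier` and `five_fold_accuracy`, the sigmoid activation, the uniform weight range, and the hidden-layer size are all assumptions made for this example.

```python
import numpy as np


class ELMClassifier:
    """Basic ELM: random fixed hidden layer, least-squares output layer."""

    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden  # hidden-layer size (assumed, not from the paper)
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activations of the random hidden layer.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Hidden weights and biases are drawn once and never trained.
        self.W = self.rng.uniform(-1.0, 1.0, size=(X.shape[1], self.n_hidden))
        self.b = self.rng.uniform(-1.0, 1.0, size=self.n_hidden)
        # One-hot targets, one column per class.
        T = (y[:, None] == self.classes_[None, :]).astype(float)
        # Output weights: minimum-norm least-squares solution via pseudoinverse.
        self.beta = np.linalg.pinv(self._hidden(X)) @ T
        return self

    def predict(self, X):
        scores = self._hidden(X) @ self.beta
        return self.classes_[np.argmax(scores, axis=1)]


def five_fold_accuracy(X, y, n_hidden=50, seed=0):
    """Overall accuracy pooled over five shuffled, disjoint test folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), 5)
    correct = 0
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        model = ELMClassifier(n_hidden, seed).fit(X[train], y[train])
        correct += int(np.sum(model.predict(X[test]) == y[test]))
    return correct / len(y)


def make_toy_data(n_per_class=50, n_features=25, n_classes=4, seed=0):
    """Synthetic stand-in data: one Gaussian cluster per class (hypothetical)."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=2.0, size=(n_classes, n_features))
    X = np.vstack([c + rng.normal(size=(n_per_class, n_features)) for c in centers])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print("five-fold accuracy on toy data:", five_fold_accuracy(X, y))
```

The accuracy printed for the toy data says nothing about the figures reported in the abstract; reproducing those would require the actual Swiss-Prot sequences and the paper's feature extraction.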
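Citation: Shi Feng, Chen Hong, Xiong Huijuan. Extreme Learning Machine for Protein Subcellular Localization from Primary Sequence [J]. Data Mining (数据挖掘), 2013, 3(1): 6-11. http://dx.doi.org/10.12677/HJDM.2013.31002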
