Word Embedding Based Machine Learning Method for RNA Flexibility Prediction
DOI: 10.12677/biphy.2024.122003. Supported by the National Natural Science Foundation of China.
Authors: Xiaofeng Zhu, Fubin Chang, Chunhua Li*: College of Chemistry and Life Science, Beijing University of Technology, Beijing
Keywords: RNA Flexibility, Word Embedding, Machine Learning
Abstract: The dynamics of RNA molecules are closely related to their functions. Flexibility, one of the most fundamental dynamic properties of RNA, has been widely used to study folding, structural stability, and ligand binding ability. Experimental measurement of RNA flexibility is often time-consuming and labor-intensive, so a fast and accurate theoretical method for predicting it is urgently needed. To this end, we propose RNAfwe, a machine learning method that predicts RNA flexibility using word embedding to extract features from the RNA sequence. We compare RNAfwe with RNAflex, a similar sequence-based method. Compared with RNAflex (One-Hot), which uses one-hot encoding, RNAfwe achieves higher Pearson correlation coefficients (PCC) on both the training and test sets, 0.5017 and 0.4704 respectively, indicating that word embedding extracts features more relevant to flexibility from RNA sequences than one-hot encoding does. Compared with RNAflex (PSSM), which uses evolutionary information, RNAfwe performs slightly worse, but RNAflex (PSSM) requires a sufficient number of known homologous sequences. This work contributes to the study of RNA dynamic properties and supports the wider use of word embedding in bioinformatics research.
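As a rough illustration of the two sequence encodings contrasted in the abstract and of the PCC metric used to compare them, the following minimal Python sketch may help. It is not the RNAfwe implementation: the k-mer length, the embedding dimension, the example sequence, and all function names are assumptions made here for illustration, and the randomly initialized lookup table merely stands in for vectors that a trained word embedding model would supply.

```python
import math
import random


def kmer_words(seq, k=3):
    """Split an RNA sequence into overlapping k-mer 'words' (k is an assumed value)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def one_hot(seq, alphabet="ACGU"):
    """Encode each nucleotide as a 4-dimensional indicator vector."""
    return [[1.0 if base == a else 0.0 for a in alphabet] for base in seq]


def build_embedding_table(words, dim=8, seed=0):
    """Hypothetical stand-in for a trained embedding: one random dense vector per word."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for w in sorted(set(words))}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Illustrative sequence, not taken from the paper's dataset.
seq = "GGCUAGCUA"
words = kmer_words(seq)                  # overlapping 3-mers
table = build_embedding_table(words)     # word -> dense vector
embedded = [table[w] for w in words]     # one dense vector per k-mer
sparse = one_hot(seq)                    # one sparse vector per nucleotide
```

In this sketch the one-hot encoding gives every nucleotide a fixed sparse vector, whereas the embedding assigns each k-mer a dense vector that a trained model could place near k-mers occurring in similar contexts. A function such as `pearson` would then be applied to predicted and observed per-nucleotide flexibility values to obtain a PCC like those reported above.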
Cite this article: Zhu, X., Chang, F. and Li, C. (2024) Word Embedding Based Machine Learning Method for RNA Flexibility Prediction. Biophysics, 12(2), 23-30. https://doi.org/10.12677/biphy.2024.122003
