Research and Implementation of DNA Molecular Sequence Pattern Matching Algorithm Based on Bioinformatics
DOI: 10.12677/CSA.2023.132024. Supported by research project funding.
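Authors: Chen Tingyu (陈亭宇), Yin Guocai (尹国才), Wei Guosheng (魏国晟): School of Computer Science, North China Institute of Aerospace Engineering, Langfang, Hebei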
Keywords: Bioinformatics, Molecular Sequences, Pattern Matching Algorithm
Abstract: Bioinformatics is an interdisciplinary science that combines biological science with computer technology, applying mathematics, information science, and computing to organize, collate, and summarize biological and medical information. DNA sequence alignment is one of the most important and fundamental research directions in bioinformatics and an important means of exploring the relationship between genes and disease. The main objective of this paper is to find, in uncertain molecular sequence data, all sequences that are identical to a target sequence and whose occurrence probability exceeds a given threshold, and to report the total number of such occurrences together with the starting position of each. Existing "space-for-time" molecular sequence pattern matching algorithms are limited to counting occurrences, and the image stereo matching method based on the pairwise DNA sequence alignment algorithm from bioinformatics has limitations when the source data are uncertain. To address these problems, this paper proposes a DNA molecular sequence pattern matching algorithm based on the weighted suffix tree. The method uses the weighted suffix tree as its main data structure, improves matching accuracy on uncertain source data, and overcomes the limitation that the map data structure can only count occurrences. Experimental results show that the proposed algorithm achieves some improvement in matching speed and sensitivity.
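The problem statement in the abstract can be made concrete with a small brute-force baseline. The sketch below is not the weighted suffix tree algorithm proposed in the paper; it is a naive scan that only illustrates the inputs and outputs described above. It assumes that an uncertain sequence is represented as one probability distribution over bases per position and that positions are independent, so the probability of an occurrence is the product of the per-position probabilities. The names `WeightedSeq` and `match_weighted` are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Tuple

# Assumed representation of an uncertain (weighted) sequence:
# one {base: probability} mapping per position.
WeightedSeq = List[Dict[str, float]]


def match_weighted(seq: WeightedSeq, pattern: str,
                   threshold: float) -> Tuple[int, List[int]]:
    """Naive scan: return (count, start positions) of every occurrence of
    `pattern` whose probability is strictly greater than `threshold`.

    Assumes independent positions, so an occurrence's probability is the
    product of the per-position probabilities of the pattern's bases.
    """
    if not pattern:
        return 0, []
    starts: List[int] = []
    for i in range(len(seq) - len(pattern) + 1):
        prob = 1.0
        for j, base in enumerate(pattern):
            prob *= seq[i + j].get(base, 0.0)
            # Probabilities are at most 1, so the product can only shrink;
            # once it is at or below the threshold this start cannot match.
            if prob <= threshold:
                break
        else:
            starts.append(i)
    return len(starts), starts


# Small hand-made example (illustrative data, not from the paper).
example_seq: WeightedSeq = [
    {"A": 1.0},
    {"C": 0.5, "G": 0.5},
    {"A": 1.0},
    {"C": 0.9, "T": 0.1},
    {"A": 0.6, "G": 0.4},
]
count, positions = match_weighted(example_seq, "AC", 0.4)
print(count, positions)  # prints: 2 [0, 2]
```

This scan takes time proportional to the sequence length times the pattern length for every query. An index such as a weighted suffix tree is built once so that repeated queries do not have to rescan the whole sequence.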
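Cite this article: Chen, T., Yin, G. and Wei, G. (2023) Research and Implementation of DNA Molecular Sequence Pattern Matching Algorithm Based on Bioinformatics. Computer Science and Application, 13(2), 236-250. https://doi.org/10.12677/CSA.2023.132024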
