Comparison of DNA Sequences Based on Prefix Identifiers and Their Locations
Abstract:
Molecular sequence comparison is one of the most fundamental problems in bioinformatics, and DNA sequence similarity analysis is an important research topic within it. Alignment-free methods are one approach to sequence comparison; they overcome limitations of alignment-based methods and are computationally faster. This paper proposes an alignment-free method for sequence analysis based on the locations of prefix identifiers and on information entropy. A prefix tree is built for each biological sequence to obtain its prefix identifiers. For each pair of sequences, the prefix identifiers common to both are taken as the objects of study, and their positions in each sequence are extracted. The absolute value of the difference between the two positions is treated as a random variable, and information entropy is used to define a new similarity measure for DNA sequences and to establish an effective model. The mitochondrial DNA sequences of 70 mammals are used as the experimental data set, and the similarity distances obtained from the model are used to construct a phylogenetic tree. The classification given by this tree conforms to the current biological classification standard.
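The pipeline described in the abstract can be illustrated with a small Python sketch. It is not the paper's model, because the abstract does not give the exact definitions. Two assumptions are made here: a prefix identifier of position i is taken to be the shortest substring starting at i that occurs exactly once in the sequence, and the entropy is the Shannon entropy of the empirical distribution of absolute position differences. The function names are hypothetical, and a brute-force search stands in for the prefix tree, so the sketch is only suitable for short sequences.

```python
from collections import Counter
from math import log2


def occurrences(seq, sub):
    """Count occurrences of sub in seq, overlapping ones included."""
    return sum(1 for j in range(len(seq) - len(sub) + 1)
               if seq.startswith(sub, j))


def prefix_identifiers(seq):
    """Map each prefix identifier to its 0-based start position.

    Assumption: the identifier of position i is the shortest substring
    starting at i that occurs exactly once in seq. Positions with no
    such substring are skipped. Brute force replaces the prefix tree.
    """
    ids = {}
    n = len(seq)
    for i in range(n):
        for length in range(1, n - i + 1):
            sub = seq[i:i + length]
            if occurrences(seq, sub) == 1:
                ids[sub] = i
                break
    return ids


def position_difference_entropy(a, b):
    """Shannon entropy (bits) of |pos_a - pos_b| over common identifiers.

    Returns None when the two sequences share no prefix identifier.
    How the paper converts entropy into a distance is not stated in
    the abstract, so only the entropy itself is returned.
    """
    ids_a, ids_b = prefix_identifiers(a), prefix_identifiers(b)
    common = ids_a.keys() & ids_b.keys()
    if not common:
        return None
    diffs = Counter(abs(ids_a[s] - ids_b[s]) for s in common)
    total = sum(diffs.values())
    return sum((c / total) * log2(total / c) for c in diffs.values())


if __name__ == "__main__":
    print(position_difference_entropy("ACGTAC", "ACGTAC"))
    print(position_difference_entropy("ACGT", "TACG"))
```

Under these assumptions, two identical sequences have every position difference equal to zero and therefore zero entropy, while rearrangements spread the differences over more values and raise the entropy.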