#### 期刊菜单

Comparison of DNA Sequences Based on Prefix Identifiers and Their Locations

Abstract: Comparison of molecular sequence is the most basic and important problem in bioinformatics. DNA sequence similarity analysis is an important research topic. Alignment-free method is one of the methods to study sequence comparison. It overcomes the limitation of alignment method and is faster than alignment method. In this paper, from the point of view of prefix identifier location, the alignment-free method of sequence analysis is proposed by using information entropy. Based on the prefix tree and the prefix identifier of biological sequences, the position information of pairwise sequences is extracted by using the common prefix identifiers of pairwise sequences. The absolute value of their position difference is regarded as random variable. Using information entropy, a new DNA sequence similarity measurement method is proposed and an effective model is established. Mitochondrial DNA sequences of 70 mammalian were used as experimental data sets. Construct the Phylogenetic tree based on the similarity distance obtained by the model. The classification results of Phylogenetic tree conform to the current biological classification.

1. 引言

1973年Peter Weiner提出了前缀树模型及其算法 [2]。前缀树模型的提出启发很多在生物信息学领域的学者们，为非比对方法提供了新的研究手段。部分文献基于前缀树，利用序列最长公共子串，提出序列相似性度量模型 [3]。有文献通过讨论k词在序列中的位置分布，来研究序列相似性 [4]。本文将提出一种新的以DNA序列前缀标识符为研究对象的非比对方法。

2. 材料

3. 方法

3.1. 前缀标识符

3.1.1. 线形序列前缀标识符定义

3.1.2. 环形序列前缀标识符定义

3.2. 信息熵理论

3.2.1. 信息熵的定义

$\left[\begin{array}{c}X\\ {P}_{X}\end{array}\right]=\left[\begin{array}{cccc}\begin{array}{l}{x}_{1}\\ {p}_{1}\end{array}& \begin{array}{l}{x}_{2}\\ {p}_{2}\end{array}& \begin{array}{l}\cdots \\ \cdots \end{array}& \begin{array}{l}{x}_{n}\\ {p}_{n}\end{array}\end{array}\right],0\le {p}_{i}\le 1,\underset{i=1}{\overset{n}{\sum }}{p}_{i}=1$ (1)

$H\left(X\right)=-\underset{i=1}{\overset{n}{\sum }}{p}_{i}\mathrm{log}{p}_{i}$ (2)

3.2.2. 信息熵熵的性质渐化性

$H\left(X\right)=H\left({p}_{1},{p}_{2},\cdots ,{p}_{n}\right)=H\left({p}_{1}+{p}_{2},{p}_{3},\cdots ,{p}_{n}\right)+\left({p}_{1}+{p}_{2}\right)H\left(\frac{{p}_{1}}{{p}_{1}+{p}_{2}},\frac{{p}_{2}}{{p}_{1}+{p}_{2}}\right)$ (3)

$0\le p\left({x}_{i}\right)\le 1,k=1,2,\cdots ,n,\underset{i=1}{\overset{n}{\sum }}p\left({x}_{n}\right)=1,p\left({x}_{1}\right)+p\left({x}_{2}\right)>0$

3.3. 相似性度量

4. 模型

(1) 对于任意两条序列S1、S2，分别查找所有位置处的前缀标识符，得到前缀标识符集合，分别记为 $I\left(S1\right)$$I\left(S2\right)$

(2) 根据前缀标识符集合 $I\left(S1\right)$$I\left(S2\right)$，得到序列S1、S2共同前缀标识符，记为 $e\left(i\right)$，并记录共同前缀标识符的个数，记为m。其中，

$m=|I\left(S1\right)\cap I\left(S2\right)|,e\left(i\right)\in I\left(S1\right)\cap I\left(S2\right),i=1,2,\cdots ,m$ (4)

(1) 查找共同前缀标识符 $e\left(i\right)$ 在S1和S2中位置，分别记为 $po{s}^{1}\left(i\right)$$po{s}^{2}\left(i\right)$，其中 $i=1,2,\cdots ,m$

(2) 计算出共同前缀标识符 $e\left(i\right)$ 对应的位置差绝对值，记为 $pos_dis\left(i\right)$

$pos_dis\left(i\right)=|po{s}^{1}\left(i\right)-po{s}^{2}\left(i\right)|,i=1,2,\cdots ,m$ (5)

$POS_DIS=\left\{pos_dis\left(i\right)||pos_dis\left(i\right)=po{s}^{1}\left(i\right)-po{s}^{2}\left(i\right)|,i=1,2,\cdots ,m\right\}$ (6)

(1) 根据位置差绝对值对应的多重集 $POS_DIS$，找到位置差绝对值出现的所有情况，分别记为 $u_pos_dis\left(1\right),\cdots ,u_pos_dis\left(h\right)$，则共有h中情况；

(2) 统计各类位置差绝对值出现的次数，记为 $n\left(1\right),n\left(2\right),\cdots ,n\left(t\right)$

$n\left(1\right)+\cdots +n\left(h\right)=m$ (7)

(3) 计算各类位置差绝对值出现的频率，记为 $f\left(t\right)$，则

$f\left(t\right)=\frac{n\left(t\right)}{m},t=1,2,\cdots ,h$ (8)

(4) 信源的概率空间为

$\left[\begin{array}{c}X\\ {P}_{X}\end{array}\right]=\left[\begin{array}{cccc}\begin{array}{c}u_pos_dis\left(1\right)\\ f\left(1\right)\end{array}& \begin{array}{c}u_pos_dis\left(2\right)\\ f\left(2\right)\end{array}& \begin{array}{l}\cdots \\ \cdots \end{array}& \begin{array}{c}u_pos_dis\left(h\right)\\ f\left(h\right)\end{array}\end{array}\right]$ (9)

(5) 两序列的距离定义为

$dis\left(S1,S2\right)=\underset{t=1}{\overset{h}{\sum }}-f\left(t\right){\mathrm{log}}_{2}f\left(t\right)$ (10)

Figure 1. A phylogenetic tree of 70 mammalian mitochondrial DNA sequences by Methods of this article

Figure 2. A phylogenetic tree of 70 mammalian mitochondrial genomes by The DFT distance of DNA sequences with the 2D numerical mapping [5]

5. 结果讨论与分析

6. 总结

Table 1. Mitochondria DNA sequence information of 70 mammalian genomes

 [1] 李霞, 雷建波, 等. 生物信息学[M]. 第2版. 北京: 人民卫生出版社, 2015: 1-8. [2] Weiner, P. (1973) Linear Pattern Matching Algorithms. 14th Annual Symposium on Switching and Automata Theory (Swat 1973). USA, 15-17 October 1973, 1-11. https://doi.org/10.1109/SWAT.1973.13 [3] Leimeister, C.-A. and Morgenstern, B. (2014) Kmacs: The k-Mismatch Average Common Substring Approach to Alignment-Free Sequence Comparison. Bioinformatics, 30, 2000-2008. https://doi.org/10.1093/bioinformatics/btu331 [4] Amiri, S. and Dinov, I.D. (2016) Comparison of Genomic Data via Statistical Distribution. Journal of Theoretical Biology, 407, 318-327. https://doi.org/10.1016/j.jtbi.2016.07.032 [5] Yin, C.C. and Yau, S.S.-T. (2015) An Improved Model for Whole Genome Phylogenetic Analysis by Fourier Transform. Journal of Theoretical Biology, 382, 99-110. https://doi.org/10.1016/j.jtbi.2015.06.033 [6] Vinga, S. (2013) Information Theory Applications for Biological Sequence Analysis. Bioinformatics, 15, 1-14. [7] Singh, K., Kumar, A. and Gupta, M.K. (2020) Modified k-String in Composition Vector Method for DNA Sequence Comparison Based on Maximum Entropy Principle. Journal of Interdisciplinary Mathematics, 23, 31-41. https://doi.org/10.1080/09720502.2020.1721649 [8] Pinello, L., Lo Bosco, G. and Yuan, G.-C. (2013) Applications of Align-ment-Free Methods in Epigenomics. Bioinformatics, 15, 1-12. [9] 詹青. 基于信息熵理论的基因组特性研究[D]: [硕士学位论文]. 哈尔滨: 哈尔滨工业大学, 2011. [10] 吕峰, 王虹. 信息理论与编码[M]. 第2版. 北京: 人民邮电出版社, 2010: 20-100. [11] Zurano, J.P., Magalhães, F.M., et al. (2019) Cetartiodactyla: Updating a Time-Calibrated Molecular Phylogeny. Molecu-lar Phylogenetics and Evolution, 133, 256-262. https://doi.org/10.1016/j.ympev.2018.12.015