Similar Fragment Queries Based on Substitution Errors
DOI: 10.12677/CSA.2020.105100
Authors: Zhang Fan*, Xie Yuqi, Rao Chen, Wang Mingchun: College of Information and Intelligence, Hunan Agricultural University, Changsha, Hunan
Keywords: Similar Pieces; Hamming Distance; Threshold Value; Locating
Abstract: The key to deciphering an unknown language is finding similar sequences of letter fragments. This paper presents a new algorithm for finding similar fragments. First, an index structure is built, and the text is partitioned at intervals multiple times to obtain fragments. Next, a similarity formula and a similarity matrix based on the Hamming distance are defined to express the similarity between two fragments. To reflect practice, where substitution errors occur when large volumes of text are recorded, a similarity threshold formula is established and used to decide whether a candidate is one of the similar fragments being sought. The algorithm outputs the similar fragments found across multiple texts together with their positions. The algorithm is evaluated by average accuracy, and analysis and experiments show that it achieves high accuracy and search efficiency.
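The general idea of thresholded Hamming-distance matching can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not give the index structure, the similarity formula, or the threshold formula, so the sketch assumes a brute-force sliding window in place of the index, a similarity of 1 - d/L (d is the Hamming distance, L the fragment length), and a caller-supplied threshold. The function names `hamming_distance`, `similarity`, and `find_similar_fragments` are hypothetical.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("fragments must have equal length")
    return sum(x != y for x, y in zip(a, b))


def similarity(a: str, b: str) -> float:
    """Assumed similarity: fraction of matching positions, 1 - d / L."""
    return 1.0 - hamming_distance(a, b) / len(a)


def find_similar_fragments(texts, length, threshold):
    """Return a list of ((text_idx, pos, fragment), (text_idx, pos, fragment), score)
    for every pair of fragments taken from different texts whose assumed
    similarity is at least `threshold`."""
    # Brute-force stand-in for an index: every window of the given length,
    # tagged with the text it came from and its start position.
    windows = [
        (t_idx, pos, text[pos:pos + length])
        for t_idx, text in enumerate(texts)
        for pos in range(len(text) - length + 1)
    ]
    hits = []
    for i, (ti, pi, fi) in enumerate(windows):
        for tj, pj, fj in windows[i + 1:]:
            if ti == tj:
                continue  # only compare fragments across different texts
            score = similarity(fi, fj)
            if score >= threshold:
                hits.append(((ti, pi, fi), (tj, pj, fj), score))
    return hits


if __name__ == "__main__":
    # Two toy texts; the second contains "abcxef", one substitution away from "abcdef".
    for hit in find_similar_fragments(["abcdefgh", "xxabcxefyy"], length=6, threshold=0.8):
        print(hit)
```

The pairwise comparison is quadratic in the number of windows, which is precisely the cost an index structure is meant to avoid; the sketch only shows how a substitution-tolerant threshold separates similar fragments from unrelated ones.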
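Cite this article: Zhang Fan, Xie Yuqi, Rao Chen, Wang Mingchun. Similar Fragment Queries Based on Substitution Errors [J]. Computer Science and Application, 2020, 10(5): 971-977. https://doi.org/10.12677/CSA.2020.105100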
