Chinese Hidden-Head Poem Sensitive Word Detection Algorithm Based on BiLSTM-CRF

doi:10.12677/SEA.2023.126089

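Authors: He Yanan (何亚楠), You Fucheng (游福成): School of Information Engineering, Beijing Institute of Graphic Communication, Beijing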
Keywords: Acrostic Poetry; Sensitive Word Detection; BiLSTM-CRF

Abstract: In the era of digitization and social media, acrostic poetry combines cultural heritage with modern expression, and monitoring its content has become a challenge for internet platform management. Because the first characters of the lines, read in sequence, can spell out a specific meaning, the form can be used to hide sensitive information. On social media and instant messaging platforms in particular, users may use acrostic poems to evade sensitive-word filters. This study proposes a sensitive word detection algorithm for acrostic poetry based on a bidirectional long short-term memory network with a conditional random field layer (BiLSTM-CRF). The algorithm first uses word embedding to represent the text as high-dimensional vectors. It then uses the BiLSTM model to capture the forward and backward context of the poem and the dependencies between acrostic characters across lines. Finally, the CRF model outputs the label sequence according to the dependencies between labels. We tested the algorithm on different types of acrostic poetry datasets, and the results show that it identifies sensitive words effectively, with high accuracy and recall. The algorithm is valuable for monitoring automatically generated text content, particularly for preserving cultural heritage and complying with internet regulations.
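
The final step of the pipeline described in the abstract, where the CRF layer chooses a label sequence using dependencies between labels, can be illustrated with a small Viterbi decoder. This is only a sketch under stated assumptions, not the authors' implementation: the BIO tag set, the hand-set transition scores, the emission scores (which stand in for BiLSTM outputs), and the sample poem are all hypothetical.

```python
from typing import List, Sequence

# Assumed tag set: O = outside, B = begins a hidden word, I = continues it.
TAGS = ["O", "B", "I"]


def head_characters(poem: str) -> List[str]:
    """Return the first character of every non-empty line of the poem."""
    return [line.strip()[0] for line in poem.splitlines() if line.strip()]


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring tag path.

    emissions[t][j]   : score of tag j at position t (a BiLSTM would supply these)
    transitions[i][j] : score of moving from tag i to tag j (CRF parameters)
    """
    if not emissions:
        return []
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for ending at tag j in position t.
            best_prev = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_prev] + transitions[best_prev][j] + emissions[t][j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Walk the backpointers from the best final tag to recover the path.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Hypothetical four-line poem; only the head characters are tagged here.
poem = "春风拂面来\n夏雨落池塘\n秋月照高楼\n冬雪满山岗"
heads = head_characters(poem)  # ['春', '夏', '秋', '冬']

# Hand-set transition scores (rows: from O, B, I; columns: to O, B, I).
# The large negative value makes the illegal move O -> I effectively impossible.
transitions = [
    [0.0, 0.0, -1e4],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.5],
]
# Hypothetical per-character scores standing in for BiLSTM outputs.
emissions = [
    [2.0, 0.1, 0.0],
    [0.5, 1.0, 1.2],
    [0.2, 0.1, 1.5],
    [1.8, 0.0, 0.3],
]
tags = [TAGS[i] for i in viterbi_decode(emissions, transitions)]
# tags == ['O', 'B', 'I', 'O']
```

Picking the highest-scoring tag at each position independently would give O, I, I, O, which starts a word with I. The transition scores steer the decoder to the consistent sequence O, B, I, O instead, which is the kind of label dependency the CRF layer is meant to capture.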

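Citation: He, Y.N. and You, F.C. (2023) Chinese Hidden-Head Poem Sensitive Word Detection Algorithm Based on BiLSTM-CRF. Software Engineering and Applications, 12(6), 915-921. https://doi.org/10.12677/SEA.2023.126089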

