Chinese Computer Entity Recognition Based on BERT
Abstract: Chinese named entity recognition suffers from incomplete text feature learning, fuzzy entity boundaries, and inaccurate recognition of newly emerging entities when Chinese characters, English letters, and digits are mixed in the text. To address these problems, this paper proposes a dictionary-based entity recognition method. First, the data are preprocessed with a dictionary to reduce the impact of mixed Chinese, English, and numeric symbols on entity recognition. Second, the BERT model is used to obtain text features, which are fed into a bidirectional long short-term memory (BiLSTM) network for training. Third, a conditional random field (CRF) decodes the output into a tag sequence, from which the corresponding entities are obtained. The model achieves F1 scores of 95.10%, 95.09%, and 99.45% on the People's Daily corpus, the MSRA corpus, and a computer-domain corpus, respectively. The experimental results show that the proposed method effectively improves named entity recognition.
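The pipeline in the abstract (dictionary preprocessing, feature scoring, CRF decoding, entity extraction) can be illustrated with a minimal, dependency-free sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the forward-maximum-matching tokenizer, the BIO tagging scheme, the `COM` entity label, the toy lexicon, and the hand-written emission scores, which merely stand in for the output of a BERT plus BiLSTM encoder.

```python
from typing import Dict, List, Set, Tuple

def dictionary_tokenize(text: str, lexicon: Set[str]) -> List[str]:
    """Forward maximum matching: keep lexicon terms whole, split the rest into characters."""
    max_len = max([len(term) for term in lexicon] + [1])
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[Tuple[str, str], float],
                   tags: List[str]) -> List[str]:
    """Return the highest-scoring tag path (sum of emission and transition scores)."""
    if not emissions:
        return []
    best = [{tag: emissions[0][tag] for tag in tags}]  # best score ending in each tag
    back: List[Dict[str, str]] = []                    # back-pointers per position
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, entity_type) pairs from a BIO tag sequence."""
    entities, buf, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [tok], tag[2:]
        elif tag.startswith("I-") and buf and tag[2:] == etype:
            buf.append(tok)
        else:
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [], None
    if buf:
        entities.append(("".join(buf), etype))
    return entities

# Toy data (hypothetical): a two-term lexicon and hand-written emission scores.
TAGS = ["O", "B-COM", "I-COM"]
lexicon = {"Python3", "MySQL"}
tokens = dictionary_tokenize("安装Python3和MySQL", lexicon)
emissions = [
    {"O": 2.0, "B-COM": 0.1, "I-COM": 0.0},
    {"O": 2.0, "B-COM": 0.2, "I-COM": 0.1},
    {"O": 0.3, "B-COM": 2.0, "I-COM": 2.5},  # emission alone would prefer I-COM
    {"O": 2.0, "B-COM": 0.1, "I-COM": 0.2},
    {"O": 0.2, "B-COM": 2.2, "I-COM": 2.4},  # emission alone would prefer I-COM
]
# Penalize the invalid BIO transition O -> I-COM; all other transitions score 0.
transitions = {("O", "I-COM"): -10.0}
path = viterbi_decode(emissions, transitions, TAGS)
entities = extract_entities(tokens, path)
print(tokens, path, entities)
```

In this toy run the dictionary keeps `Python3` and `MySQL` as single tokens despite the mixed scripts, and the transition penalty steers the decoder away from an `I-COM` tag that directly follows `O`, which is the kind of sequence-level constraint a CRF layer adds on top of per-token scores.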
Cite this article: Wang Junxian, Wu Guoning. Chinese Computer Entity Recognition Based on BERT [J]. Computer Science and Application, 2022, 12(11): 2512-2525. https://doi.org/10.12677/CSA.2022.1211257
