A Study of the Probability Distribution and Grammatical Collocation Patterns of Multi-Category Words in Chinese Information Processing
DOI: 10.12677/ML.2021.92072. Supported by national science and technology funding.
Authors: 王浩学 (Wang Haoxue), 徐艳华* (Xu Yanhua): College of Liberal Arts, Ludong University, Yantai, Shandong
Keywords: Multi-Category Words; Grammatical Collocation; Corpus Application; Part-of-Speech Tagging
Abstract: Multi-category words, that is, words that belong to more than one part of speech, are the main factor limiting the accuracy of part-of-speech tagging in Chinese. This study takes as its test set 78 adjective-noun multi-category words whose part-of-speech labels agree across three dictionaries. Using a tagging method that combines rules with statistics, it pairs the measured tag distribution probabilities of these words with grammatical collocation rules and builds a rule base from their grammatical collocation patterns. The rule base is then used to correct the tags assigned to these multi-category words in the Modern Chinese General Balanced Corpus of the State Language Commission, which can raise accuracy by 14.57%.
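The combination of collocation rules and tag probabilities described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' rule base: the tag labels (`a` adjective, `n` noun, `d` adverb, `q` classifier, `m` numeral), the two rules, the probabilities in `TAG_PRIOR`, and the function name `retag` are all hypothetical and chosen only for illustration. The sketch also assumes that a matching rule overrides the existing tag and that, when no rule matches, the most probable tag is used as a fallback.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Hypothetical tag distribution for one adjective-noun multi-category word.
# The numbers are placeholders, not figures from the paper.
TAG_PRIOR: Dict[str, Dict[str, float]] = {
    "困难": {"a": 0.6, "n": 0.4},
}

# Hypothetical collocation rules: (left tag, right tag, resulting tag).
# None means "any context on that side".
RULES: List[Tuple[Optional[str], Optional[str], str]] = [
    ("d", None, "a"),  # adverb + word       -> adjective
    ("q", None, "n"),  # classifier + word   -> noun
]


def retag(sentence: List[Token]) -> List[Token]:
    """Re-tag multi-category words using rules first, probabilities second."""
    result: List[Token] = []
    for i, (word, tag) in enumerate(sentence):
        if word not in TAG_PRIOR:
            # Not a known multi-category word: keep the existing tag.
            result.append((word, tag))
            continue
        left = sentence[i - 1][1] if i > 0 else None
        right = sentence[i + 1][1] if i + 1 < len(sentence) else None
        new_tag: Optional[str] = None
        for rule_left, rule_right, rule_tag in RULES:
            if rule_left is not None and rule_left != left:
                continue
            if rule_right is not None and rule_right != right:
                continue
            new_tag = rule_tag
            break
        if new_tag is None:
            # No rule matched: fall back to the most probable tag.
            prior = TAG_PRIOR[word]
            new_tag = max(prior, key=prior.get)
        result.append((word, new_tag))
    return result


if __name__ == "__main__":
    print(retag([("很", "d"), ("困难", "n")]))
    print(retag([("一", "m"), ("个", "q"), ("困难", "a")]))
```

In this sketch the rules take priority because they encode local grammatical context, while the probabilities only decide cases the rules do not cover; other orderings are equally possible.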
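Cite this article: Wang Haoxue, Xu Yanhua. 兼类词概率分布计量考察及语法搭配模式在中文信息处理中的应用 [A Study of the Probability Distribution and Grammatical Collocation Patterns of Multi-Category Words in Chinese Information Processing] [J]. 现代语言学 (Modern Linguistics), 2021, 9(2): 524-529. https://doi.org/10.12677/ML.2021.92072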
