A Quantitative Study of the Probability Distribution of Multi-Category Words and the Application of Grammatical Collocation Patterns in Chinese Information Processing
Abstract:
Multi-category words, that is, words that belong to more than one part of speech, are the main factor limiting the accuracy of part-of-speech tagging in Chinese. This study takes as its test objects 78 adjective-noun multi-category words whose part-of-speech labels are consistent across three dictionaries. Using a tagging method that combines rules with statistics, it joins the statistically derived distribution probabilities of these words with grammatical collocation rules and builds a rule base from their grammatical collocation patterns. Applying the rule base to correct the tags assigned to these words in the State Language Commission's Modern Chinese General Balanced Corpus raises tagging accuracy by 14.57%.
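The combination of collocation rules and distribution probabilities described in the abstract can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the tag labels (`a`, `n`, `d`, `q`), the two rules, the two example words, and the probability values are hypothetical and are not taken from the study's rule base or corpus counts. The sketch lets a collocation rule decide when one fires and otherwise falls back to the most probable tag for the word.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Hypothetical P(tag | word) for two adjective-noun words.
# The values are illustrative only, not figures from the study.
TAG_PRIORS: Dict[str, Dict[str, float]] = {
    "困难": {"a": 0.6, "n": 0.4},
    "危险": {"a": 0.7, "n": 0.3},
}


def rule_tag(prev_tag: Optional[str]) -> Optional[str]:
    """Hypothetical collocation rules keyed on the tag of the preceding word."""
    if prev_tag == "d":  # adverb + word, e.g. 很 + 困难: read as adjective
        return "a"
    if prev_tag == "q":  # measure word + word, e.g. 种 + 困难: read as noun
        return "n"
    return None  # no rule applies


def retag(sentence: List[Token]) -> List[Token]:
    """Correct the tags of known multi-category words in a tagged sentence."""
    corrected: List[Token] = []
    for i, (word, tag) in enumerate(sentence):
        if word in TAG_PRIORS:
            prev_tag = sentence[i - 1][1] if i > 0 else None
            decided = rule_tag(prev_tag)
            if decided is None:
                # Statistical fallback: pick the most probable tag for this word.
                priors = TAG_PRIORS[word]
                decided = max(priors, key=priors.get)
            tag = decided
        corrected.append((word, tag))
    return corrected
```

For example, `retag([("很", "d"), ("困难", "n")])` changes the tag of 困难 to `a` because the adverb rule fires, while a word outside `TAG_PRIORS` keeps whatever tag it already had. Trying rules before probabilities is one possible ordering; the abstract says only that the two sources are combined.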