A CRF-Based Approach for Automatic Identification of Chinese Subject-Predicate Phrases
Abstract: The subject-predicate structure is the core syntactic skeleton of Chinese sentences, and identifying it accurately is a prerequisite for downstream natural language processing (NLP) tasks such as semantic parsing and information extraction. To address the difficulty that the structural complexity of Chinese poses for recognizing subject-predicate phrases, this paper proposes an automatic identification method based on Conditional Random Fields (CRF), with the aim of improving the accuracy and reliability of recognition. Using the Tsinghua Chinese Treebank (TCT) as the corpus, we constructed a dataset of 39,595 annotated samples. During preprocessing, a custom conversion function resolves encoding problems in the raw corpus, regular expressions locate the boundaries of subject-predicate structures, and every token is annotated under the “Begin (B)-Inside (I)-Other (O)” labeling scheme. The data are then formatted as “word-part-of-speech-label” triplets, the input form required for CRF training. For feature engineering, we designed a word2features function that extracts the word form and its variants, the part-of-speech tag, the context words within a window of ±2 tokens together with their combination features, sentence-boundary markers (BOS/EOS), and affix information. The CRF model was implemented with the sklearn-crfsuite library and optimized with the L-BFGS algorithm, with regularization coefficients C1 = 0.2 and C2 = 0.1 and a maximum of 200 iterations. The dataset was split 9:1 into a training set of 35,635 samples and a test set of 3,960 samples. On the test set the model reaches a weighted F1-score of 0.7459, with a weighted precision of 0.7675 and a weighted recall of 0.7257; the F1-score is 0.7541 for the I label and 0.6739 for the B label. The model therefore recognizes the internal components of subject-predicate structures better than their starting boundaries, and both boundary detection and the handling of long-distance dependencies leave room for improvement. The study finds that part-of-speech features and combined contextual features improve model performance, and it offers a reference for the automatic recognition of Chinese syntactic structures.
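The labeling, feature extraction, and 9:1 split described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the paper's code: the body of `word2features`, the feature names, the helper functions `bio_labels` and `split_sizes`, the span representation, and the part-of-speech tags in the sample sentence are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def bio_labels(n_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Assign B/I/O labels from half-open [start, end) subject-predicate spans.

    The span format is an assumption; the paper derives boundaries with
    regular expressions over the treebank annotation.
    """
    labels = ["O"] * n_tokens
    for start, end in spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels


def word2features(sent: List[Token], i: int) -> Dict[str, object]:
    """Build a feature dict for token i (hypothetical feature names)."""
    word, pos = sent[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "word": word,
        "pos": pos,
        "prefix1": word[:1],   # first character as a rough prefix feature
        "suffix1": word[-1:],  # last character as a rough suffix feature
        "word_len": len(word),
    }
    # Context words and tags within a window of +/-2 tokens.
    for off in (-2, -1, 1, 2):
        j = i + off
        if 0 <= j < len(sent):
            ctx_word, ctx_pos = sent[j]
            feats[f"{off:+d}:word"] = ctx_word
            feats[f"{off:+d}:pos"] = ctx_pos
    # Combination features with the neighbours, or sentence-boundary markers.
    if i > 0:
        feats["-1:pos|pos"] = sent[i - 1][1] + "|" + pos
    else:
        feats["BOS"] = True
    if i < len(sent) - 1:
        feats["pos|+1:pos"] = pos + "|" + sent[i + 1][1]
    else:
        feats["EOS"] = True
    return feats


def split_sizes(n_total: int) -> Tuple[int, int]:
    """Sizes of a 9:1 train/test split, rounding the training part down."""
    n_train = n_total * 9 // 10
    return n_train, n_total - n_train


# Illustrative sentence; the tags "r", "v", "n" are placeholders, not TCT tags.
sentence: List[Token] = [("他", "r"), ("喜欢", "v"), ("音乐", "n")]
features = [word2features(sentence, i) for i in range(len(sentence))]
labels = bio_labels(len(sentence), [(0, 3)])
train_size, test_size = split_sizes(39595)
```

A list of such feature dicts per sentence, paired with the label sequence, is the input format that sklearn-crfsuite accepts; the reported settings would correspond to `sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.2, c2=0.1, max_iterations=200)`. Rounding the training portion down reproduces the 35,635 / 3,960 split stated in the abstract.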
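Cite this article: He Yifan (何一凡). A CRF-Based Approach for Automatic Identification of Chinese Subject-Predicate Phrases [J]. Modern Linguistics (现代语言学), 2025, 13(12): 349-356. https://doi.org/10.12677/ml.2025.13121271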
