Unsupervised Document Clustering Model Based on Topic Model and Graph Neural Network
DOI: 10.12677/CSA.2022.127180. Supported by research project funding.
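Authors: Zhang Chuyang (张出阳), Chai Bianfang (柴变芳): School of Information Engineering, Hebei GEO University, Shijiazhuang, Hebei; Zhao Xiaopeng (赵晓鹏): Information Center, Department of Finance of Hebei Province, Shijiazhuang, Hebei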
Keywords: Document Clustering, Topic Discovery, Graph Neural Networks, Word Embeddings
Abstract: TextING (Inductive Text classification via GNN) is a popular graph-neural-network method for text classification. It builds a word co-occurrence graph for each document, learns document representations over all document word graphs with GCN (Graph Convolutional Networks), and then trains a document classifier in a supervised manner. However, the method requires a large number of document category labels, and document representations based on per-document word graphs cannot fully capture the global features of the whole document collection. To address this, an unsupervised text classification model is proposed. The model first uses the ETM (Embedded Topic Model) topic discovery model to learn document representations that contain global word features, then applies K-means clustering to the document-topic representations learned by ETM to obtain pseudo-labels for the documents, and finally uses these pseudo-labels to train a TextING document classification model. Results on real document datasets show that the method is more accurate than mainstream unsupervised document clustering methods.
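The pseudo-labeling step described in the abstract can be illustrated with a minimal sketch. It assumes the document-topic vectors have already been produced by a topic model such as ETM; the toy values, the function name `kmeans_pseudo_labels`, and the deterministic farthest-point initialization are all assumptions made for illustration and are not taken from the paper. The resulting labels would then stand in for ground-truth classes when training a classifier such as TextING, which is not shown here.

```python
import math


def kmeans_pseudo_labels(doc_topic, k, n_iter=50):
    """Cluster document-topic vectors with plain K-means.

    Returns one pseudo-label (cluster index) per document.
    Initialization is farthest-point (an assumption, chosen so the
    sketch is deterministic): the first centroid is the first document,
    each further centroid is the document farthest from those chosen.
    """
    centroids = [list(doc_topic[0])]
    while len(centroids) < k:
        farthest = max(
            doc_topic,
            key=lambda v: min(math.dist(v, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = None
    for _ in range(n_iter):
        # Assignment step: each document goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(vec, centroids[c]))
            for vec in doc_topic
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(doc_topic, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical document-topic proportions for six documents and three topics
# (in the described pipeline these would come from the topic model).
doc_topic = [
    [0.90, 0.05, 0.05],
    [0.80, 0.15, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.10, 0.85],
    [0.10, 0.05, 0.85],
    [0.05, 0.15, 0.80],
]

pseudo_labels = kmeans_pseudo_labels(doc_topic, k=2)
print(pseudo_labels)
```

In this toy input the first three documents are dominated by one topic and the last three by another, so the two groups receive different pseudo-labels. In practice a library implementation of K-means with multiple restarts would likely be preferable to this hand-written version.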
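Cite this article: Zhang, C., Zhao, X. and Chai, B. Unsupervised Document Clustering Model Based on Topic Model and Graph Neural Network [J]. Computer Science and Application, 2022, 12(7): 1795-1800. https://doi.org/10.12677/CSA.2022.127180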
