Document Embedding Dimension Homeomorphism: A Case Study of Government Work Reports
DOI: 10.12677/CSA.2020.106124. Supported by the National Natural Science Foundation of China.
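Authors: Xie Hualun, Liang Xun: School of Information, Renmin University of China, Beijing; Li Mengdie: Kaifu District Government Affairs Service Center, Changsha, Hunan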
Keywords: Government Work Report; Document Embedding; Homeomorphism; Dimension
Abstract: Intelligent analysis of large collections of government work reports can quickly and thoroughly reveal how their underlying factors relate to one another, and can support decision makers in reaching sound judgments. This paper analyzes government work reports issued at the county and district level and above across 31 provinces and municipalities of China, covering 18 years after 2000. It first proposes a framework for selecting the optimal embedding dimension, based on homeomorphic document embedding spaces. On this basis, it uses the optimal-dimension document embedding model to study how separable and how similar the documents are by province. Finally, it reports experimental results and analysis of the clustering separability and similarity differences among the reports. The document vectors obtained at the optimal dimension effectively partition the reports into provincial subspaces, and the differences in similarity across the provincial time series of local government work reports highlight gaps in politics, economy, education, culture, and other areas. The experiments also find that, when retrieving sets of similar documents, the Euclidean distance between regularized (that is, normalized) government document vectors is equivalent to the traditional cosine distance. The proposed framework offers reference and application value for building smart government services in China, and it can also support deep information mining, report reinterpretation, and intelligent decision making in multi-class document tasks such as listed-company announcements.
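The equivalence between Euclidean and cosine distance reported in the abstract is consistent with a standard identity: for unit-length vectors a and b, ||a - b||^2 = 2(1 - cos(a, b)), so both distances order neighbors identically. The sketch below illustrates this under the assumption that "regularized" means L2-normalized. It uses random stand-in vectors rather than the paper's document embeddings, and the helper names (`l2_normalize`, `rank_by`) are illustrative only, not taken from the paper.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    """1 - cosine similarity, computed on the raw (unnormalized) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Ordinary Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by(distance, query, docs):
    """Return document indices sorted from nearest to farthest."""
    return sorted(range(len(docs)), key=lambda i: distance(query, docs[i]))


# Random stand-ins for document vectors (assumption: not the paper's data).
random.seed(0)
DIM = 50
docs = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(200)]
query = [random.gauss(0, 1) for _ in range(DIM)]

unit_docs = [l2_normalize(d) for d in docs]
unit_query = l2_normalize(query)

# Ranking by cosine distance on raw vectors ...
cosine_ranking = rank_by(cosine_distance, query, docs)
# ... versus ranking by Euclidean distance on normalized vectors.
euclid_ranking = rank_by(euclidean_distance, unit_query, unit_docs)

# Largest deviation from the identity ||a - b||^2 = 2 * (1 - cos(a, b)).
max_identity_error = max(
    abs(euclidean_distance(unit_query, u) ** 2 - 2.0 * cosine_distance(query, d))
    for u, d in zip(unit_docs, docs)
)

print("same ranking:", cosine_ranking == euclid_ranking)
print("max identity error:", max_identity_error)
```

Because the squared Euclidean distance on unit vectors is a monotone function of the cosine distance, any nearest-neighbor query returns the same similar-document set under either measure. The absolute distance values still differ.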
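Citation: Xie, H.L., Liang, X. and Li, M.D. (2020) Document Embedding Dimension Homeomorphism: A Case Study of Government Work Reports. Computer Science and Application, 10(6), 1194-1208. https://doi.org/10.12677/CSA.2020.106124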
