Google Books Corpus Quality Method Based on Markov Dynamic Coding
DOI: 10.12677/CSA.2023.134073
Author: Song Yuling, College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou, Zhejiang
Keywords: Google Books Corpus; Markov Model; Time Series; Anomaly Detection
Abstract: Corpora are central to natural language processing tasks. The Google Books corpus is the largest diachronic corpus to date and is widely used to study, across time and space, how disciplines, languages, and even cultures develop within society. Many scholars have nevertheless questioned it because of recognition errors and metadata problems introduced during its construction. Current remedies mainly extract all possible data from the corpus and preprocess the original data, which is time-consuming and laborious. This paper recasts the corpus noise problem as a time series anomaly detection problem and addresses it with a traditional time series model and with Markov dynamic coding. Experimental results show that the Markov encoding preserves temporal correlation and frequency structure, and that it also provides a natural inverse operation that maps the graph back to a time series. It thereby overcomes the shortcomings of the traditional time series model and effectively solves the local quality alignment problem of the corpus.
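The idea of treating corpus noise as time series anomaly detection can be illustrated with a small sketch. The code below is not the paper's implementation. It assumes only the general recipe behind Markov encodings of time series: assign each observation to a quantile bin, estimate a first-order transition matrix between consecutive bins, and flag transitions whose estimated probability is low. The function names, the number of bins, and the threshold are all hypothetical choices made for illustration, and the toy series stands in for a yearly word-frequency curve.

```python
import numpy as np

def quantile_bins(series, n_bins=4):
    """Assign each observation to a quantile bin (state) in 0..n_bins-1."""
    series = np.asarray(series, dtype=float)
    # Interior quantile edges; np.digitize maps each value to a bin index.
    edges = np.quantile(series, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(series, edges)

def transition_matrix(states, n_bins):
    """Row-normalised counts of transitions between consecutive states."""
    counts = np.zeros((n_bins, n_bins))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no outgoing transitions are left as all zeros.
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)

def flag_rare_transitions(series, n_bins=4, threshold=0.2):
    """Indices t whose transition from t-1 has probability below threshold."""
    states = quantile_bins(series, n_bins)
    probs = transition_matrix(states, n_bins)
    step_probs = probs[states[:-1], states[1:]]
    return np.flatnonzero(step_probs < threshold) + 1

# Toy series (an assumption, not corpus data): a repeating pattern
# with one injected spike at index 21.
series = [1.0, 2.0, 3.0, 4.0] * 10
series[21] = 100.0
# Flags the transitions into and out of the spike (indices 21 and 22).
print(flag_rare_transitions(series, n_bins=4, threshold=0.2))
```

Quantile binning places any extreme value in the top bin, so this sketch detects unusual sequences of states rather than unusual magnitudes. A spike that happens to follow the usual state order would go unnoticed.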
Citation: Song, Y. (2023) Google Books Corpus Quality Method Based on Markov Dynamic Coding. Computer Science and Application, 13(4), 745-753. https://doi.org/10.12677/CSA.2023.134073
