Application Research of Text Mining in Commodity Reviews: Taking Tobacco Reviews as an Example
DOI: 10.12677/ASS.2018.712292
Author: Jia Chunguang (贾春光), School of Statistics and Mathematics, Yunnan University of Finance and Economics, Kunming, Yunnan
Keywords: Web Crawler, Text Mining, Sentiment Analysis, Tobacco
Abstract: With the rapid development and spread of Internet technology, many online platforms now let users comment on products. These comments directly reflect customers' attitudes toward a product's functions or performance, so text mining of product reviews is of considerable value. However, online review data is huge in volume, mostly semi-structured or unstructured, and contains many useless comments. How to collect a review corpus quickly, and which method to use to analyze it, therefore become the key research questions. This paper first uses Python to crawl a corpus of tobacco reviews and preprocesses it by converting between Simplified and Traditional Chinese, replacing misspelled characters, and removing useless comments. Then, after a preliminary split of the reviews into positive and negative sentiment, it computes consumers' sentiment scores for tobacco based on a sentiment dictionary, a degree-adverb dictionary, and a negation-word dictionary. The results show that sentiment scores for this product within China are relatively high, and that provinces along the Yangtze River score slightly higher than other regions.
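The dictionary-based scoring step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the three tiny lexicons, their weights, the function name `score_tokens`, and the rule that modifiers apply to the next sentiment word are all assumptions made for illustration, and the input is assumed to be already segmented into words.

```python
# Toy lexicons: illustrative assumptions, not the dictionaries used in the paper.
SENTIMENT = {"好": 1.0, "香": 1.0, "满意": 1.0, "差": -1.0, "呛": -1.0}
DEGREE = {"非常": 2.0, "很": 1.5, "有点": 0.5}   # degree adverbs scale intensity
NEGATION = {"不", "没", "没有"}                    # negation words flip polarity


def score_tokens(tokens):
    """Score one pre-segmented review.

    Each sentiment word contributes its polarity, multiplied by the degree
    adverbs and flipped by the negation words seen since the previous
    sentiment word. Tokens found in no lexicon are ignored.
    """
    total = 0.0
    weight = 1.0  # product of pending degree-adverb weights
    sign = 1.0    # flipped once per pending negation word
    for tok in tokens:
        if tok in NEGATION:
            sign = -sign
        elif tok in DEGREE:
            weight *= DEGREE[tok]
        elif tok in SENTIMENT:
            total += sign * weight * SENTIMENT[tok]
            weight, sign = 1.0, 1.0  # modifiers are consumed by this word
    return total


if __name__ == "__main__":
    print(score_tokens(["味道", "非常", "好"]))  # 2.0
    print(score_tokens(["不", "是", "很", "好"]))  # -1.5
```

In practice the reviews would first pass through a Chinese word segmenter and the preprocessing steps listed in the abstract. Real lexicons are also far larger, and they often treat the order of negation and degree words differently (for example "不很好" versus "很不好"), which this sketch does not distinguish.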
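Cite this article: Jia Chunguang. Application Research of Text Mining in Commodity Reviews: Taking Tobacco Reviews as an Example [J]. Advances in Social Sciences (社会科学前沿), 2018, 7(12): 1962-1973. https://doi.org/10.12677/ASS.2018.712292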