Application and Research of DBSCAN Optimization Algorithm in Big Data Analysis of Experimental Text
DOI: 10.12677/CSA.2020.105093. Supported by the National Natural Science Foundation of China.
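Authors: Tingting Shi (史婷婷), Longqin Xu (徐龙琴): College of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou, Guangdong; Weihua Liu (刘卫华)*: Guangdong Justice Police Vocational College, Guangzhou, Guangdong; Shuangyin Liu (刘双印): College of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou, Guangdong; Guangdong Provincial Agricultural Products Safety Big Data Engineering Technology Research Center, Zhongkai University of Agriculture and Engineering, Guangzhou, Guangdong; Guangdong Provincial Universities Smart Agriculture Engineering Technology Research Center, Zhongkai University of Agriculture and Engineering, Guangzhou, Guangdong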
Keywords: DBSCAN, Density Clustering, Text Clustering, Experimental Big Data Analysis
Abstract: Big data has become a research hotspot in computing in recent years, and clustering addresses many problems in the field, including data mining, machine learning, and text processing. Traditional DBSCAN requires its parameters to be set manually, and its running time does not scale to big-data applications. To address these problems, this paper proposes an optimized DBSCAN algorithm. A KD-tree speeds up the search for neighborhood objects and significantly reduces the running time. The density threshold (MinPts) is made adaptive by computing the mathematical expectation over all neighborhood objects. A text clustering pipeline is then designed in which the SD-TF-IDF algorithm optimizes the weights of the feature terms before the texts are clustered. Finally, the approach is applied to mining and analyzing a large corpus of university computer-lab experiment texts, with good results.
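The two DBSCAN optimizations named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the exact formula, so the code assumes one plausible reading, namely that MinPts is the mean size of the eps-neighborhoods (each point counted in its own neighborhood), rounded up. The function names `build_kdtree`, `radius_query`, and `adaptive_minpts` are hypothetical, and the KD-tree is a minimal pure-Python version written for illustration.

```python
import math
from statistics import mean


def build_kdtree(points, depth=0):
    """Build a KD-tree as nested tuples: (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # median point splits the space
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def radius_query(node, target, eps, found=None):
    """Collect every stored point within Euclidean distance eps of target."""
    if found is None:
        found = []
    if node is None:
        return found
    point, axis, left, right = node
    if math.dist(point, target) <= eps:
        found.append(point)
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    radius_query(near, target, eps, found)
    # The far side can only hold neighbors if the splitting plane
    # lies within eps of the target; otherwise it is pruned.
    if abs(diff) <= eps:
        radius_query(far, target, eps, found)
    return found


def adaptive_minpts(points, eps):
    """Assumed rule: MinPts = ceil(mean eps-neighborhood size)."""
    tree = build_kdtree(points)
    sizes = [len(radius_query(tree, p, eps)) for p in points]
    return math.ceil(mean(sizes)), sizes


# Toy data: a tight square of four points plus one distant outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
minpts, sizes = adaptive_minpts(points, eps=1.5)
print(minpts, sizes)
```

The pruning step in `radius_query` is where the speed-up comes from: subtrees whose splitting plane is farther than eps from the query point are skipped instead of being scanned point by point. Note that eps is still supplied by hand in this sketch; only MinPts is derived from the data.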
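Cite this article: Shi, T., Liu, W., Liu, S. and Xu, L. (2020) Application and Research of DBSCAN Optimization Algorithm in Big Data Analysis of Experimental Text. Computer Science and Application, 10(5), 906-913. https://doi.org/10.12677/CSA.2020.105093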
