Research on Dynamic Clustering Algorithm Based on Spark Framework
DOI: 10.12677/CSA.2016.611086
Authors: Botao Zhang*, Jianhua Li, Lei Fan: School of Information Security Engineering, Shanghai Jiao Tong University, Shanghai
Keywords: D-Stream, PDStream, Spark, Dynamic Clustering Algorithm
Abstract: In the era of big data, data volumes are growing rapidly and the demands on data processing keep rising. Many effective algorithms for data stream clustering have been proposed in recent years. However, as information collection technology advances, a single-machine environment can no longer meet the needs of data mining. Cluster environments are increasingly used for information collection and data processing, and traditional serial clustering algorithms do not adapt well to these requirements, so parallel clustering algorithms for data streams are needed. This paper parallelizes the serial data stream clustering algorithm D-Stream and uses the general-purpose big data processing framework Spark to design PDStream, a dynamic data clustering algorithm that runs on a distributed architecture. Experimental results show that the algorithm is more efficient, scales well, and can perform dynamic clustering of stream data on a distributed architecture.
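Cite this article: Zhang, B., Li, J. and Fan, L. (2016) Research on Dynamic Clustering Algorithm Based on Spark Framework. Computer Science and Application, 6(11), 715-727. http://dx.doi.org/10.12677/CSA.2016.611086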
