一种面向增量数据的重叠聚类算法研究
Study on an Overlapping Clustering Algorithm for Incremental Data
DOI: 10.12677/csa.2025.1510258, PDF,    科研立项经费支持
作者: 李 峰:河北金融学院河北省金融科技应用重点实验室,河北 保定;韩祝华*:河北金融学院统计与数据科学学院,河北 保定;许美玲:河北金融学院信息与人工智能学院,河北 保定
关键词: 重叠聚类调和均值隶属函数增量数据K-MeansOverlapping Clustering Harmonic Mean Membership Function Incremental Data K-Means
摘要: 数据的在线聚类问题分为完全在线聚类和半在线聚类两种。而增量数据的聚类属于半在线聚类,是在已有数据聚类的基础上进行聚类。另外,由于大多数的聚类算法只会把一个数据点分配给唯一聚类,但很多现实环境的数据集都包含交叉重复信息,而重叠聚类允许一个数据点属于多个聚类。传统的重叠聚类方法overlapping k-means (OKM)作为对k-means的扩展,存在聚类效率低和对聚类中心初始化敏感的问题。为解决以上问题,本文提出一种面向增量数据的重叠k-means混合模型,其主要思想是利用KHM的输出对OKM方法的聚类中心进行初始化。该模型首先应用k调和均值算法与重叠k均值算法来对已有数据进行重叠聚类;然后迭代使用OKM算法对增量数据进行聚类,即k-harmonic means & iterative overlapping k-means混合算法(KHM-IOKM),算法时间复杂度为 O( kn ) 。实验结果显示:在仿真数据集和UCI真实数据集上与传统的OKM和KHM算法进行比较,该方法能够有效降低聚类中心初始化敏感的问题,并且相比于传统的聚类方法能获得更好的聚类效果。
Abstract: The online clustering problem of data is divided into two types: full online clustering and semi-online clustering. The clustering of incremental data belongs to semi-online clustering, which is based on existing data clustering. In addition, most clustering algorithms produce exclusive clusters, meaning that each data point can just belong to one cluster. In fact, a great many real-world datasets have inherently overlapping information, while the overlapping clustering methods allow one data point to belong to more than one cluster. The traditional overlapping clustering method is overlapping k-means (OKM). As an extension of the k-means algorithm, it has the problems of low clustering efficiency and sensitivity to the selection of the initial clustering center. In order to solve the above problems, in this study, we propose an overlapping k-means hybrid model for incremental data, whose main idea is to use the output of KHM to initialize the clustering center of the OKM method. The model first uses k-harmonic mean algorithm and overlapping k-means algorithm to overlap clustering existing data; then iteratively uses OKM algorithm to cluster incremental data, that is k-harmonic means & iterative overlapping k-means hybrid algorithm (KHM-IOKM), the time complexity is O( kn ) . By comparing with the traditional OKM and KHM algorithms on synthetic dataset and UCI real dataset, the experimental results demonstrate that the proposed method can effectively reduce the sensitivity of clustering center initialization and obtain better clustering than traditional clustering methods.
文章引用:李峰, 韩祝华, 许美玲. 一种面向增量数据的重叠聚类算法研究[J]. 计算机科学与应用, 2025, 15(10): 163-175. https://doi.org/10.12677/csa.2025.1510258

参考文献

[1] Choromanska, A. and Monteleoni, C. (2012) AISTATS: Online Clustering with Experts. Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, La Palma, 2 May 2012, 227-235.
[2] Zhu, W., Yin, J. and Xie, Y. (2006) Arbitrary Shape Cluster Algorithm for Clustering Data Stream. Journal of Software, 17, 379-387. [Google Scholar] [CrossRef
[3] 朱蔚恒, 印鉴, 谢益煌. 基于数据流的任意形状聚类算法[J]. 软件学报, 2006, 17(3): 379-387.
[4] Charikar, M., Chekuri, C., Feder, T. and Motwani, R. (1997) Incremental Clustering and Dynamic Information Retrieval. Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, El Paso, 4-6 May 1997, 626-635. [Google Scholar] [CrossRef
[5] Coates, A., Ng, A.Y. and Lee, H. (2011) AISTATS: An Analysis of Single-Layer Networks in Unsupervised Feature Learning. JMLR Proceedings, 15, 215-223.
[6] Cleuziou, G. (2010) Two Variants of the OKM for Overlapping Clustering. In: Guillet, F., Ritschard, G., Zighed, D.A. and Briand, H., Eds., Studies in Computational Intelligence, Springer, 149-166. [Google Scholar] [CrossRef
[7] Ben NCir, C. and Essoussi, N. (2010) KDIR: Kernel Overlapping K-Means for Clustering in Feature Space. International Conference on Knowledge Discovery and Information Retrieval, Valencia, 25-28 October 2010, 250-256.
[8] N’Cir, C.B. and Essoussi, N. (2012) Overlapping Patterns Recognition with Linear and Non-Linear Separations Using Positive Definite Kernels. International Journal of Computer Applications, 56, 1-8. [Google Scholar] [CrossRef
[9] Aroche-Villarruel, A.A., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F., Olvera-López, J.A. and Pérez-Suárez, A. (2014) Study of Overlapping Clustering Algorithms Based on Kmeans through Fbcubed Metric. In: Martínez-Trinidad, J.F., Carrasco-Ochoa, J.A., Olvera-Lopez, J.A., Salas-Rodríguez, J. and Suen, C.Y., Eds., Lecture Notes in Computer Science, Springer International Publishing, 112-121. [Google Scholar] [CrossRef
[10] Zhang, B. (2000) Generalized K-Harmonic Means. Technical Report, Hewlett-Packard Laboratoris.
[11] N’Cir, C.B., Cleuziou, G. and Essoussi, N. (2015) Overview of Overlapping Partitional Clustering Methods. In: Celebi, M., Ed., Partitional Clustering Algorithms, Springer International Publishing, 245-275. [Google Scholar] [CrossRef