Data Attribute Noise Repair Method Based on the Differentiated Creative Search Algorithm and Its Application
Abstract: Water quality data are often corrupted by attribute noise during collection and transmission. This noise undermines the accuracy and reliability of the data and, in turn, degrades water quality evaluation and classification. To address the problem, a data attribute noise repair method based on the Differentiated Creative Search (DCS) algorithm is proposed and applied to water quality evaluation. Building on DCS, the method sets up an attribute noise repair framework: intuitionistic fuzzy sets are used to identify samples that contain attribute noise, the entropy weight method is used to compute a weight for each attribute, and a noise score is then obtained for each attribute value, so that noisy attribute values in the data can be identified. To verify its effectiveness, the method was compared with six other attribute noise preprocessing methods on eight public datasets and was applied to surface water quality evaluation at the Qilipu section of the Yellow River Basin.
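For readers unfamiliar with the entropy weight method named in the abstract, the following is a minimal sketch of its textbook form, not the authors' implementation. The function name `entropy_weights`, the assumption of non-negative inputs, and the handling of all-zero columns are illustrative choices; the abstract does not state which normalisation the paper uses or how the weights enter the noise score.

```python
import numpy as np


def entropy_weights(X):
    """Return one weight per attribute (column) via the textbook entropy weight method.

    X is assumed to be a non-negative array of shape (n_samples, n_attributes)
    with n_samples > 1, e.g. after min-max normalisation (an assumption here).
    """
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    col_sums = X.sum(axis=0)
    safe_sums = np.where(col_sums > 0, col_sums, 1.0)
    # Share of each sample within its attribute column; an all-zero column
    # is treated as uniform, i.e. as carrying no information.
    P = np.where(col_sums > 0, X / safe_sums, 1.0 / n)
    # Shannon entropy per attribute scaled to [0, 1]; 0 * log(0) is taken as 0.
    logP = np.log(np.where(P > 0, P, 1.0))
    E = -(P * logP).sum(axis=0) / np.log(n)
    # Lower entropy means more dispersion, hence a larger weight.
    d = np.clip(1.0 - E, 0.0, None)
    if d.sum() < 1e-12:
        # Every attribute is uniform: fall back to equal weights.
        return np.full(m, 1.0 / m)
    return d / d.sum()


if __name__ == "__main__":
    # Hypothetical toy data: the first attribute is constant, the second varies.
    data = [[1.0, 1.0], [1.0, 3.0], [1.0, 5.0]]
    print(entropy_weights(data))
```

In this sketch a constant attribute receives a weight close to zero, because it cannot help distinguish samples, while attributes whose values are more unevenly distributed receive larger weights.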
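Cite this article: Zhang Dongqing, Han Ranran, Chen Jiqiang. Data Attribute Noise Repair Method Based on the Differentiated Creative Search Algorithm and Its Application [J]. Statistics and Application (统计学与应用), 2025, 14(4): 201-215. https://doi.org/10.12677/sa.2025.144102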
