Variable Selection Control Chart Based on Classification Algorithm
Abstract: Theoretical methods for high-dimensional sparse process monitoring must be validated on real data before they can be applied in practice. Monitoring in bioinformatics and industrial production faces non-ideal data distributions, complex correlations among variables, and substantial noise. To address these difficulties, this paper builds on a previously proposed variable selection theory based on combined L0-L2 regularization. Combined L0-L2 regularization selects variables and shrinks coefficients while handling correlated features efficiently, so that abnormal variables can be identified accurately. A logistic regression model is used to detect shifts in specific directions, and a maximum function dynamically fuses the two components into a direction-adaptive monitoring statistic. The resulting variable selection control chart (LQSVS) incorporates a classification algorithm to handle high-dimensional, sparse classification problems; earlier work on it focused mainly on theoretical innovation and simulation studies. This paper validates and tunes the method on real data, using the UCI E. coli protein dataset. First, the data are preprocessed according to their characteristics, and Bootstrap resampling is used to improve the computation of control limits. Second, the optimal parameters are determined through controlled-variable experiments. Finally, with the in-control average run length fixed at ARL₀ = 200, the out-of-control (OC) average run length ARL₁ of the method is as low as 1.68, significantly better than that of traditional control charts. The experimental results show that the proposed method effectively addresses the problems of “low sensitivity in detecting sparse shifts and difficult parameter adaptation” in real high-dimensional data, and provides a practical tool for applications such as protein localization monitoring and industrial multivariate process diagnosis.
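The abstract does not spell out how the Bootstrap control limit is computed, so the following is only a minimal sketch of one common approach, not the authors' procedure. It assumes a Shewhart-type chart whose monitoring statistics are independent over time, in which case the per-sample false-alarm probability is 1/ARL₀ (0.005 for ARL₀ = 200) and the upper control limit can be taken as the corresponding upper quantile of in-control statistics, averaged over bootstrap resamples. The function names `bootstrap_control_limit` and `run_length` are hypothetical, and the standard-normal data in the usage lines are simulated purely for illustration.

```python
import numpy as np


def bootstrap_control_limit(ic_stats, arl0=200.0, n_boot=500, seed=0):
    """Estimate an upper control limit from in-control monitoring statistics.

    Assumption: statistics are independent over time, so the per-sample
    false-alarm probability is alpha = 1 / ARL0.
    """
    rng = np.random.default_rng(seed)
    ic_stats = np.asarray(ic_stats, dtype=float)
    alpha = 1.0 / arl0
    limits = np.empty(n_boot)
    for b in range(n_boot):
        # Resample the in-control statistics with replacement.
        resample = rng.choice(ic_stats, size=ic_stats.size, replace=True)
        # Upper (1 - alpha) quantile of this bootstrap sample.
        limits[b] = np.quantile(resample, 1.0 - alpha)
    # Average the bootstrap quantiles to stabilise the estimate.
    return float(limits.mean())


def run_length(stats, limit):
    """Return the 1-based index of the first statistic above the limit, or None."""
    for t, value in enumerate(stats, start=1):
        if value > limit:
            return t
    return None


# Illustrative usage on simulated (not real) in-control statistics.
simulated_ic = np.random.default_rng(1).normal(size=5000)
ucl = bootstrap_control_limit(simulated_ic, arl0=200.0)
first_signal = run_length([0.1, 0.5, 3.5, 4.0], ucl)
```

Averaging run lengths from `run_length` over many out-of-control sequences would give an empirical ARL₁ of the kind reported above; the specific value 1.68 depends on the paper's statistic and data and is not reproduced by this sketch.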