Generalization Error and Spectral Bias of Kernel Gradient Descent in Regression
Abstract: This paper studies kernel gradient descent for least-squares regression. Using the spectral decomposition of the reproducing kernel Hilbert space, we decompose the generalization error of kernel gradient descent into mode errors on the individual eigenspaces and give each mode error as a function of optimization time. This decomposition helps explain the biased way in which kernel gradient descent learns the components of the regression function in different eigenspaces, known as spectral bias, and shows that noise affects the learning of each mode to a different degree. The choice of kernel function and of its hyperparameters is crucial when applying kernel methods; our results validate the task-model alignment theory, which helps in selecting a kernel suited to a given task. Because training a wide neural network is equivalent to kernel gradient descent with the neural tangent kernel, the mode error functions derived here also apply to such networks. The derivation of the main results relies on the spectral decomposition of the covariance operator, the inverse Laplace transform of the matrix exponential, and the anisotropic local law for sample covariance matrices. We validate the results with the Gaussian kernel and the neural tangent kernel on synthetic data and on the MNIST dataset.
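The spectral bias described in the abstract can be illustrated with a minimal sketch. It is not the paper's mode error function: it tracks only the training residual of continuous-time kernel gradient descent (gradient flow) started from zero, for which the residual along the i-th eigenvector of K/n decays as exp(-t λ_i). The Gaussian kernel bandwidth, the sine target, the noise level, and the helper names `gaussian_kernel` and `mode_residuals` are assumptions chosen for illustration.

```python
import numpy as np


def gaussian_kernel(X, Z, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Z."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def mode_residuals(K, y, times):
    """Training residual of kernel gradient flow in the eigenbasis of K / n.

    With f_0 = 0 and loss (1 / 2n) * sum_i (f(x_i) - y_i)^2, the residual on
    the training points is exp(-t K / n) y, so each eigen-direction decays
    independently at a rate set by its eigenvalue.
    """
    n = len(y)
    eigvals, eigvecs = np.linalg.eigh(K / n)      # ascending order
    eigvals = np.clip(eigvals[::-1], 0.0, None)   # descending, clip round-off
    eigvecs = eigvecs[:, ::-1]
    coeffs = eigvecs.T @ y                        # labels in the eigenbasis
    decay = np.exp(-np.outer(times, eigvals))     # shape (len(times), n)
    return eigvals, decay * coeffs[None, :]


# Hypothetical synthetic task: noisy sine on [-1, 1].
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)

K = gaussian_kernel(X, X, bandwidth=0.5)
times = np.array([0.0, 1.0, 10.0, 100.0])
eigvals, resid = mode_residuals(K, y, times)

# Fraction of each mode's initial residual that remains at every time.
remaining = np.exp(-np.outer(times, eigvals))
for t, row in zip(times, remaining):
    print(f"t={t:6.1f}  top mode: {row[0]:.3e}  last mode: {row[-1]:.3e}")
```

In this sketch the directions with large eigenvalues are fitted quickly while those with small eigenvalues barely move, which is the training-side counterpart of the mode-wise behaviour the abstract describes. Noise in `y` that falls in the small-eigenvalue directions is therefore fitted only at large times.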
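Cite this article: Ma Zixiao, Cui Wenquan. Generalization Error and Spectral Bias of Kernel Gradient Descent in Regression [J]. Pure Mathematics (理论数学), 2026, 16(6): 142-154. https://doi.org/10.12677/pm.2026.166164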
