面向长序列数据分类的变长子序列最优传输模型
Variable-Length Subsequence Optimal Transport Model for Long Sequence Data Classification
DOI: 10.12677/sa.2026.154076, PDF,    科研立项经费支持
作者: 胡佳美, 李文静, 陈继强*:河北工程大学数理科学与工程学院,河北 邯郸
关键词: 序列数据分类最优传输子序列匹配相似性度量Sequence Data Classification Optimal Transport Subsequence Matching Similarity Measure
摘要: 序列数据分类是数据挖掘领域的重要任务,其性能高度依赖于序列间相似性度量的有效性。动态时间规整及其变体作为当前广泛采用的相似性度量方法,在长序列数据分类中存在计算复杂度高、病态对齐、忽略局部结构特征以及过度最大化相似性等问题。为此,本文提出一种面向长序列数据分类的变长子序列最优传输模型。首先,首次将感知重要点与最优传输理论结合,构造变长子序列,以子序列匹配替代传统逐点匹配,显著降低计算复杂度并缓解病态对齐。其次,为加强序列局部结构特征,引入复杂度修正因子构建代价矩阵。再次,设计位置惩罚正则项,以抑制位置过远子序列的不合理匹配,并缓解了过度最大化相似性问题。最后,采用熵正则化最优传输框架,通过Sinkhorn-Knopp快速迭代算法实现高效求解。在20个UCR数据集上的实验结果表明,所提模型在分类准确率和计算效率方面均显著优于主流对比方法,验证了其在长序列数据分类任务中的有效性与可行性。
Abstract: Sequence data classification is a fundamental task in data mining, and its performance heavily depends on the effectiveness of similarity measures between sequences. Dynamic Time Warping (DTW) and its variants, as widely adopted similarity measures, suffer from high computational complexity, pathological alignments, neglect of local structural characteristics, and excessive similarity amplification when applied to long sequence classification. To address these issues, this paper proposes a variable-length subsequence optimal transport model for long sequence data classification. First, by integrating Perceptually Important Points (PIPs) with optimal transport theory, we construct variable-length subsequences to replace traditional point-wise matching with subsequence-level matching, significantly reducing computational complexity while mitigating pathological alignments. Second, a complexity correction factor is introduced to construct a cost matrix that captures both numerical differences and local structural characteristics between subsequences. Third, a position penalty regularization term is designed to suppress unreasonable matches between subsequences that are far apart in temporal position, effectively alleviating excessive similarity amplification. Finally, the model is formulated within an entropy-regularized optimal transport framework and efficiently solved using the Sinkhorn-Knopp iterative algorithm. Experimental results on 20 UCR datasets demonstrate that the proposed model significantly outperforms mainstream methods in both classification accuracy and computational efficiency, validating its effectiveness and feasibility for long sequence data classification tasks.
文章引用:胡佳美, 李文静, 陈继强. 面向长序列数据分类的变长子序列最优传输模型[J]. 统计学与应用, 2026, 15(4): 112-126. https://doi.org/10.12677/sa.2026.154076

参考文献

[1] Toth-Laufer, E. and Batyrshin, I.Z. (2022) Similarity-Based Personalized Risk Calculation. 2022 IEEE 16th International Symposium on Applied Computational Intelligence and Informatics (SACI), Timisoara, 25-28 May 2022, 129-134. [Google Scholar] [CrossRef
[2] Annapurna, N. and Sireesha, V. (2024) Optimizing Investment Selection through Similarity Measurement with Type-2 Intuitionistic Fuzzy Sets. SN Computer Science, 5, Article No. 939. [Google Scholar] [CrossRef
[3] Gao, H., Huo, X., Jiang, Y., Zhu, C. and He, C. (2025) A DTW-Gaussian Spatiotemporal Self-Attention Network and Its Application in Industrial Fault Diagnosis with Unequal-Length Sensor Data. IEEE Transactions on Industrial Informatics, 21, 6188-6197. [Google Scholar] [CrossRef
[4] Sakoe, H. and Chiba, S. (1978) Dynamic Programming Algorithm Optimization for Spoken Word Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26, 43-49. [Google Scholar] [CrossRef
[5] 魏国强, 周从华, 张婷. 基于多维分段和动态权重DTW的多元时间序列相似性度量方法[J]. 计算机与数字工程, 2021, 49(11): 2299-2304, 2406.
[6] Zhang, Q., Zhang, C., Cui, L., Han, X., Jin, Y., Xiang, G., et al. (2022) A Method for Measuring Similarity of Time Series Based on Series Decomposition and Dynamic Time Warping. Applied Intelligence, 53, 6448-6463. [Google Scholar] [CrossRef
[7] 张艳, 史晨辉, 苏美红, 等. 基于邻域形状的时间序列相似性度量[J]. 计算机工程与设计, 2025, 46(6): 1578-1585.
[8] Li, H. (2021) Time Works Well: Dynamic Time Warping Based on Time Weighting for Time Series Data Mining. Information Sciences, 547, 592-608. [Google Scholar] [CrossRef
[9] Hao, S., Wang, Z. and Yuan, J. (2023) Local Morphological Patterns for Time Series Classification. Intelligent Data Analysis, 27, 653-674. [Google Scholar] [CrossRef
[10] Hong, J.Y., Park, S.H. and Baek, J. (2020) SSDTW: Shape Segment Dynamic Time Warping. Expert Systems with Applications, 150, Article ID: 113291. [Google Scholar] [CrossRef
[11] Li, J., Zhao, J., Wang, F., Kawata, S., Chugo, D. and She, J. (2025) Features Based Dynamic Time Warping of Multidimensional Series for Aligning Sensor Data. IECON 2025—51st Annual Conference of the IEEE Industrial Electronics Society, Madrid, 14-17 October 2025, 1-6. [Google Scholar] [CrossRef
[12] Qiu, L., Qiu, C. and Song, C. (2024) ESDTW: Extrema-Based Shape Dynamic Time Warping. Expert Systems with Applications, 239, Article ID: 122432. [Google Scholar] [CrossRef
[13] 金俊超, 马昌忠, 陈国良, 等. 基于UCR-DTW的地磁序列定位算法[J]. 合肥工业大学学报(自然科学版), 2021, 44(11): 1551-1556.
[14] Lahreche, A. and Boucheham, B. (2021) A Fast and Accurate Similarity Measure for Long Time Series Classification Based on Local Extrema and Dynamic Time Warping. Expert Systems with Applications, 168, Article ID: 114374. [Google Scholar] [CrossRef
[15] Ge, L. and Chen, S. (2020) Exact Dynamic Time Warping Calculation for Weak Sparse Time Series. Applied Soft Computing, 96, Article ID: 106631. [Google Scholar] [CrossRef
[16] Chaerul Haviana, S.F. (2015) Sistem Gesture Accelerometer Dengan Metode Fast Dynamic Time Warping (FastDTW) Jurnal Sistem Informasi Bisnis, 5, 151-160. [Google Scholar] [CrossRef
[17] Tan, C.W., Petitjean, F. and Webb, G.I. (2019) Elastic Bands across the Path: A New Framework and Method to Lower Bound DTW. In: Berger-Wolf, T. and Chawla, N., Eds., Proceedings of the 2019 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, Alberta, 2-4 May 2019, 522-530. [Google Scholar] [CrossRef
[18] Webb, G.I. and Petitjean, F. (2021) Tight Lower Bounds for Dynamic Time Warping. Pattern Recognition, 115, Article ID: 107895. [Google Scholar] [CrossRef
[19] Ye, L. and Keogh, E. (2009) Time Series Shapelets: A New Primitive for Data Mining. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, 28 June-July 2009, 947-956. [Google Scholar] [CrossRef
[20] Gharghabi, S., Imani, S., Bagnall, A., Darvishzadeh, A. and Keogh, E. (2018) Matrix Profile XII: MPdist: A Novel Time Series Distance Measure to Allow Data Mining in More Challenging Scenarios. 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 17-20 November 2018, 965-970. [Google Scholar] [CrossRef
[21] Hirschberg, D.S. (1977) Algorithms for the Longest Common Subsequence Problem. Journal of the ACM, 24, 664-675. [Google Scholar] [CrossRef
[22] Stefan, A., Athitsos, V. and Das, G. (2013) The Move-Split-Merge Metric for Time Series. IEEE Transactions on Knowledge and Data Engineering, 25, 1425-1438. [Google Scholar] [CrossRef
[23] Zhao, J. and Itti, L. (2018) ShapeDTW: Shape Dynamic Time Warping. Pattern Recognition, 74, 171-184. [Google Scholar] [CrossRef
[24] Mishra, K., Basu, S. and Maulik, U. (2021) SeqDTW: A Segmentation Based Distance Measure for Time Series Data. Transactions of the Indian National Academy of Engineering, 6, 709-730. [Google Scholar] [CrossRef
[25] 郝石磊, 王志海, 刘海洋. 基于局部梯度和二进制模式的时间序列分类算法[J]. 软件学报, 2022, 33(5): 1817-1832.
[26] 胡萌窈, 王鹏, 汪卫. 基于子序列相似性的时间序列在线状态识别[J/OL]. 计算机应用与软件: 1-8.
https://link.cnki.net/urlid/31.1260.TP.20240725.1401.007, 2026-01-22.
[27] Zhang, H., Feng, J., Li, J. and Yao, Q. (2024) Efficient Top-k DTW-Based Sensor Data Similarity Search Using Perceptually Important Points and Dual-Bound Filtering. IEEE Sensors Journal, 24, 41231-41242. [Google Scholar] [CrossRef
[28] Villani, C. (2021) Topics in Optimal Transportation. Rhode Island: American Mathematical Society.
[29] Zhang, Z., Tang, P. and Corpetti, T. (2020) Time Adaptive Optimal Transport: A Framework of Time Series Similarity Measure. IEEE Access, 8, 149764-149774. [Google Scholar] [CrossRef
[30] Su, B. and Hua, G. (2019) Order-preserving Optimal Transport for Distances between Sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41, 2961-2974. [Google Scholar] [CrossRef] [PubMed]
[31] Latorre, F., Liu, C., Sahoo, D. and Hoi, S.C.H. (2023) OTW: Optimal Transport Warping for Time Series. ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, 4-10 June 2023, 1-5. [Google Scholar] [CrossRef
[32] Chung, F., Fu, T., Luk, R.W. and Ng, V.T.Y. (2001) Flexible Time Series Pattern Matching Based on Perceptually Important Points. Workshop on Learning from Temporal and Spatial Data in International Joint Conference on Artificial Intelligence, Seattle, 6-12 August 2001, 1-7.
[33] Batista, G.E.A.P.A., Keogh, E.J., Tataw, O.M. and de Souza, V.M.A. (2013) CID: An Efficient Complexity-Invariant Distance for Time Series. Data Mining and Knowledge Discovery, 28, 634-669. [Google Scholar] [CrossRef
[34] Dau, H.A., Bagnall, A., Kamgar, K., Yeh, C., Zhu, Y., Gharghabi, S., Ratanamahatana, C.A. and Keogh, E.J. (2018) The UCR Time Series Archive. IEEE/CAA Journal of Automatica Sinica, 6, 1293-1305.
[35] 孟晓静, 万源. 自适应代价动态时间弯曲的多元时间序列相似性度量[J]. 统计与决策, 2020, 36(2): 25-29.
[36] Demiar, J. and Schuurmans, D. (2006) Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7, 1-30.