Learning from Heterogeneous Temporal Data Based on Electronic Health Records
DOI: 10.12677/CSA.2020.101001. Supported by the National Natural Science Foundation of China.
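Authors: Liang Min (梁敏)*, Lu Qian (陆迁), Li Ningning (李宁宁), Mo Yuchang (莫毓昌): Fujian Province University Key Laboratory of Computational Science, School of Mathematical Sciences, Huaqiao University, Quanzhou, Fujian; Lin Dong (林栋): College of Acupuncture and Moxibustion, Fujian University of Traditional Chinese Medicine, Fuzhou, Fujian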
Keywords: Electronic Health Record; Random Subsequences; Clustering Sequences; Machine Learning
Abstract: Electronic health records contain large amounts of longitudinal data that are valuable for biomedical informatics research. However, the complex structure of these data, including clinical events that are unevenly distributed over time, poses challenges for standard learning algorithms. Some approaches to modeling temporal data rely on extracting a single value from each time series, which discards potentially valuable sequential information. How best to account for the temporality of clinical data therefore remains an important research question. This paper studies new representations of temporal data in electronic health records that preserve sequential information and can be processed directly by standard machine learning algorithms. The approaches studied build on symbolic representations of time series data and take several different forms. An empirical study on datasets of clinical measurements drawn from a real-world electronic health record database shows that applying a distance measure to random subsequences significantly improves predictive performance compared with using the original sequences or clustered sequences. The proposed representations better account for the temporality of clinical events, which is critical for prediction tasks in the biomedical domain.
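As a rough illustration of the idea summarized in the abstract, the sketch below turns unequal-length measurement series into symbol strings and then derives one numeric feature per patient from the distance to a randomly drawn subsequence. The abstract does not specify the discretization or the distance measure, so the SAX-style breakpoints, the Levenshtein edit distance, the sliding-window minimum, and all function names and sample values here are assumptions made for illustration, not the authors' implementation.

```python
import random
import statistics


def symbolize(values, breakpoints=(-0.43, 0.43), alphabet="abc"):
    """Z-normalize a numeric series and map each value to a symbol.

    The breakpoints split a standard normal into three equiprobable bins
    (an assumed SAX-style discretization with alphabet size 3).
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values) or 1.0  # constant series: avoid dividing by zero
    symbols = []
    for v in values:
        z = (v - mean) / std
        symbols.append(alphabet[sum(z > b for b in breakpoints)])
    return "".join(symbols)


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def min_subsequence_distance(sequence, pattern):
    """Smallest edit distance between pattern and any equal-length window of sequence."""
    width = len(pattern)
    if len(sequence) <= width:
        return edit_distance(sequence, pattern)
    return min(
        edit_distance(sequence[i:i + width], pattern)
        for i in range(len(sequence) - width + 1)
    )


def random_subsequence(sequences, rng, min_len=2):
    """Draw a random contiguous subsequence from a randomly chosen sequence."""
    source = rng.choice([s for s in sequences if len(s) >= min_len])
    length = rng.randint(min_len, len(source))
    start = rng.randint(0, len(source) - length)
    return source[start:start + length]


# Hypothetical per-patient lab measurements, unevenly long on purpose.
patients = [
    [4.1, 4.3, 5.9, 6.2],
    [5.0, 4.8, 4.9, 5.1, 5.0, 4.7],
    [7.2, 6.1, 5.0],
]
rng = random.Random(0)
sequences = [symbolize(p) for p in patients]
pattern = random_subsequence(sequences, rng)
# One fixed-length numeric feature per patient, usable by a standard learner.
features = [min_subsequence_distance(s, pattern) for s in sequences]
```

Because every patient ends up with the same number of features regardless of how many measurements were recorded, the output can be passed to any standard tabular learner; drawing several random subsequences would simply add more feature columns.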
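Cite this article: Liang Min, Lu Qian, Li Ningning, Lin Dong, Mo Yuchang. Learning from Heterogeneous Temporal Data Based on Electronic Health Records [J]. Computer Science and Application, 2020, 10(1): 1-10. https://doi.org/10.12677/CSA.2020.101001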
