Construction of a Machine Learning Disease Prediction Model Based on Metagenomic Analysis
Abstract: Advances in high-throughput sequencing have greatly expanded metagenomic databases, making it possible to use them to study human health and disease. Disease prediction based on analysis of the human gut microbiome has become a representative research direction. In this study, we use taxonomic gut microbiota data at the phylum level, that is, operational taxonomic unit (OTU) data, and propose two new classes of machine learning classification algorithms that combine non-negative matrix factorization and variational autoencoders. The algorithms aim to extract key information from the gut microbiota in order to identify patients with disease. Through dimensionality reduction, data generation, and the introduction of penalty constraint terms, we improve predictive performance and reduce overfitting. On simulated data, liver cirrhosis data, and diabetes data, our models perform well, reaching AUC values of 0.926, 0.956, and 0.745, respectively.
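As a rough illustration of the kind of pipeline the abstract describes (dimensionality reduction by non-negative matrix factorization, followed by classification and AUC evaluation), here is a minimal sketch. It is not the paper's algorithm: it uses the plain Lee and Seung multiplicative updates without any penalty term, a nearest-centroid classifier chosen only for brevity, and a synthetic abundance table. The helper names (`make_toy_data`, `nmf`, `auc`), the rank of 2, and the six-taxon toy data are all assumptions made for this example.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Factor V (n x m, non-negative) as W (n x rank) times H (rank x m)
    using multiplicative updates for the squared Frobenius error."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return W, H


def frobenius_error(V, W, H):
    R = matmul(W, H)
    return sum((v - r) ** 2 for rv, rr in zip(V, R) for v, r in zip(rv, rr)) ** 0.5


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_toy_data(n_per_class=20, n_taxa=6, seed=1):
    """Synthetic relative abundances (hypothetical): label 1 samples have
    two taxa enriched. Rows sum to 1."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            counts = [rng.random() + 0.2 for _ in range(n_taxa)]
            if label == 1:
                counts[0] += 1.5
                counts[1] += 1.0
            total = sum(counts)
            X.append([c / total for c in counts])
            y.append(label)
    return X, y


def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def sqdist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


X, y = make_toy_data()
W, H = nmf(X, rank=2)  # NMF is fitted without labels; W holds the latent features

# Even-indexed samples train the classifier, odd-indexed samples test it.
train = [i for i in range(len(X)) if i % 2 == 0]
test = [i for i in range(len(X)) if i % 2 == 1]
c0 = centroid([W[i] for i in train if y[i] == 0])
c1 = centroid([W[i] for i in train if y[i] == 1])

# Higher score means closer to the disease centroid than to the healthy one.
scores = [sqdist(W[i], c0) - sqdist(W[i], c1) for i in test]
test_auc = auc(scores, [y[i] for i in test])

print(f"reconstruction error: {frobenius_error(X, W, H):.4f}")
print(f"held-out AUC on toy data: {test_auc:.3f}")
```

Because the factorization here is fitted on all samples before the split, the held-out AUC is transductive. A stricter evaluation would fit `H` on the training samples only and then project the test samples onto it.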
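Cite this article: Zhang, Yudong (2024) Construction of a Machine Learning Disease Prediction Model Based on Metagenomic Analysis. Advances in Applied Mathematics, 13(1), 199-207. https://doi.org/10.12677/AAM.2024.131023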
