Construction of Breast Cancer Prognosis Prediction Model Based on Multi-Omics Data
Abstract: This paper selects three types of omics data (copy number variation, RNA gene expression, and RNA exon expression) from the TCGA breast cancer data curated in the UCSC Xena database. Because the dimensionality of each omics data set far exceeds the sample size, each data set is first passed through a variance threshold filter to remove variables that vary little. mRMR, a filter-type variable selection method, is then applied: it maximizes the correlation between the features and the class variable while minimizing the correlation among the features, and it yields 50 variables for each omics data set. The discrete phenotype variable, measured in days, is converted into a 0-1 class variable by thresholding. The dependent and independent variables are then merged and split into a training set and a test set, and four methods (SVM, XGBoost, logistic regression, and random forest) are used to predict the prognosis outcome. The four algorithms are compared on the training set using specific metrics, and XGBoost and logistic regression are found to predict better than SVM and random forest.
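The selection steps described in the abstract (variance threshold filtering, greedy mRMR selection, and thresholding the day counts into a 0-1 label) can be illustrated with a minimal sketch. It is not the author's implementation: the toy feature names and values, the 1825-day cutoff, the 0.01 variance threshold, and the use of absolute Pearson correlation as the relevance and redundancy measure are all assumptions made for illustration. The paper does not state its thresholds, and mRMR is often computed with mutual information instead.

```python
import math


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def abs_pearson(x, y):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return abs(cov / (sx * sy))


def variance_filter(columns, min_variance):
    """Return the names of columns whose variance exceeds min_variance."""
    return [name for name, col in columns.items() if variance(col) > min_variance]


def mrmr_select(columns, candidates, labels, k):
    """Greedy mRMR in difference form: relevance minus mean redundancy."""
    relevance = {name: abs_pearson(columns[name], labels) for name in candidates}
    selected, remaining = [], list(candidates)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            abs_pearson(columns[name], columns[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def binarize_days(days, threshold):
    """Map a day count to 1 if it reaches the threshold, else 0."""
    return [1 if d >= threshold else 0 for d in days]


# Toy data (hypothetical): eight samples, four features.
days = [100, 200, 300, 400, 2000, 2100, 2200, 2300]
columns = {
    "gene_a": [1, 1, 2, 1, 5, 6, 5, 6],        # strongly tied to the label
    "gene_a_copy": [1, 2, 2, 1, 5, 6, 4, 6],   # nearly redundant with gene_a
    "gene_b": [2, 3, 1, 2, 4, 3, 4, 3],        # weaker but less redundant
    "flat": [3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.01],  # almost constant
}

labels = binarize_days(days, threshold=1825)        # assumed cutoff
kept = variance_filter(columns, min_variance=0.01)  # assumed threshold
selected = mrmr_select(columns, kept, labels, k=2)
print(labels, kept, selected)
```

In this toy example the almost constant feature is dropped by the variance filter, and the second mRMR pick is the less redundant feature, not the near copy of the first pick. The paper keeps 50 variables per omics data set; `k=2` is used here only to keep the example small.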
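Cite this article: Su, J.Y. (2022) Construction of Breast Cancer Prognosis Prediction Model Based on Multi-Omics Data. Advances in Applied Mathematics, 11(9), 6723-6729. https://doi.org/10.12677/AAM.2022.119713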
