Optimization of Anti-Breast Cancer Drug Candidates Based on Machine Learning
DOI: 10.12677/ORF.2022.121007
Authors: 李荟霖 (Li Huilin), 李明英 (Li Mingying): School of Mathematics and Statistics, Guizhou University, Guiyang, Guizhou
Keywords: Feature Selection, QSAR Model, Machine Learning
Abstract: Compounds that antagonize ERα activity are potential drug candidates for treating breast cancer, so studying them matters for combating the disease. This paper proposes a workflow of data preprocessing, feature selection, and model prediction for experimental data on anti-breast cancer drug candidates, with the objective of obtaining new compound molecules with better biological activity. First, an abnormal-sample detection model based on k-means clustering and Andrews curves is used to remove abnormal samples. Second, the 729 molecular descriptors in the samples are screened, and the 20 descriptors with the most significant influence on biological activity are retained; for this step, five methods drawn from three categories of feature screening are combined into a multi-dimensional feature weighted extraction model. Finally, a QSAR model of the compounds' biological activity against ERα is constructed: with pIC50 as the dependent variable and the 20 selected molecular descriptors as independent variables, XGBoost and LightGBM machine learning models are built, grid search is used to obtain the optimal model parameters, and the predictions of the more effective model are retained.
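The Andrews-curve step of the abnormal-sample screen can be illustrated with a minimal sketch. It maps each sample vector x to the curve f_x(t) = x_1/√2 + x_2 sin t + x_3 cos t + x_4 sin 2t + ... on [-π, π] and ranks samples by how far their curve lies from the pointwise median curve. The function names, the median reference curve, the RMS distance, and the toy data are all assumptions made for illustration; the abstract does not specify the authors' exact procedure, and their k-means component is not reproduced here.

```python
import math
import statistics


def andrews_curve(x, t):
    """Evaluate the Andrews curve of feature vector x at angle t (radians)."""
    value = x[0] / math.sqrt(2.0)
    for i, xi in enumerate(x[1:], start=1):
        k = (i + 1) // 2  # harmonic order: 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            value += xi * math.sin(k * t)
        else:
            value += xi * math.cos(k * t)
    return value


def curve_distances(samples, n_points=50):
    """RMS distance of each sample's curve from the pointwise median curve.

    Hypothetical scoring rule: a large distance marks a candidate outlier.
    """
    ts = [-math.pi + 2.0 * math.pi * j / (n_points - 1) for j in range(n_points)]
    curves = [[andrews_curve(x, t) for t in ts] for x in samples]
    median_curve = [statistics.median(column) for column in zip(*curves)]
    return [
        math.sqrt(sum((c - m) ** 2 for c, m in zip(curve, median_curve)) / n_points)
        for curve in curves
    ]


# Toy, made-up descriptor vectors (assumed already standardized);
# the last one is deliberately far from the others.
samples = [
    [0.1, 0.2, -0.1],
    [0.0, 0.1, 0.1],
    [-0.1, -0.2, 0.0],
    [5.0, -4.0, 6.0],
]
distances = curve_distances(samples)
most_abnormal = max(range(len(samples)), key=lambda i: distances[i])
```

In practice the descriptors would be standardized first, since the curve weights each coordinate by its raw magnitude, and a cutoff on the distance would have to be chosen for the data at hand.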
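Citation: Li, H.L. and Li, M.Y. (2022) Optimization of Anti-Breast Cancer Drug Candidates Based on Machine Learning. Operations Research and Fuzziology, 12(1), 68-80. https://doi.org/10.12677/ORF.2022.121007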
