Optimization Modeling of Candidate Drugs for Breast Cancer Treatment
DOI: 10.12677/mos.2025.144266
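Author: Niu Yin (牛寅), School of Data Science, Xianda College of Economics and Humanities, Shanghai International Studies University, Shanghai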
Keywords: pIC50, Random Forest Algorithm, Adaptive LASSO, BP Neural Network
Abstract: New cases of female breast cancer have exceeded those of lung cancer for the first time, making breast cancer the most common cancer worldwide, so understanding its pathogenesis is crucial. Research has shown that ERα is an important target for breast cancer treatment, and a deeper study of ERα and its related properties would bring hope to patients. This study uses data mining methods, chiefly adaptive LASSO, the random forest algorithm, and the BP neural network, to build a series of variable screening models and a prediction model for the pIC50 bioactivity value. For Problem 1, the 729 molecular descriptors of 1974 compounds were standardized to remove the effect of differing measurement scales. Model 1 applies the random forest algorithm directly to screen and rank the features. Model 2 first uses LASSO to select 105 important variables from the original set, discarding irrelevant or weakly significant ones, and then uses the random forest algorithm to rank the retained variables by importance. Comparing the two models, 5 of the main variables they select are the same; the variables selected by the combined model all pass the significance test and give a relatively high goodness of fit, whereas the single model's screening results contain 3 non-significant variables. The screening and ranking produced by the combined LASSO-random forest model were therefore taken as the final output. For Problem 2, the dataset was split into a training set and a test set at a ratio of 8:2. First, building on the results of Problem 1, stepwise regression and adaptive LASSO were each applied to the training set for a second round of screening, and a voting principle was used to settle on 17 variables as the final screening result and as the input-layer variables of the subsequent neural network. Second, because the relationship between the dependent variable and the independent variables is nonlinear, a BP neural network was used for modeling. Feeding the known test-set samples into the trained model gave a good fit, indicating that the constructed model is effective.
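The screening-then-prediction workflow described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the author's implementation: it assumes scikit-learn, uses synthetic stand-in data rather than the 1974-compound dataset, substitutes plain `LassoCV` for the adaptive LASSO and stepwise-regression voting step, and uses `MLPRegressor` as a stand-in for the BP neural network. All sizes, seeds, and the top-5 cutoff are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the paper's dataset): 300 "compounds" with
# 40 "descriptors"; only descriptors 0, 1 and 2 drive the response.
rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 40))
y = (3.0 * Z[:, 0] - 2.0 * Z[:, 1] + 1.5 * Z[:, 2]
     + 0.5 * Z[:, 0] * Z[:, 1]              # mild nonlinearity
     + rng.normal(scale=0.1, size=300))     # noise
X = Z * rng.uniform(0.1, 50.0, size=40)     # give columns very different scales

# Step 1: standardize the descriptors so scale differences do not matter.
X_std = StandardScaler().fit_transform(X)

# Step 2: LASSO screening; keep descriptors with non-zero coefficients.
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)
kept = np.flatnonzero(lasso.coef_ != 0)

# Step 3: rank the retained descriptors by random forest importance.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_std[:, kept], y)
ranked = kept[np.argsort(forest.feature_importances_)[::-1]]

# Step 4: 8:2 train/test split, then a small feed-forward network trained
# by backpropagation on the top-ranked descriptors (cutoff is arbitrary).
top = ranked[:5]
X_train, X_test, y_train, y_test = train_test_split(
    X_std[:, top], y, test_size=0.2, random_state=0)
net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
net.fit(X_train, y_train)
r2 = net.score(X_test, y_test)  # R^2 on the held-out 20%

print("descriptors kept by LASSO:", len(kept))
print("top-ranked descriptors:", top.tolist())
print("test R^2:", round(r2, 3))
```

The sketch follows the order given in the abstract, standardizing and screening on the full dataset before the 8:2 split; the held-out score is what the abstract refers to when it says the test-set samples fit well.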
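Cite this article: Niu, Y. Optimization Modeling of Candidate Drugs for Breast Cancer Treatment [J]. Modeling and Simulation (建模与仿真), 2025, 14(4): 73-85. https://doi.org/10.12677/mos.2025.144266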
