Smart “Last-Mile” Delivery for E-Commerce O2O Platforms: Multi-Objective UAV Trajectory Optimization Based on Deep Reinforcement Learning
Abstract: With the rapid growth of e-commerce, last-mile delivery efficiency has become a key determinant of the core competitiveness of O2O platforms. Targeting the triple challenges of timeliness, accuracy, and safety faced by O2O (Online-to-Offline) platforms such as Lianjia in business scenarios including document delivery, key distribution, and contract signing, this study proposes a multi-objective UAV path-planning method based on Deep Reinforcement Learning (DRL). Unlike existing studies, this work constructs a Markov Decision Process (MDP) model tailored to the characteristics of O2O instant logistics. It adopts an Actor-Critic architecture combined with a multi-head self-attention mechanism to process variable-length customer-point sequences, and introduces a Graph Neural Network (GNN) to capture the topology of urban geographic space. The core contribution is a hierarchical multi-objective reward function that, through a dynamic weight-adjustment mechanism, jointly optimizes three conflicting objectives: delivery efficiency (path length and time-window satisfaction rate), safety (obstacle avoidance and risk aversion), and operating cost (energy consumption and load-utilization rate). Experimental results show that, in a dynamic urban simulation environment with 50-200 customer points, the proposed method shortens total path length by 18.7% on average, reduces computation time by 92.3%, and improves dynamic-environment adaptability by 65.4% compared with traditional optimization algorithms such as the Genetic Algorithm (GA) and Ant Colony Optimization (ACO); compared with a baseline DRL method, it improves sample efficiency by 40.2% and multi-objective balancing performance by 22.8%.
This study demonstrates the effectiveness of DRL on complex, dynamic combinatorial optimization problems and provides a deployable technical solution for building intelligent logistics systems on O2O platforms, offering both theoretical value and practical applicability.
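As a minimal illustrative sketch (not the authors' released code), the dynamic-weight combination of the three objectives described above could take the following shape. The softmax weighting, the normalization of each objective to [0, 1], and all function and parameter names here are assumptions introduced for illustration; the paper's exact scheme is not reproduced here.

```python
import math

def dynamic_weights(logits):
    """Softmax over preference logits -> positive weights summing to 1.

    In a DRL setting these logits could be adapted over training or per
    state (hypothetical mechanism); with equal logits all objectives are
    weighted equally.
    """
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def multi_objective_reward(efficiency, safety, cost, logits=(0.0, 0.0, 0.0)):
    """Combine three per-step objective scores into one scalar reward.

    Each score is assumed pre-normalized to [0, 1] with higher meaning
    better (so `cost` is already a cost *saving* score, not raw energy).
    """
    w = dynamic_weights(list(logits))
    return w[0] * efficiency + w[1] * safety + w[2] * cost
```

With the default equal logits each weight is 1/3, so `multi_objective_reward(0.9, 0.6, 0.3)` evaluates to 0.6; shifting a logit upward smoothly shifts priority toward that objective without any weight ever becoming negative.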
Article citation: Li, Y.C., Dang, Y.Z. and Yang, C. Smart “Last-Mile” Delivery for E-Commerce O2O Platforms: Multi-Objective UAV Trajectory Optimization Based on Deep Reinforcement Learning[J]. E-Commerce Letters, 2026, 15(4): 583-594. https://doi.org/10.12677/ecl.2026.154434
