Study on End-Edge-Cloud Collaborative Multimodal Real-Time Content Moderation for E-Commerce Live Streaming
Abstract: With the rapid growth of e-commerce live-streaming platforms, a business model that tightly couples content production, real-time interaction, and instant transaction has markedly improved user engagement and conversion efficiency, but it has also amplified content-safety and transaction-compliance risks. Conventional centralized moderation architectures struggle under high concurrency, weak network connectivity, and increasingly strict privacy and compliance requirements: latency fluctuates widely, bandwidth and computing costs are high, and sensitive data is broadly exposed. To address these problems, this paper proposes an end-edge-cloud collaborative technical framework for multimodal real-time content moderation in e-commerce live streaming. Building on the theoretical foundations of multimodal content moderation, edge-computing collaboration, and lightweight model deployment, the paper constructs a “StreamID-Segment-Event” streaming data model for continuous live streams and designs a hierarchical detection framework of “L1 fast filtering, L2 edge-side fine-grained detection, and L3 cloud-side review”. By combining visual detection, audio recognition, text understanding, and cross-modal consistency verification, the framework identifies risks such as non-compliant content, false advertising, and misleading marketing in real time. The paper also proposes an edge inference optimization scheme based on knowledge distillation, quantization, heterogeneous acceleration, and elastic containerized deployment to improve real-time performance, accuracy, and scalability under resource constraints. At the engineering governance level, it designs a closed-loop governance mechanism covering security and compliance, audit trails, human-machine collaborative review, and online evaluation, and builds a KPI system spanning timeliness, accuracy, cost, and user experience. The study indicates that the end-edge-cloud collaborative architecture can effectively reduce end-to-end moderation latency and the pressure of full-volume backhaul, and that, while balancing detection accuracy, system cost, and governance compliance, it offers an engineering-feasible and practically valuable solution for multimodal real-time content moderation on e-commerce live-streaming platforms.
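To make the “StreamID-Segment-Event” model and the L1/L2/L3 hierarchy more concrete, the following is a minimal Python sketch under stated assumptions. The field names, the `route` helper, and both thresholds are hypothetical illustrations and are not taken from the paper; the abstract does not specify how segments are escalated between tiers.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Tier(Enum):
    L1_FAST_FILTER = "L1"  # fast filtering
    L2_EDGE = "L2"         # edge-side fine-grained detection
    L3_CLOUD = "L3"        # cloud-side review


@dataclass
class Event:
    """One risk finding attached to a segment (hypothetical fields)."""
    modality: str   # e.g. "video", "audio", "text"
    label: str      # e.g. "false_advertising"
    score: float    # risk score in [0, 1]


@dataclass
class Segment:
    """A time slice of one live stream, keyed by its stream ID (hypothetical fields)."""
    stream_id: str
    index: int
    start_ms: int
    end_ms: int
    events: List[Event] = field(default_factory=list)


def route(segment: Segment,
          pass_below: float = 0.2,
          escalate_above: float = 0.7) -> Tier:
    """Return the deepest tier a segment needs, based on its peak risk score.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    peak = max((e.score for e in segment.events), default=0.0)
    if peak < pass_below:
        return Tier.L1_FAST_FILTER  # cleared by the fast filter
    if peak < escalate_above:
        return Tier.L2_EDGE         # ambiguous: run heavier models at the edge
    return Tier.L3_CLOUD            # high risk: escalate to cloud-side review
```

In this sketch only segments that reach `Tier.L3_CLOUD` would need to be sent upstream, which is one way to picture the reduced full-volume backhaul described in the abstract.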
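Cite this article: Chen Mao (陈茂). Study on End-Edge-Cloud Collaborative Multimodal Real-Time Content Moderation for E-Commerce Live Streaming [J]. E-Commerce Letters (电子商务评论), 2026, 15(5): 302-309. https://doi.org/10.12677/ecl.2026.155519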
