Research on the Impact of Different Feature Stream Data on Flink Performance
Abstract: The performance of a stream processing engine depends on how its global event time is configured. To explore the relationship between stream processing and global event time, this paper takes as its starting point the delay tolerance of the watermark (WaterMark), the global event-time mechanism of the stream processing engine Flink, and designs a Flink-based data stream processing pipeline that transforms and processes stream data. Stream data with different characteristics are fed into this pipeline, and statistical methods are used to study Flink's accuracy, processing latency, throughput, and other performance metrics under different delay tolerance values. On this basis, a method for setting the delay tolerance for different kinds of stream data is proposed. Experiments show that the method effectively improves the accuracy with which the engine processes out-of-order stream data and reduces latency.
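The trade-off named in the abstract, accuracy versus latency as the watermark's delay tolerance changes, can be illustrated with a small simulation. The sketch below is plain Python, not Flink code and not the authors' pipeline. The function name `run_tumbling_windows`, the window size, and the sample events are all assumptions made for illustration. It mimics a bounded-out-of-orderness watermark (in Flink, the comparable strategy is `WatermarkStrategy.forBoundedOutOfOrderness`), with the simplification that the watermark advances on every event rather than periodically.

```python
from collections import defaultdict


def run_tumbling_windows(events, window_size, delay_tolerance):
    """Simulate event-time tumbling windows with a bounded-out-of-orderness watermark.

    events: list of (event_time, value) pairs in ARRIVAL order (hypothetical data).
    Returns (results, dropped): results maps window start -> sum of values,
    dropped counts events that arrived after their window had already closed.
    """
    open_windows = defaultdict(int)   # window start -> running sum
    results = {}                      # fired windows
    dropped = 0
    watermark = float("-inf")

    for event_time, value in events:
        start = (event_time // window_size) * window_size
        # An event is late if the watermark has already passed its window end.
        if start + window_size <= watermark:
            dropped += 1
            continue
        open_windows[start] += value
        # Watermark = largest event time seen so far minus the delay tolerance.
        watermark = max(watermark, event_time - delay_tolerance)
        # Fire every open window whose end is at or before the watermark.
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                results[s] = open_windows.pop(s)

    # End of stream: flush whatever is still open.
    for s in sorted(open_windows):
        results[s] = open_windows.pop(s)
    return results, dropped


# Hypothetical out-of-order stream: events 8 and 19 arrive after newer events.
events = [(1, 1), (4, 1), (12, 1), (8, 1), (15, 1), (23, 1), (19, 1)]

print(run_tumbling_windows(events, window_size=10, delay_tolerance=0))
# ({0: 2, 10: 2, 20: 1}, 2)
print(run_tumbling_windows(events, window_size=10, delay_tolerance=5))
# ({0: 3, 10: 3, 20: 1}, 0)
```

With a tolerance of 0, window [0, 10) fires as soon as the event at time 12 arrives, so the late event at time 8 is dropped and the sum is wrong. With a tolerance of 5, the same window waits until the event at time 15 arrives, which yields the correct sum but delays the result. The same tension between accuracy and latency is what the proposed setting method aims to balance.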
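Citation: Shi Guohuan, Song Ji, Li Jianghua. Research on the Impact of Different Feature Stream Data on Flink Performance[J]. Computer Science and Application, 2022, 12(11): 2599-2607. https://doi.org/10.12677/CSA.2022.1211264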
