Flying Streaming: A Platform for Real Time Multidimensional Statistical Analytics of Large-Scale Log Data
DOI: 10.12677/CSA.2017.74043
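Authors: Hongbo Zhao*, Hua Qin: Faculty of Information Technology, Beijing University of Technology, Beijing; Jianbo Zhao: Beijing 58 Information Technology Co., Ltd., Beijing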
Keywords: Storm; Large-Scale Log Data; Real-Time Analytics; Multidimensional Statistical Analytics; General Platform
Abstract: Daily log growth at the terabyte scale is now common among internet companies in China, and real-time multidimensional statistical analysis of large-scale log data is increasingly important for enterprise operation, management, and decision-making. However, current technologies for analyzing and processing large-scale log data demand specialized expertise, which the business and operations departments, whose data-processing needs are the most urgent, rarely have. This paper integrates open-source systems including Flume, Kafka, Storm, and HBase to design Flying Streaming (飞流), a platform for real-time multidimensional statistical analysis of large-scale log data. The platform addresses several key issues: ingesting multiple kinds of log data, performing real-time multidimensional statistical analysis, and letting users submit, update, and delete tasks through configuration rather than big-data programming. As a result, users can run real-time multidimensional statistical analysis of large-scale log data on the platform without writing code. In practical use in an internet enterprise, Flying Streaming has performed well and has met most of the multidimensional log-analysis needs of the business and operations departments.
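Cite this article: Zhao, H.B., Qin, H. and Zhao, J.B. (2017) Flying Streaming: A Platform for Real Time Multidimensional Statistical Analytics of Large-Scale Log Data. Computer Science and Application, 7(4), 351-358. https://doi.org/10.12677/CSA.2017.74043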
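
To make "tasks submitted by configuration instead of programming" and "multidimensional statistical analysis" more concrete, here is a minimal single-process sketch in Python. It is not the platform's implementation, which the abstract says is built on Flume, Kafka, Storm, and HBase. The configuration schema (`TASK_CONFIG`), the field names, and the sample records are all hypothetical. The sketch only shows how one declarative list of dimensions can drive counting over every combination of those dimensions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical task configuration: what a user might submit instead of code.
TASK_CONFIG = {"dimensions": ["city", "category"]}


def dimension_subsets(dimensions):
    """Yield every subset of the dimensions, including the empty one (grand total)."""
    for size in range(len(dimensions) + 1):
        yield from combinations(dimensions, size)


def aggregate(records, config):
    """Count records under every combination of the configured dimensions."""
    counts = Counter()
    for record in records:
        for dims in dimension_subsets(config["dimensions"]):
            # The key is a tuple of (dimension, value) pairs, e.g. (("city", "Beijing"),)
            key = tuple((d, record[d]) for d in dims)
            counts[key] += 1
    return counts


# Made-up parsed log records, for illustration only.
records = [
    {"city": "Beijing", "category": "jobs"},
    {"city": "Beijing", "category": "cars"},
    {"city": "Shanghai", "category": "jobs"},
]
counts = aggregate(records, TASK_CONFIG)
```

After this runs, `counts[()]` holds the grand total and `counts[(("city", "Beijing"),)]` holds the count for a single dimension value. Changing the analysis means editing `TASK_CONFIG` rather than the code, which is the idea the abstract describes. A streaming system would update such counters incrementally per event and persist them instead of holding them in memory.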
