Research on a TMS Data Error Correction Method Based on a Spark Distributed Support Vector Machine
Abstract:
The communication management system (TMS) of the smart grid generates large volumes of data that still await analysis. These data suffer from inconsistencies between accounting records and physical assets, data-entry errors, and missing values. Building on a distributed support vector machine training algorithm that runs on the Hadoop distributed cluster framework and the Spark general-purpose parallel computing platform, this paper proposes a method for detecting and analyzing abnormal data in the site maintenance counts recorded in TMS. The method takes a set of attributes, typified by site type, as features and uses a model built with the support vector machine algorithm to predict and rate each site. Sites whose records are abnormal are flagged so that the relevant personnel can investigate them. The method is validated experimentally.
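The workflow summarized in the abstract (train an SVM on site features, predict a rating for each site, flag the sites whose recorded value disagrees) can be sketched as follows. This is a minimal single-machine illustration in plain Python, not the paper's Spark implementation. The per-partition gradient computation followed by a summation only imitates the map and reduce pattern of a distributed trainer. The two scaled features, the binary labels (+1 for a "frequent maintenance" rating, -1 for "infrequent"), the site identifiers, and all hyperparameters are hypothetical.

```python
# Illustrative single-machine sketch; NOT the paper's Spark implementation.
# All data, feature meanings, and hyperparameters below are hypothetical.
from typing import List, Tuple

Sample = Tuple[List[float], int]  # (scaled features, label in {-1, +1})


def partition_gradient(part: List[Sample], w: List[float], b: float):
    """'Map' step: summed hinge-loss subgradient over one data partition."""
    gw, gb = [0.0] * len(w), 0.0
    for x, y in part:
        margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
        if margin < 1.0:  # only margin violators contribute
            for j, xj in enumerate(x):
                gw[j] -= y * xj
            gb -= y
    return gw, gb


def train_linear_svm(partitions, dim, lam=0.01, lr=0.1, epochs=300):
    """Full-batch subgradient descent on the L2-regularized hinge loss."""
    w, b = [0.0] * dim, 0.0
    n = sum(len(p) for p in partitions)
    for _ in range(epochs):
        grads = [partition_gradient(p, w, b) for p in partitions]  # map
        # 'Reduce' step: sum the partition results, then take one update step.
        gw = [sum(g[0][j] for g in grads) / n + lam * w[j] for j in range(dim)]
        gb = sum(g[1] for g in grads) / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b


def predict(x, w, b):
    """Predicted rating: +1 or -1, from the sign of the decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def flag_abnormal(records, w, b):
    """Return IDs of sites whose recorded rating disagrees with the model."""
    return [sid for sid, x, y in records if predict(x, w, b) != y]


# Hypothetical training data, already split into two partitions.
partitions = [
    [([0.9, 0.8], 1), ([0.1, 0.2], -1), ([1.0, 0.7], 1), ([0.0, 0.3], -1)],
    [([0.8, 0.9], 1), ([0.2, 0.1], -1), ([0.7, 1.0], 1), ([0.3, 0.0], -1)],
]
w, b = train_linear_svm(partitions, dim=2)

# Hypothetical records to audit: (site ID, features, recorded rating).
records = [
    ("S1", [0.9, 0.95], 1),
    ("S2", [0.15, 0.1], -1),
    ("S3", [0.85, 0.9], -1),  # features resemble the +1 class
    ("S4", [0.05, 0.2], -1),
]
flagged = flag_abnormal(records, w, b)
print(flagged)
```

On a cluster, the list comprehension over `partitions` would presumably become a map over the partitions of a distributed dataset, followed by an aggregation. Spark ML also provides a linear SVM classifier (`LinearSVC`) that could replace the hand-written trainer. Whether the paper uses either approach is not stated in the abstract.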