# Extraction and Analysis of Hotspot Region of Parallel Taxi Trajectory Based on Spark

Abstract: Taxi GPS trajectory data contains rich information about residents' travel patterns, but the growing volume of data places new demands on the accuracy and efficiency of data mining. This paper takes taxi GPS trajectory data from Chengdu as its research object. First, distorted records and redundant fields are removed from the raw data, data from certain time periods is filtered out, and the remaining points are map-matched. Then K-means|| is implemented on the Spark big data processing platform, and working days and rest days are analyzed separately to obtain the hotspot areas of Chengdu residents and their spatio-temporal distribution characteristics. Finally, the performance of K-means and K-means|| is compared; the results show that K-means|| is superior in accuracy and time efficiency to the single-machine implementation.

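1. Introduction

2. Data Preprocessing

2.1. Removing Distorted Data

2.2. Deleting Redundant Fields

2.3. Filtering Data from Certain Time Periods

Between 00:00:00 and 05:59:59, taxis are largely out of service, so trajectory data from this period has no reference value for extracting residents' peak travel hours or for mining and analyzing urban hotspot regions. The trajectory data from this period is therefore deleted.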
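The time filter described above can be sketched as a simple predicate on each record's timestamp. This is a minimal single-machine illustration in plain Python, not the paper's Spark code; the record layout, the timestamp format, the sample values, and the names `in_service_hours` and `SERVICE_START` are all assumptions made for the example.

```python
from datetime import datetime, time

# Assumed cut-off: records before 06:00:00 fall inside the discarded window
# 00:00:00-05:59:59 described in the text.
SERVICE_START = time(6, 0, 0)


def in_service_hours(timestamp: str) -> bool:
    """Return True if the record lies outside 00:00:00-05:59:59."""
    # The "YYYY-MM-DD HH:MM:SS" format is an assumption about the raw data.
    t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").time()
    return t >= SERVICE_START


# Hypothetical records: (taxi_id, timestamp, longitude, latitude).
records = [
    ("T001", "2014-08-04 05:59:59", 104.06, 30.67),
    ("T001", "2014-08-04 06:00:00", 104.07, 30.66),
    ("T002", "2014-08-04 23:15:30", 104.05, 30.65),
]

# Keep only the records recorded from 06:00:00 onward.
kept = [r for r in records if in_service_hours(r[1])]
```

On a distributed dataset, the same predicate would typically be passed to a filter transformation so that each partition is cleaned in parallel.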

3. Map Matching

$S = q\sum_{i=0}^{n} D_{i}$ (1)

(2)

4. Basic Principles and Methods

Figure 1. Comparison of travel volume of residents in each period

Figure 2. Comparison before and after map matching. (a) Before map matching; (b) After map matching

$\begin{aligned}\text{SqDist} &= \left(\sqrt{a_{1}^{2}+b_{1}^{2}}-\sqrt{a_{2}^{2}+b_{2}^{2}}\right)^{2}\\ &= a_{1}^{2}+b_{1}^{2}+a_{2}^{2}+b_{2}^{2}-2\sqrt{\left(a_{1}^{2}+b_{1}^{2}\right)\left(a_{2}^{2}+b_{2}^{2}\right)}\end{aligned}$ (3)
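Equation (3) is the squared difference of the norms of two points $(a_1, b_1)$ and $(a_2, b_2)$, which by the reverse triangle inequality never exceeds their squared Euclidean distance. One plausible reading, stated here as an assumption because the surrounding text is not available, is that SqDist acts as a cheap lower bound that lets the nearest-center search skip exact distance computations. The sketch below illustrates that idea in plain Python; the function names and sample points are hypothetical.

```python
import math


def sq_dist_lower_bound(p, q):
    """Equation (3): squared difference of the Euclidean norms of p and q."""
    norm_p = math.hypot(p[0], p[1])
    norm_q = math.hypot(q[0], q[1])
    return (norm_p - norm_q) ** 2


def sq_dist(p, q):
    """Exact squared Euclidean distance between 2-D points p and q."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def nearest_center(point, centers):
    """Find the closest center, skipping centers the bound rules out."""
    best_index, best_dist = -1, math.inf
    for i, center in enumerate(centers):
        # If even the lower bound is not below the best distance so far,
        # the exact distance cannot be smaller either, so skip this center.
        if sq_dist_lower_bound(point, center) < best_dist:
            d = sq_dist(point, center)
            if d < best_dist:
                best_index, best_dist = i, d
    return best_index, best_dist


# Hypothetical cluster centers and query point.
centers = [(0, 0), (10, 10), (3, 4)]
index, dist = nearest_center((3, 5), centers)
```

In this example the center `(10, 10)` is discarded by the bound alone, so only two exact distances are evaluated.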

5. Experiments and Analysis

Table 1. K-Means|| detailed parameters

5.1. Urban Hotspot Extraction

$dh_{i}=\frac{n_{i}}{m}$ (4)
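Equation (4) can be illustrated with a short calculation. The symbol definitions are not given in the text available here, so the sketch assumes that $n_i$ is the number of trajectory points assigned to cluster $i$ and $m$ is the total number of points; under that reading $dh_i$ is each cluster's share of all points. The function name and the label list are hypothetical.

```python
from collections import Counter


def hotspot_shares(labels):
    """Compute dh_i = n_i / m for every cluster label in `labels`.

    Assumes n_i is the number of points in cluster i and m is the
    total number of points (an interpretation, not stated in the source).
    """
    m = len(labels)
    counts = Counter(labels)
    return {cluster: n_i / m for cluster, n_i in counts.items()}


# Hypothetical cluster assignments for ten points.
labels = [0, 0, 0, 1, 1, 2, 0, 1, 0, 2]
shares = hotspot_shares(labels)
```

Ranking clusters by this ratio gives one simple way to decide which clusters count as hotspots.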

5.2. Performance Analysis of the K-Means|| Algorithm

Table 2. Distribution of morning peak hotspots on August 4

Figure 3. Distribution of hotspots during peak hours on workdays. (a) Morning peak distribution; (b) Midday peak distribution; (c) Evening peak distribution

Figure 4. Distribution of hotspots during peak hours on weekends. (a) Midday peak distribution; (b) Evening peak distribution

Figure 5. Comparison of running time with different numbers of nodes

Figure 6. Speedup ratio with different numbers of nodes

$sq = t_{1}/t_{n}$ (5)
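Equation (5) is the usual speedup ratio: the running time on one node divided by the running time on $n$ nodes. A minimal sketch of the calculation follows; the timings are made-up placeholder values, not the paper's measurements, and the function name is hypothetical.

```python
def speedup(t1: float, tn: float) -> float:
    """Equation (5): sq = t_1 / t_n."""
    if tn <= 0:
        raise ValueError("running time must be positive")
    return t1 / tn


# Made-up running times in seconds, keyed by number of nodes.
# These are illustrative only and do not come from the experiments.
timings = {1: 120.0, 2: 70.0, 3: 50.0, 4: 40.0}

# Speedup of each configuration relative to the single-node run.
ratios = {n: speedup(timings[1], t) for n, t in timings.items()}
```

A ratio close to $n$ would indicate near-linear scaling, while a flatter curve suggests that communication or scheduling overhead dominates as nodes are added.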

6. Conclusion

