基于Hilbert空间分区和Geohash索引的并行Ripley's K函数
亢扬箫(1995— ),男,河南焦作人,硕士,主要从事时空大数据计算与应用相关研究。E-mail: kangyangxiao@foxmail.com |
收稿日期: 2021-08-04
修回日期: 2021-08-27
网络出版日期: 2022-03-25
基金资助
国家重点研发计划项目(2018YFC0809806)
国家重点研发计划项目(2017YFB0503704)
国家自然科学基金项目(41971349)
国家自然科学基金项目(42090010)
版权
Parallel Ripley's K Function based on Hilbert Spatial Partitioning and Geohash Indexing
Received date: 2021-08-04
Revised date: 2021-08-27
Online published: 2022-03-25
Supported by
National Key Research and Development Program of China(2018YFC0809806)
National Key Research and Development Program of China(2017YFB0503704)
National Natural Science Foundation of China(41971349)
National Natural Science Foundation of China(42090010)
Copyright
作为二阶点模式分析方法,Ripley's K函数(简称K函数)以距离为自变量探测不同尺度下点事件的分布模式及演变规律,在生态学、经济学、地理学等诸多领域得到广泛应用。然而,随着点规模的增加,估计与模拟阶段点对距离遍历计算时间开销激增,严重制约了K函数的应用,算法流程优化与并行加速成为应对海量点数据下K函数性能瓶颈及可计算性问题的关键技术手段。针对默认数据分区未考虑点事件空间邻近性导致跨节点通讯成本高昂且K函数距离阈值较大时索引优化失效的现象,本文提出一种基于空间填充曲线的K函数优化加速方法。该方法采用Hilbert曲线构建空间分区,在顾及数据空间邻近性的前提下减少分区间数据倾斜和通讯开销;在分区基础上,利用Geohash编码改进各分区内本地空间索引策略加速点对距离计算。本文以湖北省工商企业注册数据为例,通过对比实验分析了默认分区无索引、KDB分区组合R树索引、本文Hilbert分区组合Geohash索引算法在不同数据规模、距离阈值、集群规模下的计算耗时。结果表明,300 000点数据规模下本文方法的时间开销约为默认分区无索引方法的1/4,9台节点下加速比超过3.6倍。因此,该方法能有效提升分布式环境下K函数计算性能并具有良好的可伸缩性,可为其他点模式分析方法的优化提供参考。
关键词: Ripley’s K函数; 分布式计算; Apache Spark; 高性能地理计算; Hilbert曲线; Geohash编码; 点模式分析; 空间填充曲线
亢扬箫 , 桂志鹏 , 丁劲宸 , 吴京航 , 吴华意 . 基于Hilbert空间分区和Geohash索引的并行Ripley's K函数[J]. 地球信息科学学报, 2022 , 24(1) : 74 -86 . DOI: 10.12082/dqxxkx.2022.210457
As a second-order analysis method of spatial point patterns, Ripley's K function (K function for short) uses distance as an independent variable to detect the distribution patterns of points under different spatial scales, which has been widely used in distinct fields such as ecology, economics, and geography. However, the applications of K function are limited due to the sharply increased computational cost of nested traversals on the point-pair distance measurements in both estimation and simulation phases when the point size getting larger. Therefore, the optimization of algorithm workflow and parallel acceleration have become the key technologies for tackling the performance bottleneck and computability problem of K function. Among these solutions, hash-based partitioning has been widely adopted in parallel computing frameworks for enabling data decomposition, while R-tree indexes have been proposed to reduce the computational cost of point-pair distance measurements by using spatial query instead. However, default hash-based partitioning methods ignore the spatial proximity of data distributions, while R-tree indexes fail to save query time of neighboring points under large spatial distance threshold comparing with pointwise distance calculation. In order to address these issues, this paper proposes an acceleration method for K function based on the space filling curves. Specifically, the Hilbert curve is adopted to achieve spatial partitioning, which reduces the data tilt and communication cost between partitions by better considering the spatial proximity. Upon the partition result, local indexing based on Geohash code is further developed to improve the spatial indexing strategy, which embeds spatial information in codes for achieving quick distance filtering, in turn accelerates the pointwise distance measurements. To verify the effectiveness of the proposed method, it is compared with two optimization methods adopted in previous studies, i.e., default partition without indexing, and KDB-tree partition with R-tree indexing, by analyzing the calculation time of K function for point of interests (POIs) of enterprise registration data in Hubei province, China under different data sizes, spatial distance thresholds, and computing nodes in a computing cluster. Experimental results indicate that the time cost of the proposed method is about 1/4 of that for default partition without indexing under the data scale of 300 000 points. Besides, the speedup ratio is larger than 3.6 times under 9 nodes. Therefore, the proposed method can improve performance of K function effectively in a distributed environment and has a promising scalability and could provide a reference for accelerating other spatial patterns analysis algorithms as well.
图7 不同分区策略下K函数计算时间随分区规模增大的对比Fig. 7 Comparison of execution times of Ripley's K functions equipped with different partition methods as the partition size increases |
图9 不同分区策略下K函数计算时间及分区长宽比对比Fig. 9 Comparison of execution times of Ripley's K functions and the length-width ratios of partitions for different partition methods |
[1] |
周成虎. 点模式分析[J]. 地理科学进展, 1989, 8(2):8-11.
[
|
[2] |
|
[3] |
|
[4] |
李智, 李卫红. 点模式条件下的犯罪嫌疑人时空同现模式挖掘与分析[J]. 地球信息科学学报, 2018, 20(6):827-836.
[
|
[5] |
曾璇, 崔海山, 刘毅华. 基于网络空间点模式的餐饮店空间格局分析[J]. 地球信息科学学报, 2018, 20(6):837-843.
[
|
[6] |
李清泉, 李德仁. 大数据GIS[J]. 武汉大学学报·信息科学版, 2014, 39(6):641-644.
[
|
[7] |
崔邹森. 基于交叉Ripley’s K函数的点模式Web可视化分析系统设计与实现[D]. 武汉:武汉大学, 2020.
[
|
[8] |
|
[9] |
|
[10] |
王源. 面向海量点模式分析的时空Ripley's K函数优化与加速[D]. 武汉:武汉大学, 2019.
[
|
[11] |
向隆刚, 王德浩, 龚健雅. 大规模轨迹数据的Geohash编码组织及高效范围查询[J]. 武汉大学学报·信息科学版, 2017, 42(1):21-27.
[
|
[12] |
|
[13] |
|
[14] |
左尧, 王少华, 钟耳顺, 等. 高性能GIS研究进展及评述[J]. 地球信息科学学报, 2017, 19(4):437-446.
[
|
[15] |
邱强, 秦承志, 朱效民, 等. 全空间下并行矢量空间分析研究综述与展望[J]. 地球信息科学学报, 2017, 19(9):1217-1227.
[
|
[16] |
|
[17] |
高旭, 桂志鹏, 隆玺, 等. KDSG-DBSCAN:一种基于K-D Tree和Spark GraphX的高性能DBSCAN算法[J]. 地理与地理信息科学, 2017, 33(6):1-7,封3.
[
|
[18] |
|
[19] |
|
[20] |
李绍俊, 钟耳顺, 王少华, 等. 基于状态转移矩阵的Hilbert码快速生成算法[J]. 地球信息科学学报, 2014, 16(6):846-851.
[
|
[21] |
崔登吉. 空间分布模式驱动的空间数据组织与索引研究[D]. 南京:南京师范大学, 2016.
[
|
[22] |
|
[23] |
向隆刚, 高萌, 王德浩, 等. Geohash-Trees:一种用于组织大规模轨迹的自适应索引[J]. 武汉大学学报·信息科学版, 2019, 44(3):436-442.
[
|
[24] |
|
[25] |
|
[26] |
|
[27] |
湛东升, 张文忠, 党云晓, 等. 北京市公共服务设施空间集聚特征分析[J]. 经济地理, 2018, 38(12):76-82.
[
|
[28] |
|
[29] |
|
[30] |
Google. S2 Geometry[EB/OL]. http://s2geometry.io/, 2019-12-18.
|
[31] |
|
/
〈 |
|
〉 |