地球信息科学学报


基于Hilbert空间分区和Geohash索引的并行Ripley's K函数

亢扬箫1,2, 桂志鹏1,3,*, 丁劲宸1, 吴京航1, 吴华意3   

  1. 武汉大学遥感信息工程学院,武汉430079;
    2.重庆市地理信息与遥感应用中心,重庆401147;
    3.武汉大学测绘遥感信息工程国家重点实验室,武汉430079
  • 收稿日期:2021-08-04 修回日期:2021-08-27
  • 作者简介:亢扬箫(1995— ),男,河南焦作人,硕士,主要从事时空大数据计算与应用相关研究。E-mail: kangyangxiao@foxmail.com
  • 基金资助:
    国家重点研发计划项目(2018YFC0809806、2017YFB0503704);国家自然科学基金项目(41971349、42090010)

Parallel Ripley's K Function based on Hilbert Spatial Partitioning and Geohash Indexing

KANG Yangxiao1,2, GUI Zhipeng1,3,*, DING Jinchen1, WU Jinghang1, WU Huayi3   

  1. School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China;
    2. Chongqing Geomatics and Remote Sensing Center, Chongqing 401147, China;
    3. State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
  • Received:2021-08-04 Revised:2021-08-27
  • Contact: *GUI Zhipeng, E-mail: zhipeng.gui@whu.edu.cn
  • Supported by:
    National Key Research and Development Program of China, No.2018YFC0809806, 2017YFB0503704; National Natural Science Foundation of China, No.41971349, 42090010.

摘要: 作为二阶点模式分析方法,Ripley's K函数(简称K函数)以距离为自变量探测不同尺度下点事件的分布模式及演变规律,在生态学、经济学、地理学等诸多领域得到广泛应用。然而,随着点规模的增加,估计与模拟阶段点对距离遍历计算时间开销激增,严重制约了K函数的应用,算法流程优化与并行加速成为应对海量点数据下K函数性能瓶颈及可计算性问题的关键技术手段。针对默认数据分区未考虑点事件空间邻近性导致跨节点通讯成本高昂且K函数距离阈值较大时索引优化失效的现象,本文提出一种基于空间填充曲线的K函数优化加速方法。该方法采用Hilbert 曲线构建空间分区,在顾及数据空间邻近性的前提下减少分区间数据倾斜和通讯开销;在分区基础上,利用Geohash 编码改进各分区内本地空间索引策略加速点对距离计算。本文以湖北省工商企业注册数据为例,通过对比实验分析了默认分区无索引、KDB分区组合R树索引、本文Hilbert分区组合Geohash 索引算法在不同数据规模、距离阈值、集群规模下的计算耗时。结果表明,300 000 点数据规模下本文方法的时间开销约为默认分区无索引方法的1/4,9 台节点下加速比超过3.6 倍。因此,该方法能有效提升分布式环境下K函数计算性能并具有良好的可伸缩性,可为其他点模式分析方法的优化提供参考。
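上述估计阶段的点对距离遍历可用如下朴素实现示意(仅为说明性草图:采用欧氏距离、不含边界校正,函数与变量名均为示例假设,并非论文原实现):

```python
import math

def ripley_k(points, t, area):
    """朴素 Ripley's K 估计(无边界校正):
    K(t) = area / n^2 * (距离不超过 t 的有序点对数)。
    双重循环为 O(n^2),即点规模增大时需并行优化的性能瓶颈。"""
    n = len(points)
    count = 0
    for i in range(n):
        xi, yi = points[i]
        for j in range(n):
            if i == j:
                continue
            xj, yj = points[j]
            if math.hypot(xi - xj, yi - yj) <= t:
                count += 1
    return area * count / (n * n)
```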

关键词: Ripley's K函数, 分布式计算, Apache Spark, 高性能地理计算, Hilbert曲线, Geohash 编码, 点模式分析, 空间填充曲线

Abstract: As a second-order method for spatial point pattern analysis, Ripley's K function (K function for short) uses distance as an independent variable to detect the distribution patterns of point events at different spatial scales, and has been widely used in fields such as ecology, economics, and geography. However, applications of the K function are limited by the sharply increasing computational cost of the nested point-pair distance measurements in both the estimation and simulation phases as the number of points grows. Optimization of the algorithm workflow and parallel acceleration have therefore become key techniques for tackling the performance bottleneck and computability problems of the K function on massive point data. Among existing solutions, hash-based partitioning is widely adopted in parallel computing frameworks for data decomposition, while R-tree indexes have been proposed to replace pointwise distance calculation with spatial queries. However, default hash-based partitioning ignores the spatial proximity of the data distribution, and R-tree indexes fail to save query time for neighboring points under large distance thresholds compared with pointwise distance calculation. To address these issues, this paper proposes an acceleration method for the K function based on space filling curves. Specifically, the Hilbert curve is adopted for spatial partitioning, which reduces data skew and communication cost between partitions by taking spatial proximity into account. On top of the partitioning result, a local index based on Geohash codes is further developed to improve the per-partition indexing strategy: spatial information embedded in the codes enables quick distance filtering, which in turn accelerates the pointwise distance measurements.
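As an illustrative sketch (not the paper's implementation; the function names, grid order, and equal-count splitting rule are assumptions), Hilbert-based partitioning can be realized by mapping each point to a cell on a 2^k x 2^k grid, computing the cell's Hilbert index, and cutting the index-sorted point sequence into equal-count ranges:

```python
def xy2d(order, x, y):
    """Map grid cell (x, y) to its 1D index on a Hilbert curve of the given
    order (the grid is 2^order x 2^order cells). Nearby cells tend to get
    nearby indices, which is what makes the partitions spatially compact."""
    d = 0
    s = 2 ** (order - 1)
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate the quadrant so the sub-curve is oriented correctly
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_partitions(points, bounds, order, n_parts):
    """Assign each (x, y) point a partition id by its Hilbert-index rank,
    so partitions are balanced in size and compact in space."""
    xmin, ymin, xmax, ymax = bounds
    cells = 2 ** order
    def key(p):
        cx = min(int((p[0] - xmin) / (xmax - xmin) * cells), cells - 1)
        cy = min(int((p[1] - ymin) / (ymax - ymin) * cells), cells - 1)
        return xy2d(order, cx, cy)
    ranked = sorted(range(len(points)), key=lambda i: key(points[i]))
    size = -(-len(points) // n_parts)  # ceiling division
    part = [0] * len(points)
    for rank, i in enumerate(ranked):
        part[i] = rank // size
    return part
```

Because the Hilbert curve preserves spatial proximity well, points that fall into the same index range tend to be spatially adjacent, which reduces cross-partition neighbor queries and the associated communication cost.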
To verify its effectiveness, the proposed method is compared with two optimization approaches adopted in previous studies, i.e., default partitioning without indexing and KDB-tree partitioning with R-tree indexing, by analyzing the computation time of the K function on Points of Interest (POIs) from enterprise registration data of Hubei province, China, under different data sizes, distance thresholds, and numbers of computing nodes in a cluster. Experimental results indicate that the time cost of the proposed method is about 1/4 of that of default partitioning without indexing at a data scale of 300 000 points, and the speedup exceeds 3.6 on 9 nodes. Therefore, the proposed method effectively improves the performance of the K function in a distributed environment, exhibits promising scalability, and can serve as a reference for accelerating other point pattern analysis algorithms.
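The Geohash-based local filtering can likewise be sketched generically; the encoder below is standard Geohash, while the prefix-grouping idea in the usage note is an assumed simplification of the paper's indexing strategy rather than its exact implementation:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard Geohash alphabet

def geohash_encode(lat, lon, precision=6):
    """Standard Geohash: alternately bisect the longitude and latitude
    intervals (longitude first) and emit 5 bits per base-32 character.
    Points sharing a code prefix lie in the same grid cell, so prefix
    comparison gives a cheap distance pre-filter before exact measurement."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    code = []
    even = True  # True -> longitude bit, False -> latitude bit
    bit_count = ch = 0
    while len(code) < precision:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                ch = (ch << 1) | 1
                lon_lo = mid
            else:
                ch <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                ch = (ch << 1) | 1
                lat_lo = mid
            else:
                ch <<= 1
                lat_hi = mid
        even = not even
        bit_count += 1
        if bit_count == 5:  # flush every 5 bits as one character
            code.append(BASE32[ch])
            bit_count = ch = 0
    return "".join(code)
```

Grouping points within a partition by a code prefix whose cell size matches the distance threshold restricts candidate pairs to the same cell (plus, in practice, its eight neighboring cells when the threshold spans cell borders), avoiding exhaustive pointwise comparison.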

Key words: Ripley's K function, distributed computing, Apache Spark, high performance geocomputation, Hilbert curve, Geohash, point pattern analysis, space filling curves