基于密度的线数据分组算法研究

魏海涛, 杜云艳, 许开辉, 吴笛, 易嘉伟, 莫洋, 刘张

doi:10.3724/SP.J.1047.2015.00538

地球信息科学学报 >

2015 , Vol. 17 >Issue 5: 538 - 546

DOI: https://doi.org/10.3724/SP.J.1047.2015.00538

本期要文(可全文下载)

基于密度的线数据分组算法研究

展开

1. 山东科技大学测绘科学与工程学院, 青岛266510;
2. 中国科学院地理科学与资源研究所资源与环境信息系统国家重点实验室, 北京100101

魏海涛(1979-),女,博士生,研究方向为海洋数据并行计算。E-mail:Weiht@lresi.ac.cn

收稿日期: 2014-12-26

修回日期: 2015-02-03

网络出版日期: 2015-05-10

基金资助

海洋公益性专项项目"海洋环境信息云计算与云服务体系框架应用研究"(201105033);海洋预报综合信息系统(MiFSIS)研究应用项目(201105017)。

收起

A Line Grouping Algorithm Based on Density

Expand

1. Shandong University of Science and Technology, Qingdao 266510, China;
2. LREIS, Institute of Geographic Sciences and Natural Resources Research, CAS, Beijing 100101, China

Received date: 2014-12-26

Revised date: 2015-02-03

Online published: 2015-05-10

Fold

摘要

目前,地理空间数据面临着由于数据量膨胀和计算量高速增长而引起算法效率低的问题,采用"分而治之"的数据分组策略提高运算效率已成为研究的热点。面向分布不均匀的线数据,本文提出了基于密度的线数据分组算法(简称LGAD)。首先,算法通过查找高密度区提取样本线段,保证了分组算法的起点落到高密区;其次,考虑线空间拓扑关系的复杂性,引用水平、垂直和夹角距离度量线段间距离,创建样本线段与其他线段的距离矩阵;最后,以距离矩阵和最优选择方法实现数据负载均衡分组。实验结果显示,对数据分组和分组后数据进行线段聚类的2个过程中,该算法体现了较好的时间优势,与串行计算相比,在分组数为2-12 时,平均比率达4.3,提高了应用的响应速度,具有较好的实际意义。

关键词： 分而治之; 负载均衡; 并行计算; 分布不均匀; 线数据分组

本文引用格式

魏海涛, 杜云艳, 许开辉, 吴笛, 易嘉伟, 莫洋, 刘张 . 基于密度的线数据分组算法研究[J]. 地球信息科学学报, 2015 , 17(5) : 538 -546 . DOI: 10.3724/SP.J.1047.2015.00538

Abstract

Parallel computing provides a promising solution to accelerate complicated spatial data processing, which is becoming increasingly computational intense. Partitioning large datasets into workload-balanced subgroups remains a challenge, particularly for unevenly distributed spatial data. In this study, a density-based data grouping algorithm was developed to tackle the partition problem for large line data. The algorithm includes three procedures: (1) extracting representative segment samples based on data density distribution; (2) generating a distance matrix between segment samples and the rest of the data by using three line distance measurements into calculations; (3) grouping line segments with data load balanced. Experiments show that the algorithm is able to partition large line data efficiently and evenly into equally sized sub-groups. The speed-up ratios of parallel interpolation save up to 65% of the execution time in comparison with consequential interpolation. A high efficiency of parallel computing was achieved when the datasets were divided into an optimal number of child data groups.

Key words： divide and rule; balanced workloads; parallel computing; unevenly distributed data; line segments

参考文献

[1] Guan Q, Clarke K C. A general-purpose parallel raster processing programming library test application using a geographic cellular automata model[J]. International Jour-nal of Geographical Information Science, 2010,24(5):695-722.
[2] Farber R. CUDA application design and development[M]. Waltham, MA: Elsevier Inc, 2011:18-19.
[3] Kim Y, Shim K, Kim M S, et al. DBCURE-MR: An efficient density-based clustering algorithm for large data using MapReduce[J]. Information Systems, 2014,42:15-35.
[4] Li L, Xi Y. Research on clustering algorithm and its parallelization strategy[C]. 2011 International Conference on Computational and Information Sciences (ICCIS), 2011: 325-328.
[5] He Y, Tan H, Luo W, et al. An efficient parallel densitybased clustering algorithm using MapReduce[C]. 2011 IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS), 2011:473-480.
[6] Abugov D. Oracle spatial partitioning: Best practices (an Oracle white paper)[M]. Redwood, CA: Oracle Inc, 2004.
[7] Meng L, Huang C, Zhao C, et al. An improved Hilbert curve for parallel spatial data partitioning[J]. Geo-spatial Information Science, 2007,10(4):282-286.
[8] Florida-San Román L. Proposition of two layered ionic structures, with xy disorder but z-ordered, in a quasi-liquid system[J]. Revista mexicana de física, 2006,52:208-210.
[9] Sarwat M, Mokbel M F, Zhou X, et al. Generic and efficient framework for search trees on flash memory storage systems[J]. GeoInformatica, 2013,17(3):417-448.
[10] Lee J G, Han J, Whang K Y. Trajectory clustering: A partition-and-group framework[C]. Proceedings of the 2007 ACM SIGMOD international conference on Management of data, 2007:593-694.
[11] Jonk A, Smeulders AW. An axiomatic approach to clustering line-segments[C]. Proceedings of the Third International Conference on Document Analysis and Recognition, 1995:386-389.
[12] Kelly A R, Hancock E R. Grouping-line segments using eigenclustering[C]. Proceedings of the British Machine Vision Conference, 2000:1-10.
[13] Thomas J C R. A new clustering algorithm based on kmeans using a line segment as prototype[C]. Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 2011:638-645.
[14] Gaffney S J, Robertson A W, Smyth P, et al. Probabilistic clustering of extratropical cyclones using regression mixture models[J]. Climate dynamics, 2007,29(4):423-440.
[15] Vinh N N, Le B. Simple spatial clustering algorithm based on R-tree[J]. Multi-disciplinary Trends in Artificial Intelligence,2012:236-245.
[16] Wang S, Armstrong M P. A quadtree approach to domain decomposition for spatial interpolation in grid computing environments[J]. Parallel Computing, 2003,29(10):1481-1504.
[17] 赵春宇, 孟令奎, 林志勇. 一种面向并行空间数据库的数据划分算法研究[J]. 武汉大学学报:信息科学版,2006,31 (11):962-965.
[18] Guttman A. R-trees: A dynamic index structure for spatial searching[J]. ACM, 1984(14):47-57.
[19] Jacob C, Maheswaran G, Santosh Kumar I. Clustering in Hilbert R trees: A study on spatial indexing in R trees[J]. Spatial Indexing using R Trees, 2013:1-3.
[20] Marshall P L, Region V F. Using line intersect sampling for coarse woody debris[R]. Nanaimo, Canada: Vancouver Forest Region, 2000:112-121.
[21] Thompson S K. Line-intercept sampling[A]. In: Sampling
[M]. Hoboken, NJ: John Wiley & Sons Inc, 2012:275-282.
[22] Fischer F, Kleinn C, Fehrmann L, et al. A national level forest resource assessment for Burkina Faso -a field based forest inventory in a semiarid environment combining small sample size with large observation plots[J]. Forest Ecol. Manage, 2011,262(8):1532-1540.
[23] Fehrmann L, Seidel D, Krause B, et al. Sampling for landscape elements—A case study from Lower Saxony, Germany[J]. Environmental monitoring and assessment, 2014,186(3):1421-1430.
[24] 全国统计专业技术资格考试用书编写委员会, 统计业务知识[M]. 北京:中国统计出版社,2011:245-272.
[25] Whittaker E T. On the functions which are represented by the expansions of the interpolation-theory[R]. Edinburgh, Scotland: Edinburgh University, 1915:181-194.
[26] 柳盛,吉根林,李文俊.一种基于连接度的空间线对象聚类算法[J].计算机科学,2011,38(8):179-181.
[27] Merkle R, Hellman M E. Hiding information and signatures in trapdoor knapsacks[C]. IEEE Transactions on Information Theory,1978,24(5):525-530.
[28] Nievergelt J H H, Sevcik K C. The grid file: An adaptable, symmetric multikey file structure ACM Trans[J]. Database Syst, 1984,9:38-71.
[29] Dai B R, Lin I C. Efficient map/reduce-based DBSCAN algorithm with optimized data partition[C]. 2012 IEEE 5th International Conference on Cloud Computing (CLOUD), 2012:59-66.
[30] Xu X, Jäger J, Kriegel H-P. A fast parallel clustering algorithm for large spatial databases[M]. High Performance Data Mining. Springer, 2002:263-290.

Options

文章导航

模态框（Modal）标题

摘要

本文引用格式

Abstract

参考文献