Journal of Geo-information Science, 2019, Vol. 21, Issue (1): 128-136. doi: 10.12082/dqxxkx.2019.180221
• Methods and Applications of Spatio-temporal Pattern Mining in Geographic Big Data •
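Miaoxin PAN1,2,3, Jiaxiang LIN4, Chongcheng CHEN1,*, Xiaoyan YE1
Received: 2018-05-03
Revised: 2018-07-03
Published: 2019-01-20
Online: 2019-01-20
Corresponding author: Chongcheng CHEN
E-mail: pan_miaoxin@qq.com; chencc@fzu.edu.cn
About the author: Miaoxin PAN (born 1987), female, PhD candidate; her research focuses on big data mining and cloud computing.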
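Abstract:
Spatial outlier mining identifies spatial objects whose non-spatial attribute values differ markedly from those of the other spatial objects in their neighborhood. As the volume of spatial data grows rapidly, traditional centralized processing runs into single-machine performance bottlenecks and poor scalability, and it can no longer meet application needs. This paper therefore builds on the Spark parallel computing framework, exploiting its fast in-memory computation and its scalability, to propose a parallel spatial outlier mining algorithm and a prototype system based on Spark and the constrained spatial outlier mining algorithm (C-SOM). With C-SOM at its core, the parallel algorithm runs C-SOM concurrently on multiple compute nodes over the global dataset and over each local dataset, producing global outliers and local outliers. The lightweight prototype system implements the parallel algorithm on Spark, adopts a Browser/Server architecture, and gives users a simple, practical visual interface. Finally, outlier analyses of soil chemical element survey data from the southeastern coast of Fujian Province and of synthetic data verify the soundness, effectiveness, and efficiency of the parallel algorithm and the prototype system.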
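The global/local scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the detector `outlier_factors` is a simple k-nearest-neighbor stand-in and is not C-SOM, a thread pool stands in for Spark compute nodes, and every name here (`outlier_factors`, `top_outliers`, `parallel_mine`, `demo_dataset`, the `west`/`east` regions) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pstdev


def outlier_factors(points, k=3):
    """Stand-in detector, NOT C-SOM (assumption for illustration only).

    Each point is (id, x, y, value). The score is the standardized difference
    between a point's value and the mean value of its k nearest spatial
    neighbors. Assumes the dataset holds more than k points.
    """
    diffs = {}
    for pid, x, y, value in points:
        nearest = sorted(
            (p for p in points if p[0] != pid),
            key=lambda p: (p[1] - x) ** 2 + (p[2] - y) ** 2,
        )[:k]
        diffs[pid] = value - mean(p[3] for p in nearest)
    mu = mean(diffs.values())
    sigma = pstdev(diffs.values()) or 1.0  # avoid division by zero
    return {pid: abs(d - mu) / sigma for pid, d in diffs.items()}


def top_outliers(points, n=1, k=3):
    """Return the n highest-scoring (id, score) pairs of one dataset."""
    scores = outlier_factors(points, k)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


def parallel_mine(dataset, region_of, n=1, k=3):
    """Run the same detector on the global dataset and on every regional
    subset as independent tasks; the thread pool only mimics the task
    layout that a cluster framework would distribute across nodes."""
    tasks = {"global": list(dataset)}
    for p in dataset:
        tasks.setdefault(region_of(p), []).append(p)
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(top_outliers, pts, n, k)
            for name, pts in tasks.items()
        }
        return {name: f.result() for name, f in futures.items()}


def demo_dataset():
    """Synthetic 10 x 4 grid with one strong and one mild anomaly."""
    points = [
        (i * 4 + j, float(i), float(j), 1.0)
        for i in range(10)
        for j in range(4)
    ]
    points[9] = (9, 2.0, 1.0, 50.0)   # strong anomaly at (2, 1)
    points[30] = (30, 7.0, 2.0, 5.0)  # mild anomaly at (7, 2)
    return points


if __name__ == "__main__":
    result = parallel_mine(
        demo_dataset(), lambda p: "west" if p[1] < 5 else "east"
    )
    for name, ranking in result.items():
        print(name, ranking)
```

In this synthetic example the mild anomaly ranks first in its own region but is overshadowed by the strong anomaly in the global ranking, which is the kind of difference that running the detector at both scales is meant to expose.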
Cite this article: Miaoxin PAN, Jiaxiang LIN, Chongcheng CHEN, Xiaoyan YE. Parallel Spatial Outliers Mining based on C-SOM and Spark[J]. Journal of Geo-information Science, 2019, 21(1): 128-136. DOI: 10.12082/dqxxkx.2019.180221
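Table 1. Parallel outlier mining results for the study area
No. | Entire study area: object ID (x, y) | Entire study area: outlier factor | Fuzhou area: object ID (x, y) | Fuzhou area: outlier factor | Quanzhou area: object ID (x, y) | Quanzhou area: outlier factor |
---|---|---|---|---|---|---|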
1 | 2270(645 000, 2 849 000) | 11.351 | 2270(645 000, 2 849 000) | 11.659 | 200(673 000, 2 745 000) | 3.610 |
2 | 592(767 000, 2 809 000) | 10.410 | 592(767 000, 2 809 000) | 10.509 | 158(687 000, 2 777 000) | 3.503 |
3 | 2436(703 000, 2 773 000) | 7.604 | 590(761 000, 2 811 000) | 6.355 | 42(703 000, 2 773 000) | 3.045 |
4 | 590(761 000, 2 811 000) | 6.619 | 2271(649 000, 2 851 000) | 4.476 | 170(669 000, 2 773 000) | 2.855 |
5 | 2271(649 000, 2 851 000) | 4.494 | 1208(777 000, 2 933 000) | 3.898 | 161(677 000, 2 779 000) | 2.825 |
6 | 2564(669 000, 2 773 000) | 4.391 | 2190(689 000, 2 855 000) | 3.668 | 181(689 000, 2 757 000) | 2.532 |
7 | 2552(687 000, 2 777 000) | 4.333 | 582(761 000, 2 817 000) | 3.222 | 176(703 000, 2 763 000) | 2.474 |
8 | 1208(777 000, 2 933 000) | 3.874 | 579(757 000, 2 819 000) | 3.115 | 17(673 000, 2 753 000) | 2.218 |
9 | 2190(689 000, 2 855 000) | 3.649 | 585(755 000, 2 813 000) | 3.089 | 2(681 000, 2 771 000) | 2.168 |
10 | 2594(673 000, 2 745 000) | 3.509 | 1869(789 000, 2 831 000) | 3.029 | 162(673 000, 2 767 000) | 1.988 |
11 | 582(761 000, 2 817 000) | 3.204 | 1835(671 000, 2 855 000) | 2.987 | 129(679 000, 2 741 000) | 1.928 |
12 | 266(737 000, 2 851 000) | 3.083 | 1467(717 000, 2 907 000) | 2.818 | 174(673 000, 2 779 000) | 1.877 |