Journal of Geo-information Science >
A Framework of Distributed Spatial Data Analysis Based on Shark/Spark
Received date: 2014-10-11
Request revised date: 2014-11-26
Online published: 2015-04-10
Copyright
With the development of technology, spatial datasets continue increasing in an incredible speed. Traditional data management based on single-node DBMS hardly meets the demands of high-concurrence in massive data. The rise of cloud computing brings brand new opportunities and challenges. Some researchers adopt a hybrid solution that combines the fault tolerance, heterogeneous cluster, and distributed computing framework together for efficient performances. Derived from the computing framework of Spark, Shark is a computing engine for fast data analysis. When a query is submitted, Shark compiles the query into an operator tree represented by RDDs, which will then be translated by Spark into a graph of tasks for execution. Shark does not support spatial query at the moment; therefore, we introduce an approach to enable Shark/Spark to support spatial query. With the APIs and UDFs that provided by Shark, Shark/Spark has the capability to process spatial data fetching from spatial databases and perform spatial queries according to the demands. Integrating Shark/Spark and relevant components which include mapping, loading, backup and querying of spatial data, and taking the advantages of the efficient spatial data management of spatial databases and high performance computing that involves the large-scale data processing of Spark, a framework of distributed spatial data analysis based on Shark/Spark has been implemented. During the implementation and testing process, we found that in order to achieve a better performance, some queries which had impacts on the returned dataset, should be pushed entirely into the database layer; while the other queries should be performed in Spark. In addition, we found that this system outperformed ArcGIS on Hadoop in some queries because the spatial index of spatial databases could improve its efficiency. Moreover, data management using a spatial database would be much more independent and convenient.
Key words: Shark; Spark; Hadoop; Spatial database; Spatial query
WEN Xin , LUO Kan , CHEN Rongguo . A Framework of Distributed Spatial Data Analysis Based on Shark/Spark[J]. Journal of Geo-information Science, 2015 , 17(4) : 401 -407 . DOI: 10.3724/SP.J.1047.2015.00401
Fig. 1 The architecture of distributed spatial data analysis based on Shark/Spark图1 基于Shark/Spark的分布式空间数据分析框架 |
Fig. 2 Flow chart of spatial data loading图2 空间数据加载 |
Fig. 3 Backup mechanism for spatial data图3 空间数据备份 |
Fig. 4 Query and compiling process of spatial data object图4 空间数据对象查询解析流程 |
Fig. 5 Flow chart of the pushing function in spatial query图5 空间查询中函数下推流程 |
Fig. 6 Testing systems for the cluster图6 集群测试环境 |
Fig. 7 Example of the spatial query in SELECT clause图7 SELECT子句中空间查询示例 |
Fig. 8 Example of the spatial query in WHERE clause图8 WHERE子句中空间查询示例 |
The authors have declared that no competing interests exist.
[1] |
|
[2] |
|
[3] |
|
[4] |
|
[5] |
|
[6] |
|
[7] |
|
[8] |
|
[9] |
|
[10] |
|
[11] |
|
[12] |
|
[13] |
|
[14] |
|
[15] |
|
[16] |
|
[17] |
|
[18] |
|
[19] |
|
[20] |
|
/
〈 |
|
〉 |