地球信息科学学报 ›› 2016, Vol. 18 ›› Issue (9): 1174-1183.doi: 10.3724/SP.J.1047.2016.01174

• 地球信息科学理论与方法 • 上一篇    下一篇

Web环境下地学数据共享用户行为模式分析

王末1,2, 王卷乐1,3,*()   

  1. 1. 中国科学院地理科学与资源研究所 资源与环境信息系统国家重点实验室,北京 100101
    2. 中国科学院大学,北京 100049
    3. 江苏省地理信息资源开发与利用协同创新中心,南京 210023
  • 收稿日期:2015-11-06 修回日期:2016-03-16 出版日期:2016-09-27 发布日期:2016-09-27
  • 作者简介:

    作者简介:王 末(1987-),男,博士生,研究方向为空间数据挖掘。E-mail: wangm.13b@igsnrr.ac.cn

  • 基金资助:
    国家科技基础条件平台——地球系统科学数据共享平台(2005DKA32300)科技基础性工作重点项目(2011FY110400)中国工程院国际工程科技知识中心项目

A Study on the User Behavior of Geoscience Data Sharing Based on Web Usage Mining

WANG Mo1,2, WANG Juanle1,3,*()   

  1. 1. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, CAS, Beijing 100101, China
    2. University of Chinese Academy of Sciences, Beijing 100049, China
    3. Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing 210023, China
  • Received:2015-11-06 Revised:2016-03-16 Online:2016-09-27 Published:2016-09-27
  • Contact: WANG Juanle

摘要:

了解科学数据共享用户行为特征对实现高效、精准的数据共享服务具有重要的参考意义。本文基于国家地球系统科学数据共享平台网站服务器日志及服务记录数据,利用空间数据挖掘及Web使用挖掘技术,探索地球系统科学数据共享用户行为模式。在数据预处理阶段,完成用户识别、会话识别、位置识别,并对数据进行空间建模、空间数据库建库。在数据挖掘阶段,分别对用户产生的网页浏览数、会话数、数据集浏览数为对象进行空间“热点”分析,识别用户行为的地域差异。针对用户数据浏览和下载行为,采用FP-growth算法对用户——数据之间进行关联规则挖掘,发现用户对数据关注和使用的高频规律。分析结果表明:(1)该共享平台用户地在国内各省市均有分布,用户最多的3个省(市)分别为北京市、山东省、江苏省,该分布与国内高校学生分布相关程度不高,但与“211工程”高校学生的空间分布相关度较高;(2)空间“热点”分析表明,北京、天津及河北北部无论在网页浏览、数据浏览还是会话量上都是“热点”区域,但识别的“冷点”区域有较大不同,尤其是数据访问“冷点”分布较广,如南方沿海省份、河南省、山东省、四川省等;(3)关联规则挖掘发现多个数据浏览高频项目集以及关联规则。数据下载高频项与数据浏览高频模式较好吻合,但下载行为未表现出明显关联规则。本文提供了一种结合Web使用挖掘和空间数据挖掘的用户行为模式挖掘方法,该方法也可用于其他类型网站的数据挖掘。

关键词: 网络数据挖掘, 空间数据挖掘, 用户行为模式, 科学数据共享, 地球系统科学数据

Abstract:

Understanding the user behavior of science data sharing is a key step to implement effective and accurate service for science data sharing. This study aims to explore the user behavior of science data sharing using spatial data mining and Web usage mining techniques for the National Earth System Science Data Sharing Platform. At the stage of data preprocessing, procedures of user identification, session identification and user location identification were performed. Spatial hotspot analysis was conducted to analyze the user pageviews, sessions, and dataset visits to explore the geographical variance of user behaviors using the Getis-Ord Gi* method. FP-growth was taken to be the algorithm for mining association rules, and was performed for analyzing data visits and data downloads. Data mining results show that: (1) the user distribution of data sharing platform does not show significant correlation with the overall university population distribution in China, but shows a significant positive correlation with the population of research-oriented universities; (2) the hotspot analysis shows that regions of hotspots were clustering in Beijing, Tianjin, and northern Hebei Province for all three perspectives, whereas the cold spots geographically scattered to a greater extent, e.g. the southern coastal provinces, Henan Province, Shandong Province, Sichuan Province, etc.; (3) the association rules mining reveals a number of frequently visited item sets and rules from the valuable user pageviews. The frequently visited item sets for data downloads were well coincided with the frequently visited data. However, no conspicuous rules occurred in data downloads. Results of the spatial hotspot analysis and association rules mining detected the geographical variance of users’ interests in data and discovered the usage patterns for the frequently visited data, which can be used for designing the personalized recommendation. This study provides a method for mining web user behaviors with the combination of Web usage mining and spatial data mining techniques, which can also be applied to the data mining of websites in other fields.

Key words: Web usage mining, spatial data mining, user behavior mining, science data sharing, Earth System Science data