Journal of Geo-information Science
A Study on the User Behavior of Geoscience Data Sharing Based on Web Usage Mining
Received date: 2015-11-06
Request revised date: 2016-03-16
Online published: 2016-09-27
Understanding how users behave on a science data sharing platform is a key step toward delivering effective and accurate data sharing services. This study explores user behavior on the National Earth System Science Data Sharing Platform using spatial data mining and Web usage mining techniques. In the data preprocessing stage, user identification, session identification, and user location identification were performed. Spatial hotspot analysis with the Getis-Ord Gi* method was applied to user pageviews, sessions, and dataset visits to explore the geographical variation in user behavior. The FP-growth algorithm was used to mine association rules from dataset visits and dataset downloads. The results show that: (1) the distribution of platform users is not significantly correlated with the overall distribution of the university population in China, but is significantly and positively correlated with the population of research-oriented universities; (2) for all three measures, hotspots cluster in Beijing, Tianjin, and northern Hebei Province, whereas cold spots are more geographically scattered, e.g. the southern coastal provinces and Henan, Shandong, and Sichuan Provinces; (3) association rule mining reveals a number of frequently visited itemsets and rules in the valuable user pageviews. The frequent itemsets for data downloads coincide well with the frequently visited datasets, but no conspicuous rules emerge for data downloads. Together, the spatial hotspot analysis and the association rule mining reveal the geographical variation in users' interests in data and the usage patterns of frequently visited datasets, which can inform the design of personalized recommendation.
This study provides a method for mining Web user behavior that combines Web usage mining with spatial data mining techniques; the method can also be applied to websites in other fields.
WANG Mo, WANG Juanle. A Study on the User Behavior of Geoscience Data Sharing Based on Web Usage Mining[J]. Journal of Geo-information Science, 2016, 18(9): 1174-1183. DOI: 10.3724/SP.J.1047.2016.01174
Fig.1 An example of Web server log entries
Tab.1 Contents of a Web server log entry
Category | Detail |
---|---|
Host IP | 128.227.49.92 |
Time | 05/Aug/2014:10:26:42 +0800 |
Method | GET |
URL | /extra/res/libs/kendo/extensions/kendo.extension.ui.js |
Protocol | HTTP/1.1 |
Status | 200 |
File size | 15072 |
Referrer | http://www.geodata.cn/extra/TopicsWin2/pro3.jsp |
User agent | Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0 |
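The fields listed in Tab.1 can be pulled out of a raw log line with a regular expression. The sketch below is illustrative only: it assumes the server writes the common "combined" log layout, which the source does not state, and the sample line is reassembled from the Tab.1 values rather than copied from the platform's logs. The names `LOG_PATTERN` and `parse_log_line` are hypothetical.

```python
import re
from datetime import datetime

# Assumption: the log follows the widely used "combined" layout:
# host ident user [time] "method url protocol" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the textual fields that later steps need as typed values.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# Sample line rebuilt from the field values in Tab.1 (layout is assumed).
SAMPLE = (
    '128.227.49.92 - - [05/Aug/2014:10:26:42 +0800] '
    '"GET /extra/res/libs/kendo/extensions/kendo.extension.ui.js HTTP/1.1" '
    '200 15072 "http://www.geodata.cn/extra/TopicsWin2/pro3.jsp" '
    '"Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0"'
)

entry = parse_log_line(SAMPLE)
print(entry["host"], entry["method"], entry["status"], entry["size"])
```

Parsing into typed fields first keeps later steps, such as cleaning and session identification, independent of the raw text layout.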
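2.1.2 Registered user service records
User registration information supplies important extrinsic attributes of users for the data mining process. It provides a basis for interpreting user behavior and can also be used to classify users. This study uses anonymized user registration information as auxiliary data to determine where users come from. The registration information includes each user's education level, occupation, contact information, and affiliated institution.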
Fig.2 User pageview matrix (in this case, A, B and C represent different webpages)
Fig.3 An example of a georeferenced user transaction data model; the blue line represents the transaction vector of a user located at 30°E, 45°N
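The two data models in Fig.2 and Fig.3 can be sketched as plain Python structures. This is an illustration under assumptions, not the authors' implementation: it assumes each matrix cell holds a visit count (the source does not say whether cells are counts or binary flags), and it prepends longitude and latitude to each session row to mimic the georeferenced vector. Apart from the 30°E, 45°N user mentioned in the Fig.3 caption, the sessions and coordinates are made up.

```python
def build_pageview_matrix(sessions, pages):
    """Build a sessions x pages matrix of visit counts (assumed cell meaning)."""
    column = {page: j for j, page in enumerate(pages)}
    matrix = [[0] * len(pages) for _ in sessions]
    for i, visited in enumerate(sessions):
        for page in visited:
            matrix[i][column[page]] += 1
    return matrix


def georeference(matrix, locations):
    """Prepend (longitude, latitude) to each session vector."""
    return [[lon, lat] + row for (lon, lat), row in zip(locations, matrix)]


pages = ["A", "B", "C"]
# Hypothetical sessions: each is the ordered list of pages a user viewed.
sessions = [["A", "B", "A"], ["C"], ["B", "C"]]
# First location is the 30°E, 45°N example; the others are hypothetical.
locations = [(30.0, 45.0), (116.4, 39.9), (121.5, 31.2)]

matrix = build_pageview_matrix(sessions, pages)
vectors = georeference(matrix, locations)
for row in vectors:
    print(row)
```

Keeping the coordinates in the same row as the pageview counts lets one record feed both the spatial analysis and the association rule mining.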
Tab.2 Statistics of data preprocessing results
Raw log records | Records after cleaning | Users | Sessions | Locations identified |
---|---|---|---|---|
11 062 608 | 2 845 150 | 76 111 | 448 495 | 76 069 |
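The session counts in Tab.2 come from session identification, which groups each user's requests into visits. A minimal sketch of one common heuristic follows: a new session starts whenever the gap between consecutive requests exceeds a timeout. The 30-minute value is an assumption chosen for illustration and is not taken from the source, and `split_sessions` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def split_sessions(timestamps, timeout=timedelta(minutes=30)):
    """Group one user's request times into sessions by inactivity gap.

    Assumption: a gap longer than `timeout` starts a new session.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])    # gap too long: open a new session
    return sessions


# Hypothetical request times for a single user.
requests = [
    datetime(2014, 8, 5, 10, 0),
    datetime(2014, 8, 5, 10, 10),
    datetime(2014, 8, 5, 10, 50),
    datetime(2014, 8, 5, 11, 0),
]
user_sessions = split_sessions(requests)
print(len(user_sessions), [len(s) for s in user_sessions])
```

With these sample times, the 40-minute gap between the second and third requests splits the activity into two sessions of two requests each.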
Fig.4 Fitted probability distribution of user session length
Fig.5 Distribution of the number of users in China
Fig.6 Hotspot analysis of user pageviews
Fig.7 Hotspot analysis of user sessions
Fig.8 Hotspot analysis of dataset visits
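The hotspot maps in Fig.6 to Fig.8 rest on the Getis-Ord Gi* statistic, which compares the weighted sum of values around a region (including the region itself) with what would be expected from the global mean. The sketch below implements the standard z-score form of Gi* in plain Python. The five regions, their values, and the binary neighbor weights are hypothetical, and how the authors defined spatial weights is not stated in the source.

```python
import math


def getis_ord_gi_star(values, weights):
    """Return the Gi* z-score for every region.

    values  : list of n observations (e.g. pageviews per region)
    weights : n x n spatial weights; weights[i][i] is nonzero because Gi*
              counts the region itself as part of its neighborhood
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for i in range(n):
        w = weights[i]
        sum_w = sum(w)
        sum_w2 = sum(wij * wij for wij in w)
        numerator = sum(wij * x for wij, x in zip(w, values)) - mean * sum_w
        denominator = std * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores


# Hypothetical example: five regions in a row, each neighboring itself
# and the regions directly beside it (binary weights).
values = [100, 90, 10, 5, 8]
weights = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
]
scores = getis_ord_gi_star(values, weights)
print([round(z, 2) for z in scores])
```

In this toy input the first region, surrounded by high values, scores close to +2 (a hotspot), while the fourth region, surrounded by low values, scores close to -2 (a cold spot). The function divides by zero if all values are equal or if a region's neighborhood covers every region, so real use would need guards for those cases.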
Tab.3 Frequent itemsets for dataset visits of all users (S≥10%)
Itemset | Support (S)/(%) | Description |
---|---|---|
100101-22 | 27.1 | 1:4,000,000 geomorphological map of China (morphology) |
100101-2 | 12.9 | 1:4,000,000 resource and environment data of China (topography of China, 1988) |
100101-18 | 11.6 | National land use database (by province: 1980s, 1987-2001; by county: 1980s) |
100101-38 | 10.8 | National 1 km gridded population data (1995, 2000, 2003, 2005, and 2010) |
100101-66 | 10.6 | 1:4,000,000 all-element basic data of China (1970s-1990s) |
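Tab.4 Frequent itemsets for dataset visits of active users (S≥25%)
Itemset | Support (S)/(%) | Description |
---|---|---|
100101-18 | 34.1 | National land use database (by province: 1980s, 1987-2001; by county: 1980s) |
100101-38 | 32.4 | National 1 km gridded population data (1995, 2000, 2003, 2005, and 2010) |
100101-2 | 30.7 | 1:4,000,000 resource and environment data of China (topography of China, 1988) |
100101-3 | 29.6 | 1:250,000 digitized current land use map of Zhejiang Province, 1996 |
100101-30 | 29.2 | National multi-year mean rainfall distribution map (1 km) (from station establishment to 1996) |
100101-38, 100101-18 | 28.0 | National 1 km gridded population data; National land use database |
100101-18, 100101-2 | 27.5 | National land use database; 1:4,000,000 resource and environment data of China |
100101-30, 100101-18 | 27.2 | National multi-year mean rainfall distribution map; National land use database |
100101-66 | 27.1 | 1:4,000,000 all-element basic data of China (1970s-1990s) |
100101-18, 100101-3 | 26.8 | National land use database; 1:250,000 digitized current land use map of Zhejiang Province, 1996 |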
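Support, as reported in Tab.3 and Tab.4, is the share of sessions that contain every item of an itemset. The sketch below computes it by direct counting over a handful of hypothetical sessions. It is not FP-growth: it enumerates itemsets by brute force, which gives the same frequent itemsets for a given threshold but scales poorly. The dataset IDs are borrowed from the tables, while the sessions themselves are invented.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets meeting min_support.

    Brute-force counting for illustration; FP-growth reaches the same
    result without enumerating every combination.
    """
    n = len(transactions)
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # ignore repeat visits in a session
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {
        itemset: count / n
        for itemset, count in counts.items()
        if count / n >= min_support
    }


# Hypothetical sessions, each listing the dataset IDs that were visited.
transactions = [
    ["100101-18", "100101-38"],
    ["100101-18", "100101-2"],
    ["100101-18", "100101-38", "100101-2"],
    ["100101-66"],
]
result = frequent_itemsets(transactions, min_support=0.5)
for itemset, support in sorted(result.items(), key=lambda kv: -kv[1]):
    print(itemset, f"{support:.0%}")
```

Here `100101-18` appears in three of four sessions, so its support is 75%, while `100101-66` appears in only one and falls below the 50% threshold.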
Tab.5 Association rules (C≥90%)
Association rule | Confidence (C)/(%) |
---|---|
100101-30 ==> 100101-2 | 90.4 |
100101-3 ==> 100101-18 | 90.8 |
100101-38, 100101-18 ==> 100101-2 | 91.4 |
100101-18, 100101-2 ==> 100101-3 | 92.4 |
100101-2, 100101-18 ==> 100101-38 | 92.9 |
100101-30, 100101-18 ==> 100101-3 | 93.0 |
100101-30 ==> 100101-18 | 93.1 |
100101-18, 100101-3 ==> 100101-30 | 94.1 |
100101-18, 100101-2 ==> 100101-30 | 94.2 |
100101-18, 100101-3 ==> 100101-2 | 94.6 |
100101-30, 100101-2 ==> 100101-3 | 95.4 |
100101-30, 100101-18 ==> 100101-2 | 95.4 |
100101-2, 100101-3 ==> 100101-30 | 96.9 |
100101-38, 100101-2 ==> 100101-18 | 97.2 |
100101-2, 100101-3 ==> 100101-18 | 97.8 |
100101-30, 100101-3 ==> 100101-2 | 98.2 |
100101-30, 100101-2 ==> 100101-18 | 98.2 |
100101-30, 100101-3 ==> 100101-18 | 98.5 |
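The confidence of a rule A ==> B is the support of the combined itemset divided by the support of the antecedent A. As a worked check, the sketch below applies that definition to three support values from Tab.4 for the rule 100101-30 ==> 100101-18. The helper name `rule_confidence` is hypothetical, and the small difference from the Tab.5 figure is presumably due to rounding of the published supports.

```python
def rule_confidence(antecedent, consequent, supports):
    """Confidence of antecedent ==> consequent from itemset supports."""
    combined = frozenset(antecedent) | frozenset(consequent)
    return supports[combined] / supports[frozenset(antecedent)]


# Supports (%) for active users, as listed in Tab.4.
supports = {
    frozenset({"100101-30"}): 29.2,
    frozenset({"100101-18"}): 34.1,
    frozenset({"100101-30", "100101-18"}): 27.2,
}

forward = rule_confidence({"100101-30"}, {"100101-18"}, supports)
reverse = rule_confidence({"100101-18"}, {"100101-30"}, supports)
print(f"100101-30 ==> 100101-18: {forward:.1%}")
print(f"100101-18 ==> 100101-30: {reverse:.1%}")
```

The forward rule comes out at roughly 93%, in line with the 93.1% in Tab.5. The reverse rule comes out at roughly 80%, below the 90% threshold, which is consistent with its absence from the table and shows that confidence is not symmetric.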
3.3.2 Association rules for data downloads or applications
Tab.6 Frequent itemsets for dataset downloads or applications by registered users (top 5)
Itemset | Support (S)/(%) | Description |
---|---|---|
100101-66 | 13.7 | 1:4,000,000 all-element basic data of China (1970s-1990s) |
100101-38 | 9.6 | National 1 km gridded population data (1995, 2000, 2003, 2005, and 2010) |
100101-11860 | 8.1 | National 1:250,000 land cover data (1980s, 2005) |
100101-18 | 8.0 | National land use database (by province: 1980s, 1987-2001; by county: 1980s) |
100101-29 | 7.3 | Landsat MSS/TM/ETM+ (1973-2008, national coverage) |
The authors have declared that no competing interests exist.