Original Article

A Study on the User Behavior of Geoscience Data Sharing Based on Web Usage Mining

  • WANG Mo 1, 2
  • WANG Juanle 1, 3, *
  • 1. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, CAS, Beijing 100101, China
  • 2. University of Chinese Academy of Sciences, Beijing 100049, China
  • 3. Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing 210023, China
*Corresponding author: WANG Juanle, E-mail:

Received date: 2015-11-06

  Request revised date: 2016-03-16

  Online published: 2016-09-27

Copyright

Editorial Office of Journal of Geo-information Science. All rights reserved.

Abstract

Understanding user behavior in science data sharing is a key step toward effective and accurate data sharing services. This study explores the user behavior of science data sharing on the National Earth System Science Data Sharing Platform using spatial data mining and Web usage mining techniques. In the data preprocessing stage, user identification, session identification, and user location identification were performed. Spatial hotspot analysis with the Getis-Ord Gi* method was applied to user pageviews, sessions, and dataset visits to explore the geographical variance of user behavior. The FP-growth algorithm was used to mine association rules from data visits and data downloads. The results show that: (1) the user distribution of the platform is not significantly correlated with the overall university population in China, but is significantly and positively correlated with the population of research-oriented universities; (2) hotspots cluster in Beijing, Tianjin, and northern Hebei Province from all three perspectives, whereas cold spots are more scattered, e.g., in the southern coastal provinces, Henan Province, Shandong Province, and Sichuan Province; (3) association rule mining reveals a number of frequently visited itemsets and rules from user pageviews. The frequent itemsets for data downloads coincide well with the frequently visited data, but no conspicuous rules emerge from data downloads. The spatial hotspot analysis and association rule mining thus detect the geographical variance of users' interests in data and uncover usage patterns for frequently visited data, which can inform the design of personalized recommendation. By combining Web usage mining with spatial data mining, this study provides a method for mining Web user behavior that can also be applied to websites in other fields.

Cite this article

WANG Mo, WANG Juanle. A Study on the User Behavior of Geoscience Data Sharing Based on Web Usage Mining[J]. Journal of Geo-information Science, 2016, 18(9): 1174-1183. DOI: 10.3724/SP.J.1047.2016.01174

1 Introduction

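Scientific data are a basic condition of scientific research and underpin the formulation of scientific conclusions and the making of science-based decisions[1]. Data-intensive research and data-driven discovery have become a new paradigm of science. Such research first requires sufficient data and an open data environment that supports access to multi-source data, so scientific data sharing is the most basic requirement for realizing this paradigm. Data sharing has a long history in academia. The idea of systematic data sharing emerged in the second half of the 20th century, when the "grand challenges" of science (e.g., the Higgs particle, the human genome sequence, and global climate change) made the academic community aware of the importance of cross-disciplinary data sharing. Developed countries began early to address data sharing at the level of policy and national institutions; for example, the United States started in the second half of the 20th century to build a legally backed data sharing mechanism and elevated data sharing to a national strategy[2]. Since the 1980s, China has also promoted scientific data sharing at several levels, and in the early 21st century it successively launched the National Scientific Data Sharing Project and the construction of the National Science and Technology Infrastructure Platforms[3]. Against this background, understanding the behavioral characteristics of scientific data sharing users is an important reference for delivering efficient and accurate data sharing services, and even for formulating data sharing policy.
With the rapid development of computer and Internet technology, the Internet has become the main channel for scientific data sharing. Obtaining research data from specialized scientific data sharing websites has become part of the research workflow, so the online behavior of data sharing users can be regarded as Web usage behavior. Web usage mining is a research field that discovers valuable knowledge and user behavior patterns from the log data generated by servers[4-5]. Its theories and methods have been widely applied, for example to navigation-based user pattern analysis[6-8], user behavior prediction[9-10], personalized recommendation[11-13], and website service improvement[14-15]. Because websites serve different professional domains, Web usage mining has been applied in many areas, such as e-commerce[16-17], health care[18], online education[19-20], and tourism analysis[21-22]. Mining user behavior on shopping websites yields knowledge about consumer demand and purchasing trends; mining health-care services supports the analysis of medical demand and the segmentation of patients; mining user behavior on online education websites reveals learning needs and optimal course combinations; and mining user behavior on tourism websites reveals user preferences and supports destination recommendation. In the field of scientific data sharing, however, user behavior patterns and regularities are still poorly understood. Mining the user behavior patterns of scientific data sharing websites can reveal users' data needs, user clusters, and association rules among datasets, providing a reference for improving the efficiency of data sharing and the strategies of data sharing services.
The National Earth System Science Data Sharing Platform (geodata.cn), one of the National Science and Technology Infrastructure Platforms, is China's main data sharing network in the geosciences. Its shared data resources are comprehensive, covering the atmosphere, the land surface, the terrestrial hydrosphere, natural resources, the ocean, and more, which makes it well representative of scientific data sharing in China. Based on the Web server logs and service records of this national platform, this study uses Web usage mining and spatial data mining techniques to mine the data sharing behavior patterns of the platform's website.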

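2 Data and Methods

2.1 Data

The data used in this study are mainly Web server logs, registered-user service records, and user registration information.
2.1.1 Web server logs
Web server logs record visitors' navigation behavior and are the primary data source in Web usage mining. Each access to the server corresponds to one HTTP request and produces one record in the server access log. Each log record contains several fields (determined by the log format), typically including the date and time of the request, the client IP address, the requested resource, the parameters used by the invoked Web application, the request status, the HTTP method, the user agent, and the referring resource; under some browser settings it also contains client-side cookies that record repeat visits.
This study obtained the 2014 Web server log files and database log files of the National Earth System Science Data Sharing Platform. The Web server logs are in Apache's NCSA ECLF format (Fig.1). The full-year log contains 11 062 608 records.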
Fig.1 An example of Web server log entries

Taking one log entry as an example, the information shown in Tab.1 can be extracted from the log data:
Tab.1 Contents of a Web server log entry

Field              Detail
Host IP            128.227.49.92
Time               05/Aug/2014:10:26:42 +0800
Method             GET
URL                /extra/res/libs/kendo/extensions/kendo.extension.ui.js
Protocol           HTTP/1.1
Status             200
File size (bytes)  15072
Referrer           http://www.geodata.cn/extra/TopicsWin2/pro3.jsp
Client             Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
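
As an illustration of how the fields in Tab.1 might be extracted, the following minimal sketch parses one log line with a regular expression. The raw log line is not reproduced in the text, so `SAMPLE` is reconstructed from the values in Tab.1 under the assumption of the usual Apache combined layout (with `-` for the two identity fields); the pattern and the function name `parse_log_line` are illustrative, not the authors' code.

```python
import re
from datetime import datetime

# Assumed layout: host ident authuser [time] "method url protocol" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Line reconstructed from Tab.1 (hypothetical; the identity fields are assumed to be "-").
SAMPLE = (
    '128.227.49.92 - - [05/Aug/2014:10:26:42 +0800] '
    '"GET /extra/res/libs/kendo/extensions/kendo.extension.ui.js HTTP/1.1" '
    '200 15072 "http://www.geodata.cn/extra/TopicsWin2/pro3.jsp" '
    '"Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0"'
)

if __name__ == "__main__":
    print(parse_log_line(SAMPLE))
```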

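2.1.2 Registered-user service records

Registered-user service records capture registered users' online downloads and offline requests for shared data. This study obtained the 2014 data download log of registered users of the National Earth System Science Data Sharing Platform website. The log contains 170 809 records, each including the user name, the IP address, and the name of the dataset downloaded or requested.
2.1.3 User registration information

User registration information supplies important external attributes of users during data mining; it helps explain user behavior and can also be used to classify users. This study uses anonymized user registration information as auxiliary data to determine where users come from. The registration information includes the user's education level, occupation, contact details, and institution.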

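2.2 Methods

2.2.1 Data preprocessing
Data preprocessing is the most tedious and time-consuming part of Web usage mining. In this study it consists of the following five parts.
(1) Data cleaning
Data cleaning removes log entries that are irrelevant to the mining task, such as browser requests for images, audio, and CSS style files, and requests from Web crawlers[23]. The cleaning algorithm used here has three steps:
① Removing useless requests, such as requests for image, audio, and style files. This is done by checking the suffix of the URL field: every request containing a suffix such as ".jpg", ".gif", ".map", ".mp3", or ".css" is removed.
② Removing Web crawler requests. Three heuristics are used to find crawler entries: (a) entries whose requested page is "robots.txt"; (b) entries whose user agent matches that of a known crawler (Baidu, Google, Sogou, etc.), identified with regular expressions for common search engines; and (c) entries flagged by browsing speed, defined as BS = number of pages viewed / session time. If BS exceeds a threshold t and the number of pages visited in a session exceeds a threshold n, the entries are treated as crawler requests. This study sets t to one page per 2 s (0.5 pages/s) and n to 100.
③ Removing failed requests, using the request status code. All entries with a status code below 200 or above 400 are unsuccessful requests and are removed.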
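
The three cleaning steps above can be sketched as a set of small predicates. This is a minimal sketch, not the authors' implementation: the user-agent tokens in `CRAWLER_UA`, the function names, and the choice to test the suffix on the URL path (ignoring any query string) are assumptions made for illustration. The status rule follows the text as stated (below 200 or above 400).

```python
import re

# Suffixes named in the text; a real deployment would likely extend this list.
STATIC_SUFFIXES = (".jpg", ".gif", ".map", ".mp3", ".css")

# Illustrative user-agent tokens for Baidu, Google, and Sogou crawlers (assumed).
CRAWLER_UA = re.compile(r"baiduspider|googlebot|sogou", re.IGNORECASE)


def is_static_resource(url):
    """Step 1: request for an image, audio, or style file."""
    path = url.split("?", 1)[0].lower()
    return path.endswith(STATIC_SUFFIXES)


def is_known_crawler(url, agent):
    """Step 2, heuristics (a) and (b): robots.txt request or crawler user agent."""
    path = url.split("?", 1)[0]
    return path.endswith("robots.txt") or bool(CRAWLER_UA.search(agent))


def is_fast_crawler(pages_viewed, session_seconds, t=0.5, n=100):
    """Step 2, heuristic (c): BS = pages / seconds exceeds t and pages exceed n."""
    if session_seconds <= 0:
        return pages_viewed > n
    return pages_viewed / session_seconds > t and pages_viewed > n


def is_failed_request(status):
    """Step 3: status code below 200 or above 400, as stated in the text."""
    return status < 200 or status > 400


def keep_record(record):
    """Record-level filter; heuristic (c) needs sessions and is applied separately."""
    return not (
        is_static_resource(record["url"])
        or is_known_crawler(record["url"], record["agent"])
        or is_failed_request(record["status"])
    )
```

The browsing-speed check operates on whole sessions rather than single records, so in practice it would run after a first pass of session identification.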
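(2) User identification
User identification distinguishes the anonymous visitors of a website. Because the Web server logs used here contain no user authentication information, identifying users is the first problem to solve. The most accurate method uses cookies, but cookie information is not available in this study. Another common method uses the user's IP address. IP alone, however, is not enough to identify individual users accurately, because proxy servers assign IP addresses to users dynamically[24]. When a proxy server assigns the same IP address to several computers, the referrer field in the log and the website topology are used to check whether the user could have reached the currently requested page from recently visited pages. This study developed a heuristic method to identify users, as follows:
① When a new IP address appears, a new user is assumed.
② Among the users identified in step 1, if the same IP address is associated with a different browser or operating system, a new user is created.
③ Among the users identified in step 2, if the URL requested by a user cannot be reached from any page that user visited within the previous 30 min, a new user is created.
The website studied is navigated mainly through data category browsing or keyword search, and there are few hyperlinks between data pages, so the site topology makes it relatively easy to separate different users behind the same IP. Step 3 can therefore distinguish different users within the same local area network fairly accurately.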
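(3) User location identification
A user's location can be obtained from the user's IP address. This study uses the IP geolocation lookup service provided by ipinfo.io. The service returns the location and the name of the user's Internet service provider. For each IP address query, ipinfo.io returns JSON-formatted information, including the country, city, latitude and longitude, and host name associated with the IP address.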
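
To show how such a JSON reply might be turned into coordinates, here is a minimal sketch that parses a response offline. The sample payload is invented for illustration (a documentation-range IP and placeholder host and provider names), and the field names, in particular `loc` holding a "latitude,longitude" string, are assumptions about the service's reply format rather than details given in the text.

```python
import json


def parse_ip_location(payload):
    """Extract country, city, provider, host name, and coordinates from a JSON reply."""
    info = json.loads(payload)
    # Assumed: "loc" is a "latitude,longitude" string.
    lat, lon = (float(value) for value in info["loc"].split(","))
    return {
        "country": info.get("country"),
        "city": info.get("city"),
        "org": info.get("org"),
        "hostname": info.get("hostname"),
        "lat": lat,
        "lon": lon,
    }


# Hypothetical reply, not real lookup output.
SAMPLE_REPLY = (
    '{"ip": "203.0.113.7", "hostname": "host.example.org", "city": "Beijing", '
    '"country": "CN", "loc": "39.9075,116.3972", "org": "AS0000 Example Network"}'
)

if __name__ == "__main__":
    print(parse_ip_location(SAMPLE_REPLY))
```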
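(4) Session identification
Session identification splits a user's clickstream into units of visit. A session can be defined as the sequence of pages a user views during one visit to the website within a certain period. The most common approach is the time-window method: a time threshold (e.g., 30 min) is set, and when the time elapsed in a user's visit exceeds the threshold, a new session begins. In a comparative study, Berendt et al.[25] found that the referrer-based heuristic algorithm achieves a good identification rate. On top of a time window, this method considers whether the referrer page appears among the recent access records, and it can be regarded as an improvement on the time-window method. This study uses it for session identification.
Because the true user sessions are unavailable, the accuracy of session identification cannot be measured by absolute error. However, Web user sessions exhibit an inherent distribution that can be used to evaluate identification accuracy. Levene et al.[26] found that the length of Web user sessions follows an inverse power-law distribution. How closely the length distribution of the identified sessions follows an inverse power law therefore gives some measure of the quality of session identification.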
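
One possible reading of the referrer-based heuristic is sketched below: a request joins the current session only if it falls within the time window of the previous request and its referrer is either missing or a page already seen in that session. The exact rule the authors applied is not spelled out in the text, so the 30 min window, the treatment of empty referrers, and the assumption that referrers and URLs are normalized to the same form are all illustrative choices.

```python
from datetime import datetime, timedelta


def sessionize(requests, timeout=timedelta(minutes=30)):
    """Split one user's requests into sessions.

    requests: list of (time, url, referrer) tuples sorted by time.
    Returns a list of sessions, each a list of (time, url).
    """
    sessions = []
    for time, url, referrer in requests:
        current = sessions[-1] if sessions else None
        if current is not None:
            last_time = current[-1][0]
            seen = {visited for _, visited in current}
            within_window = time - last_time <= timeout
            # A missing referrer ("" or "-") does not break the session by itself.
            linked = referrer in ("", "-") or referrer in seen
            if within_window and linked:
                current.append((time, url))
                continue
        sessions.append([(time, url)])
    return sessions


if __name__ == "__main__":
    start = datetime(2014, 8, 5, 10, 0, 0)
    demo = [
        (start, "/a", "-"),
        (start + timedelta(minutes=5), "/b", "/a"),
        (start + timedelta(minutes=10), "/c", "/x"),   # referrer not seen: new session
        (start + timedelta(minutes=50), "/d", "/c"),   # gap over 30 min: new session
    ]
    for session in sessionize(demo):
        print([url for _, url in session])
```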
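(5) Data modeling
After preprocessing, the Web server log data yield a set of n pageviews, $P = \{p_1, p_2, \ldots, p_n\}$, and a set of m user sessions, $T = \{t_1, t_2, \ldots, t_m\}$, where each $t_i \in T$ is a subset of $P$. On this basis, each user session t can be represented as a sequence of l ordered pairs, as in Eq. (1).
$$t = \left\langle (p_1^t, w(p_1^t)), (p_2^t, w(p_2^t)), \ldots, (p_l^t, w(p_l^t)) \right\rangle \qquad (1)$$
where each $p_i^t = p_j$ for some $j \in \{1, 2, \ldots, n\}$, and $w(p_i^t)$ is the weight of pageview $p_i^t$ in session t[27]. In this study the weight is binary: 1 means the user visited the page, and 0 means the user did not. Given such a session t, each user session can be represented as a pageview vector tv in an n-dimensional space, as in Eq. (2).
$$tv = (w_{p_1}^t, w_{p_2}^t, \ldots, w_{p_n}^t) \qquad (2)$$
If $p_j$ appears in session t, then $w_{p_j}^t = 1$ $(j = 1, 2, \ldots, n)$; otherwise $w_{p_j}^t = 0$. The set of all user sessions can be represented as an $m \times n$ user pageview matrix, as shown in Fig.2.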
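
The binary pageview matrix described by Eq. (2) can be built directly from the identified sessions. The following is a minimal sketch with made-up page names; `pageview_matrix` is a hypothetical helper, not part of the authors' toolchain.

```python
def pageview_matrix(sessions, pages):
    """Build the m x n binary matrix: entry [i][j] is 1 if page j occurs in session i."""
    column = {page: j for j, page in enumerate(pages)}
    matrix = [[0] * len(pages) for _ in sessions]
    for i, session in enumerate(sessions):
        for page in session:
            matrix[i][column[page]] = 1  # repeated visits still give weight 1
    return matrix


if __name__ == "__main__":
    pages = ["A", "B", "C"]
    sessions = [["A", "B"], ["C"], ["A", "C", "A"]]
    for row in pageview_matrix(sessions, pages):
        print(row)
```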
Fig.2 User pageview matrix (in this case, A, B and C represent different webpages)

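The user pageview matrix in Fig.2 suffices for ordinary user behavior pattern mining, but mining spatial patterns of user behavior also requires the user's location in the matrix. For each user session, a spatially enhanced session vector st can be expressed as Eq. (3).
$$st = (x, y, tv) \qquad (3)$$
where x and y are the user's geographic coordinates and tv is the user pageview vector. The resulting pageview vector model can be viewed as a set of multidimensional vectors placed in three-dimensional space, as shown in Fig.3. After the preprocessing steps above, the final user behavior records are stored in a MySQL database and an ArcGIS Geodatabase to serve the different data mining tasks.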
Fig.3 An example of a georeferenced user transaction data model, the blue line represents the transaction vector of a user located at 30°E, 45°N

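2.2.2 Spatial hotspot analysis of user behavior
Spatial hotspot analysis identifies statistically significant hotspots and cold spots in geographic space. After preprocessing, the user behavior records are stored in a spatial database, with each user treated as a spatial feature and the user's behavior statistics as its attributes. The Getis-Ord Gi*[28] statistic is then applied separately to the number of pageviews, the number of sessions, and the number of datasets visited by each user. The Getis-Ord Gi* statistic, also known as hotspot analysis, is an improvement on the General G statistic[29]. This spatial clustering analysis is widely used in epidemiology[30], precipitation studies[31], agricultural analysis[32], and other fields. Getis-Ord Gi* is computed as in Eq. (4).
$$G_i^* = \frac{\sum_{j=1}^{n} w_{i,j} x_j - \bar{X} \sum_{j=1}^{n} w_{i,j}}{S \sqrt{\dfrac{n \sum_{j=1}^{n} w_{i,j}^2 - \left( \sum_{j=1}^{n} w_{i,j} \right)^2}{n - 1}}} \qquad (4)$$
where $w_{i,j}$ is the spatial weight between features i and j, and n is the total number of features. $\bar{X}$ and $S$ are computed as in Eqs. (5) and (6).
$$\bar{X} = \frac{\sum_{j=1}^{n} x_j}{n} \qquad (5)$$
$$S = \sqrt{\frac{\sum_{j=1}^{n} x_j^2}{n} - \bar{X}^2} \qquad (6)$$
where $x_j$ is the attribute value of spatial feature j.
The spatial weight $w_{i,j}$ is the inverse distance, i.e., the reciprocal of the spatial distance. The Gi* value from Eq. (4) is the z-score of each spatial feature. The higher a statistically significant positive z-score, the more intense the clustering of high values (hotspot); the lower a statistically significant negative z-score, the more intense the clustering of low values (cold spot). The other value to consider is the p-value, the probability that the observed pattern was generated by a random process. The p-value is tied to the z-score through the standard normal distribution, so a given range of p corresponds to a range of z. The confidence level of a detected pattern is (1 - p). This study considers only hotspot or cold-spot patterns with a confidence level of at least 90%.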
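
To make Eqs. (4)-(6) concrete, the sketch below evaluates the Gi* z-score for a handful of points with inverse-distance weights. It is an illustration under stated assumptions, not the GIS workflow used in the study: coordinates are treated as planar, all locations are assumed distinct, and the weight of a feature with itself (`self_weight`, which inverse distance leaves undefined) is set to 1.0 as a hypothetical choice.

```python
import math


def gi_star(points, values, self_weight=1.0):
    """Return the Gi* z-score of every feature, following Eqs. (4)-(6)."""
    n = len(values)
    mean = sum(values) / n                                       # Eq. (5)
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)    # Eq. (6)
    scores = []
    for i, (xi, yi) in enumerate(points):
        # Inverse-distance weights; the self weight is an assumption of this sketch.
        weights = [
            self_weight if i == j else 1.0 / math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
        ]
        sum_w = sum(weights)
        sum_w2 = sum(w * w for w in weights)
        numerator = sum(w * v for w, v in zip(weights, values)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        scores.append(numerator / denominator)                   # Eq. (4)
    return scores


if __name__ == "__main__":
    # Two clusters on a line: high values on the left, low values on the right.
    pts = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
    vals = [10, 10, 10, 1, 1, 1]
    for point, z in zip(pts, gi_star(pts, vals)):
        print(point, round(z, 3))
```

In this toy layout the left cluster receives positive scores and the right cluster negative ones. The function divides by zero if all values are equal or all weights are equal, which a production implementation would need to guard against.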
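2.2.3 Association rule mining
Association rules are the most commonly used form of user behavior pattern mining and are mainly used to discover latent relationships among users' pageviews. A typical association rule can be written as Eq. (7), which states that, with a given support and confidence, users who visit A and B also visit C.
$$A, B \Rightarrow C \ (\text{support}, \text{confidence}) \qquad (7)$$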
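
For reference, the two measures in Eq. (7) are usually defined as follows. These are the standard definitions rather than formulas quoted from this article; T denotes the set of sessions and X, Y are disjoint itemsets.

$$\text{support}(X \Rightarrow Y) = \frac{|\{t \in T : X \cup Y \subseteq t\}|}{|T|}, \qquad \text{confidence}(X \Rightarrow Y) = \frac{\text{support}(X \cup Y)}{\text{support}(X)}$$

As a hypothetical worked example, suppose that 280 of 1000 sessions contain A, B, and C, and 300 sessions contain both A and B. Then the rule $A, B \Rightarrow C$ has support $280/1000 = 28\%$ and confidence $280/300 \approx 93.3\%$.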
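Common association rule mining algorithms include Apriori[33], Eclat[34], and FP-growth[35]. FP-growth is a comparatively recent algorithm that uses a tree data structure (the FP-tree) to make database scanning far more efficient, overcoming the efficiency problem of Apriori. This study therefore uses FP-growth to mine association rules from users' data browsing and downloads. The idea of the algorithm is to compress the database into a frequent-pattern tree (FP-tree), which requires only two passes over the database. The algorithm proceeds as follows:
(1) Scan the transaction set to find all frequent items F, and build the frequent-item header table from the items in F in descending order of support count.
(2) Scan the transaction set again to build the FP-tree and fill in the pointers in the header table.
(3) In order from the bottom to the top of the header table, use the FP-tree to generate the conditional pattern base with each item as the suffix, and build its conditional pattern tree.
(4) Mine the conditional pattern trees recursively to obtain the frequent patterns.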
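
The four steps above can be sketched in a compact, unoptimized form. This is a teaching sketch of FP-growth, not the software used in the study: all names (`Node`, `build_tree`, `fp_growth`, `association_rules`, `mine`) are hypothetical, the threshold is given as an absolute count (`min_count`) rather than a percentage, and the header table is kept as a plain dictionary of node lists instead of linked pointers.

```python
from collections import defaultdict
from itertools import combinations


class Node:
    """One FP-tree node: an item, its count, and links to parent and children."""

    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_tree(weighted, min_count):
    """Steps (1)-(2): count items, then insert each transaction into the tree."""
    freq = defaultdict(int)
    for items, count in weighted:
        for item in set(items):
            freq[item] += count
    freq = {item: c for item, c in freq.items() if c >= min_count}

    root = Node(None, None)
    header = defaultdict(list)  # item -> all tree nodes holding that item
    for items, count in weighted:
        # Keep frequent items only, ordered by descending support count.
        ordered = sorted((i for i in set(items) if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += count
            node = child
    return header, freq


def fp_growth(weighted, min_count, suffix=()):
    """Steps (3)-(4): recurse on the conditional pattern base of each item."""
    header, freq = build_tree(weighted, min_count)
    found = {}
    for item in sorted(freq, key=lambda i: (freq[i], i)):  # bottom of table first
        found[tuple(sorted(suffix + (item,)))] = freq[item]
        base = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        found.update(fp_growth(base, min_count, suffix + (item,)))
    return found


def association_rules(found, min_confidence):
    """Derive rules X ==> Y with confidence = count(X and Y) / count(X)."""
    rules = []
    for itemset, count in found.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(itemset, size):
                confidence = count / found[antecedent]
                if confidence >= min_confidence:
                    consequent = tuple(i for i in itemset if i not in antecedent)
                    rules.append((antecedent, consequent, confidence))
    return rules


def mine(transactions, min_count):
    """Convenience wrapper: each transaction is a list of item labels."""
    return fp_growth([(t, 1) for t in transactions], min_count)
```

A minimum support expressed as a percentage, as in this study, would be converted to `min_count` by multiplying it by the number of transactions.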
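This study mines association rules separately for users' data browsing behavior and for registered users' data downloads, to explore the characteristics of users' data needs. For data browsing, active and inactive users are first distinguished by checking whether the number of datasets a user visited in 2014 reaches an empirical threshold (set to 10 in this study); users who reach it are regarded as active. Association rules for data browsing are mined from the access records of active users only. The records of inactive users are discarded to reduce uncertainty in the data.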

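3 Results and Analysis

3.1 Preprocessing results

The raw Web server log contains 11 062 608 records. Data cleaning left 2 845 150 valid records, about one quarter of the raw data. A total of 448 495 sessions and 76 111 unique users were identified, of whom 76 069 could be located. Details are given in Tab.2.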
Tab.2 Statistics of data preprocessing results

Raw log records  Cleaned records  Users   Sessions  Located users
11 062 608       2 845 150        76 111  448 495   76 069
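The accuracy of session identification is assessed by how well the probability distribution of session length fits a power-law distribution; the result is shown in Fig.4. The fitted function for the session length probability distribution is given in Eq. (8).
$$p(n) = 1.141 \times n^{-1.966} \qquad (8)$$
The coefficient of determination ($R^2$) of the fitted equation is 0.98 with $p < 0.001$, so the fit is significant at the 99% confidence level. The equation therefore fits very well, the session length distribution essentially follows a power law, and the preprocessing results are credible.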
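
The text does not state how the curve in Eq. (8) was fitted. One common approach, shown here purely as an assumption, is ordinary least squares on the log-transformed data, since $p(n) = a n^{b}$ becomes the straight line $\ln p = \ln a + b \ln n$. The sketch below fits synthetic points generated from Eq. (8) itself; `fit_power_law` is a hypothetical helper.

```python
import math


def fit_power_law(lengths, probabilities):
    """Fit p(n) = a * n**b by least squares in log-log space; return (a, b, r2)."""
    xs = [math.log(n) for n in lengths]
    ys = [math.log(p) for p in probabilities]
    k = len(xs)
    mean_x, mean_y = sum(xs) / k, sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return math.exp(intercept), slope, 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Synthetic data generated from Eq. (8), so the fit should recover its parameters.
    lengths = list(range(1, 21))
    probabilities = [1.141 * n ** -1.966 for n in lengths]
    a, b, r2 = fit_power_law(lengths, probabilities)
    print(round(a, 3), round(b, 3), round(r2, 3))
```

Note that $R^2$ computed in log-log space generally differs from $R^2$ computed on the untransformed probabilities, so the value reported in the text need not be reproduced by this procedure on the real data.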
Fig.4 Distribution of the session length probability

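3.2 Spatial analysis of user behavior

3.2.1 Spatial distribution of users
This study identified 76 111 users in China, of whom 76 069 were successfully located. Users are found in every province and municipality; the three with the most users are Beijing (16 432), Shandong Province (6424), and Jiangsu Province (4357). Fig.5 shows the distribution by province. The black dots mark places where users concentrate, and the dot size indicates the size of the concentration. The number of users in each province is shown in shades of orange: the darker the color, the more users.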
Fig.5 User distribution in China

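The platform's user registration database shows that users come mainly from universities and research institutes. Accordingly, this study collected the 2013 number of students enrolled in regular higher education institutions in each province, published by the National Bureau of Statistics (excluding Hong Kong, Macao, and Taiwan), and ran a Pearson correlation analysis between these data and the number of platform users in each province. The correlation coefficient is 0.324 with a P value of 0.075, indicating no significant correlation. Because users of scientific data are more likely to come from research universities, this study also collected the 2014 undergraduate and graduate enrollment of the comprehensive and science-and-engineering universities in China's "Project 211" and aggregated it by province. "Project 211" universities represent China's research universities well. The Pearson correlation between these data and the platform users gives a coefficient of 0.792 with a P value below 0.01, indicating a significant positive correlation. This result agrees with the hypothesis that users of scientific data are more likely to come from research universities.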
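
For readers who want to repeat this kind of check, the Pearson coefficient between two per-province series can be computed as in the minimal sketch below. The numbers are made up for illustration and are not the enrollment or user counts used in the study; significance testing (the P value) is not covered.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical values: students per province vs. platform users per province.
    students = [1, 2, 3, 4]
    users = [1, 3, 2, 4]
    print(pearson(students, users))
```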
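3.2.2 Spatial hotspot analysis
Spatial hotspot analysis was performed separately on the number of pageviews, the number of sessions, and the number of datasets visited. Pageviews reflect where the site's traffic comes from and users' browsing habits; the number of sessions reflects how often users use the site and thus how active they are; and the number of datasets visited reflects users' demand for data. Hotspot analysis identifies the hotspot regions for each of these three aspects, to guide service strategies for specific hotspot regions and to inform outreach in cold-spot regions.
Fig.6 shows the hotspot analysis of pageviews. The hotspots are mainly in Beijing, Tianjin, northern Hebei Province, and some cities in Sichuan Province, indicating concentrations of users with high visit volumes; the cold spots are mainly in southern Hebei Province, northern Henan Province, western Shandong Province, Guangdong Province, and Taiwan, indicating that users in these regions visit the site less actively. Elsewhere, visit volumes are distributed fairly randomly and show no clear spatial clustering.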
Fig.6 Hotspot analysis of user pageviews

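User sessions reflect how many times users use the website: one session represents one use of the site. Fig.7 shows the hotspot analysis of the number of sessions. The hotspots are mainly in Beijing, northern Hebei Province, Jiangsu Province, and Zhejiang Province; the cold spots are fairly concentrated, mainly in northern Henan Province, western Shandong Province, and Taiwan.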
Fig.7 Hotspot analysis of user sessions

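The number of datasets a user visits reflects how many datasets the user is interested in and, to some extent, the intensity of the user's research in Earth system science. Fig.8 shows the hotspot analysis of datasets visited. The hotspots, where users view many datasets, include Beijing, Tianjin, northern Hebei Province, Shaanxi Province, Jiangsu Province, and Zhejiang Province, similar to the session hotspots in Fig.7. In contrast, the cold spots for datasets visited are widespread, covering Henan Province, Shanxi Province, western Shandong Province, Sichuan Province, Guangdong Province, Fujian Province, parts of Northeast China, and Taiwan. Moreover, the hotspots have a very high confidence level (99%), and the cold spots also generally have very high confidence levels, so the spatial clustering pattern is significant.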
Fig.8 Hotspot analysis of datasets visits

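3.3 Association rule mining

3.3.1 Association rules for data visits
The first step of association rule mining is to find the frequent itemsets in the database. Frequent itemsets were mined separately for the visits of all users and for the visits of active users. The minimum support and confidence depend on the needs of the mining task. To present a handful of relatively frequent sets of visited data, the minimum support was increased in steps of 5% in the experiments until a suitable value was found. A minimum support of 10% for all users and 25% for active users yielded a moderate number of frequent itemsets. The results are listed in Tab.3 and Tab.4.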
Tab.3 Frequent itemsets for datasets visits of all users (S≥10%)

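Itemset     Support (S)/%  Description
100101-22   27.1           1:4 000 000 geomorphological map of China (morphology)
100101-2    12.9           1:4 000 000 resources and environment data of China (topography of China, 1988)
100101-18   11.6           National land use database (by province: 1980s, 1987-2001; by county: 1980s)
100101-38   10.8           National 1 km gridded population data (1995, 2000, 2003, 2005, and 2010)
100101-66   10.6           1:4 000 000 all-feature basic data of China (1970s-1990s)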
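Tab.4 Frequent itemsets for dataset visits of active users (S≥25%)

Itemset               Support (S)/%  Description
100101-18             34.1           National land use database (by province: 1980s, 1987-2001; by county: 1980s)
100101-38             32.4           National 1 km gridded population data (1995, 2000, 2003, 2005, and 2010)
100101-2              30.7           1:4 000 000 resources and environment data of China (topography of China, 1988)
100101-3              29.6           1:250 000 digital land use status map of Zhejiang Province, 1996
100101-30             29.2           National multi-year mean rainfall distribution map (1 km) (station establishment to 1996)
100101-38, 100101-18  28.0           National 1 km gridded population data; national land use database
100101-18, 100101-2   27.5           National land use database; 1:4 000 000 resources and environment data of China
100101-30, 100101-18  27.2           National multi-year mean rainfall distribution map; national land use database
100101-66             27.1           1:4 000 000 all-feature basic data of China (1970s-1990s)
100101-18, 100101-3   26.8           National land use database; 1:250 000 digital land use status map of Zhejiang Province, 1996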
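Tab.3 and Tab.4 show that the dataset most often visited by all users is 100101-22 (1:4 000 000 geomorphological map of China), visited by 27.1% of users, whereas the dataset most often visited by active users is 100101-18, the national land use database (by province: 1980s, 1987-2001; by county: 1980s). Notably, 100101-22 does not appear among the datasets most visited by active users: although it is in high demand overall, it does not attract broad attention from active users, which characterizes the demand for this kind of data. Dataset 100101-3, the 1:250 000 digital land use status map of Zhejiang Province, is followed by nearly 30% of active users and is the only non-national dataset among the top 10 frequent itemsets of active users. Data exploration suggests that this may be causally related to the relatively high share of active data visitors in East China, including Zhejiang Province. It is also consistent with the clear clustering of dataset-visit hotspots in Zhejiang shown in Fig.8.
For active users, Tab.5 lists the association rules that meet the support threshold of Tab.4 and have a confidence of at least 90%. The rule with the highest confidence can be read as follows: after visiting both 100101-30 (national multi-year mean rainfall distribution map, 1 km) and 100101-3 (1:250 000 digital land use status map of Zhejiang Province, 1996), an active user visits 100101-18 (national land use database; by province: 1980s, 1987-2001; by county: 1980s) with a probability of 98.5%. With support above 25% and confidence above 90%, 18 usable association rules are found. These rules can support precise data recommendation and inform the design of website navigation.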
Tab.5 Association rules (C≥90%)

Association rule                     Confidence (C)/%
100101-30 ==> 100101-2               90.4
100101-3 ==> 100101-18               90.8
100101-38, 100101-18 ==> 100101-2    91.4
100101-18, 100101-2 ==> 100101-3     92.4
100101-2, 100101-18 ==> 100101-38    92.9
100101-30, 100101-18 ==> 100101-3    93.0
100101-30 ==> 100101-18              93.1
100101-18, 100101-3 ==> 100101-30    94.1
100101-18, 100101-2 ==> 100101-30    94.2
100101-18, 100101-3 ==> 100101-2     94.6
100101-30, 100101-2 ==> 100101-3     95.4
100101-30, 100101-18 ==> 100101-2    95.4
100101-2, 100101-3 ==> 100101-30     96.9
100101-38, 100101-2 ==> 100101-18    97.2
100101-2, 100101-3 ==> 100101-18     97.8
100101-30, 100101-3 ==> 100101-2     98.2
100101-30, 100101-2 ==> 100101-18    98.2
100101-30, 100101-3 ==> 100101-18    98.5

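3.3.2 Association rules for data downloads or requests

The National Earth System Science Data Sharing Platform provides shared data in two ways: online download and offline request. The records of online downloads and offline requests made by registered users in 2014 were fed into the association rule mining algorithm to mine rules of data use. The results show no frequent itemsets with high support among downloads or requests. With a minimum support of 10%, only one dataset qualifies, and no usable association rule is found. The most popular dataset among registered users' downloads is 100101-66 (1:4 000 000 all-feature basic data of China).
Tab.6 lists the five most frequent items. Compared with the frequent itemsets for data browsing by all users in Tab.3, the datasets most often downloaded or requested by registered users agree well with those most often browsed. Among the top five items by support, 100101-66 (1:4 000 000 all-feature basic data of China), 100101-38 (national 1 km gridded population data), and 100101-18 (national land use database) appear in both lists, indicating that these three datasets are the most popular among both anonymous and registered users.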
Tab.6 Frequent itemsets for dataset downloads or requests (top 5)

Itemset       Support (S)/%  Description
100101-66     13.7           1:4 000 000 all-feature basic data of China (1970s-1990s)
100101-38     9.6            National 1 km gridded population data (1995, 2000, 2003, 2005, and 2010)
100101-11860  8.1            National 1:250 000 land cover data (1980s, 2005)
100101-18     8.0            National land use database (by province: 1980s, 1987-2001; by county: 1980s)
100101-29     7.3            Landsat MSS/TM/ETM+ (1973-2008, national coverage)

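4 Discussion and Conclusions

Based on the 2014 Web server logs and user service records of the National Earth System Science Data Sharing Platform website, this study extracted user behavior data, preprocessed and modeled them, and loaded them into a spatial database and a relational database. First, the spatial data mining method Getis-Ord Gi* was used to find the hotspot regions and regional differences of user behavior within China. Then, to explore the latent regularities in the relationship between users and data, association rules were mined from active users' data browsing and from data downloads, yielding valuable rules that can be used for personalized services such as data recommendation. The following conclusions can be drawn:
(1) Domestic users of the platform are found in every province and municipality of China. The three provinces (municipalities) with the most users are Beijing, Shandong Province, and Jiangsu Province. Pearson correlation analysis shows no significant correlation between platform users and university enrollment in general, but a significant positive correlation with enrollment at research universities. The platform thus has an established user base in research universities and considerable potential for growth in teaching-oriented institutions. This result can inform the platform's outreach.
(2) Hotspot analysis of pageviews, sessions, and datasets visited reveals the hotspot and cold-spot regions of user behavior. Pageview hotspots include Beijing, Tianjin, northern Hebei Province, and some cities in Sichuan Province; session hotspots include Beijing, northern Hebei Province, Jiangsu Province, and Zhejiang Province, indicating that users in these regions use the site frequently; and hotspots of datasets visited include Beijing, Tianjin, northern Hebei Province, Shaanxi Province, Jiangsu Province, and Zhejiang Province, indicating that research in Earth system science is relatively active there, with patterns of very high confidence.
(3) Association rules were mined separately for data browsing and for data downloads or requests, and the results reflect the characteristics of users' data needs. For data browsing, several high-confidence rules were found that can serve as a knowledge base for data recommendation. For downloads or requests, no significant rules were found. The datasets most often downloaded or requested by registered users agree well with those most often browsed. Among the frequent itemsets, 100101-66 (1:4 000 000 all-feature basic data of China), 100101-38 (national 1 km gridded population data), and 100101-18 (national land use database) are in the highest demand among both anonymous and registered users. Dataset 100101-3 (1:250 000 digital land use status map of Zhejiang Province) also appears frequently in the data access records, which may be related to the high browsing activity of users in Zhejiang Province.
By combining Web usage mining with spatial data mining, this study demonstrates a method for mining the geospatial patterns of Web user behavior and analyzes the 2014 user behavior of the National Earth System Science Data Sharing Platform. Building on this exploration, future work will collect and organize the platform's multi-year historical logs, analyze how user behavior patterns change over time, and refine user behavior modeling to support precise and personalized user services.
Acknowledgments: We thank the National Earth System Science Data Sharing Platform, one of the National Science and Technology Infrastructure Platforms, for providing the data for this study.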

The authors have declared that no competing interests exist.

[1]
Tenopir C, Allard S, Douglass K.Data sharing by scientists: practices and perceptions[J]. PLoS ONE, 2011,6(6):e21101.Background: Scientific research in the 21st century is more data intensive and collaborative than in the past. It is important to study the data practices of researchers 鈥揹ata accessibility, discovery, re-use, preservation and, particularly, data sharing. Data sharing is a valuable part of the scientific method allowing for verification of results and extending research from prior results. Methodology/Principal Findings: A total of 1329 scientists participated in this survey exploring current data sharing practices and perceptions of the barriers and enablers of data sharing. Scientists do not make their data electronically available to others for various reasons, including insufficient time and lack of funding. Most respondents are satisfied with their current processes for the initial and short-term parts of the data or research lifecycle (collecting their research data; searching for, describing or cataloging, analyzing, and short-term storage of their data) but are not satisfied with long-term data preservation. Many organizations do not provide support to their researchers for data management both in the short- and long-term. If certain conditions are met (such as formal citation and sharing reprints) respondents agree they are willing to share their data. There are also significant differences and approaches in data management practices based on primary funding agency, subject discipline, age, work focus, and world region. Conclusions/Significance: Barriers to effective data sharing and preservation are deeply rooted in the practices and culture of the research process as well as the researchers themselves. New mandates for data management plans from NSF and other federal agencies and world-wide attention to the need to share and preserve data could lead to changes. 
Large scale programs, such as the NSF-sponsored DataNET (including projects like DataONE) will both bring attention and resources to the issue and make it easier for scientists to apply sound data management principles.

[2]
刘闯. 美国国有科学数据共享管理机制及对我国的启示[J].中国基础科学,2003(1):34-39.正 数字化的科学数据管理是人类社会进入信息时代以来的新生事物,科学数据发展的速度常常超出科学家们预料,因此,在很多情况下人们尚未准备好很多事情就已经发生了。这种现象在发展中国家尤其突出,我国也遇到了类似的挑战。美国是世界上科学数据拥有量最多的国家,特别是在地球科学和

DOI

[ Liu C.Regulatory mechanisms of national science data sharing of USA and the inspiration to China[J]. China Basic Science, 2003,1:34-39. ]

[3]
徐冠华. 实施科学数据共享增强国家科技竞争力[J].中国基础科学,2003(1):5-9.正 21世纪是科学技术突飞猛进的世纪,科学技术已成为社会变革和发展的主导力量,人类的未来和国家的繁荣将更加依赖于科技创新及科技产业化。正如江泽民同志所指出:“一个没有创新能力的民族难以屹立于世界民族之林”。加快国家创新体系建设,全面提高科技创新能力,不仅是科技发展的要

DOI

[ Xu G H.Emhancing national science competitiveness with science data sharing[J]. China Basic Science, 2003,1:5-9. ]

[4]
Facca F M, Lanzi P L.Mining interesting knowledge from weblogs: asurvey[J]. Data & Knowledge Engineering, 2005,53(3):225-241.lt;h2 class="secHeading" id="section_abstract">Abstract</h2><p id="">Web Usage Mining is that area of Web Mining which deals with the extraction of interesting knowledge from logging information produced by Web servers. In this paper we present a survey of the recent developments in this area that is receiving increasing attention from the Data Mining community.</p>

DOI

[5]
Sajid N A, Zafar S, Asghar S.Sequential pattern finding: A survey[C]. 2010 International Conference on Information and Emerging Technologies (ICIET), 2010:1-6.

[6]
Wang Y T, Lee A J T. Mining Web navigation patterns with a path traversal graph[J]. Expert Systems with Applications, 2011,38(6):7112-7122.Understanding the navigational behaviour of website visitors is a significant factor of success in the emerging business models of electronic commerce and even mobile commerce. However, Web traversal patterns obtained by traditional Web usage mining approaches are ineffective for the content management of websites. They do not provide the big picture of the intentions of the visitors. The Web navigation patterns, termed throughout-surfing patterns (TSPs) as defined in this paper, are a superset of Web traversal patterns that effectively display the trends toward the next visited Web pages in a browsing session. TSPs are more expressive for understanding the purposes of website visitors. In this paper, we first introduce the concept of throughout-surfing patterns and then present an efficient method for mining the patterns. We propose a compact graph structure, termed a path traversal graph, to record information about the navigation paths of website visitors. The graph contains the frequent surfing paths that are required for mining TSPs. In addition, we devised a graph traverse algorithm based on the proposed graph structure to discover the TSPs. The experimental results show the proposed mining method is highly efficient to discover TSPs. (C) 2010 Elsevier Ltd. All rights reserved.

DOI

[7]
Bayir M A, Toroslu I H, Demirbas M, et al.Discovering better navigation sequences for the session construction problem[J]. Data & Knowledge Engineering, 2012,73(2):58-72.In this paper, we propose a novel page view based session model and session construction method to address the Web Usage Mining (WUM) problem. Unlike the simple session models, where sessions are sequences of web pages requested from the server (or served from a browser/proxy cache) and viewed in the browser (which may not guarantee a direct relationship between subsequent web pages in the session), we define a more realistic session model in which a session is a set of paths traversed in the web graph that corresponds to a user navigation performed by following links on web pages. We define the session construction process from raw server logs as a new graph problem and present a novel algorithm, Smart-SRA (Smart Session Reconstruction Algorithm), to solve this problem efficiently. An experimental evaluation based on data collected from real web access scenarios showed that Smart-SRA produces more accurate user sessions than the session construction methods found in the literature.

DOI

[8]
Chen L, Bhowmick S S, Nejdl W.COWES: Web user clustering based on evolutionary web sessions[J]. Data & Knowledge Engineering, 2009,68(10):867-885.lt;h2 class="secHeading" id="section_abstract">Abstract</h2><p id="">As one of the most important tasks of Web Usage Mining (WUM), web user clustering, which establishes groups of users exhibiting similar browsing patterns, provides useful knowledge to personalized web services and motivates long term research interests in the web community. Most of the existing approaches cluster web users based on the snapshots of web usage data, although web usage data are evolutionary in the nature. Consequently, the usefulness of the knowledge discovered by existing web user clustering approaches might be limited. In this paper, we address this problem by clustering web users based on the evolution of web usage data. Given a set of web users and their associated historical web usage data, we study how their usage data change over time and mine evolutionary patterns from each user&rsquo;s usage history. The discovered patterns capture the characteristics of changes to a web user&rsquo;s information needs. We can then cluster web users by analyzing common and similar evolutionary patterns shared by users. Web user clusters generated in this way provide novel and useful knowledge for various personalized web applications, including web advertisement and web caching.</p>

DOI

[9]
Dimopoulos C, Makris C, Panagis Y, et al.A web page usage prediction scheme using sequence indexing and clustering echniques[J]. Data & Knowledge Engineering, 2010,69(4):371-382.lt;h2 class="secHeading" id="section_abstract">Abstract</h2><p id="">In this paper we consider the problem of web page usage prediction in a web site by modeling users&rsquo; navigation history and web page content with weighted suffix trees. This user&rsquo;s navigation prediction can be exploited either in an on-line recommendation system in a web site or in a web page cache system. The method proposed has the advantage that it demands a constant amount of computational effort per one user&rsquo;s action and consumes a relatively small amount of extra memory space. These features make the method ideal for an on-line working environment. Finally, we have performed an evaluation of the proposed scheme with experiments on various web site log files and web pages and we have found that its quality performance is fairly well and in many cases an outperforming one.</p>

DOI

[10]
Narvekar M, Banu S S.Predicting user's Web navigation behavior using hybrid approach[J]. Procedia Computer Science, 2015,45:3-12.World Wide Web is growing rapidly with different kinds of websites making it complex along with increasing traffic on the web. However predicting what the user wants becomes very difficult .There are various prediction challenges which are faced, some of them includes long training time, more prediction time, low prediction accuracy, memory limitation etc. The System aims to increase the prediction accuracy particularly when there are many prediction models to consult. The System also aims to reduce the complexity of prediction and yield efficient result and make the prediction user friendly as well minimize Miss-Prediction. The Hybrid model developed combines Markov model as well as Hidden Markov Model which gives user the list of web pages of their interest. We have used various kinds of datasets to analyze, compare and show the effectiveness of Hybrid model using various parameters such as Accuracy, Precision and Miss-Prediction.

DOI

[11]
Mobasher B, Cooley R,Srivastava J.Automatic personalization based on Web usage mining[J]. Communications of the ACM, 2000,43(8):142-151.

[12]
Park D H, Kim H K, Choi I Y, et al.A literature review and classification of recommender systems research[J]. Expert Systems with Applications, 2010,39(11):10059-10072.

[13]
Pierrakos D, Paliouras G, Papatheodorou C, et al.Web usage mining as a tool for personalization: a survey. user modeling and user adapted interaction, 2003,13(4):311-372.This paper is a survey of recent work in the field of web usage mining for the benefitof research on the personalization of Web-based information services. The essence of personalization is the adaptability of information systems to the needs of their users. This issue is becoming increasingly important on the Web, as non-expert users are overwhelmed by the quantity of information available online, while commercial Web sites strive to add value to their services in order to create loyal relationships with their visitors-customers. This article views Web personalization through the prism of personalization policies adopted by Web sites and implementing a variety of functions. In this context, the area of Web usage mining is a valuable source of ideas and methods for the implementation of personalization functionality. We therefore present a survey of the most recent work in the field of Web usage mining, focusing on the problemsthat have been identified and the solutions that have been proposed.

DOI

[14]
Carmona C J, Ramírez-Gallego S, Torres F, et al.Web usage mining to improve the design of an e-commerce website: OrOliveSur com. expert systems with applications, 2012,39(12):11243-11249.Web usage mining is the process of extracting useful information from users history databases associated to an e-commerce website. The extraction is usually performed by data mining techniques applied on server log data or data obtained from specific tools such as Google Analytics. This paper presents the methodology used in an e-commerce website of extra virgin olive oil sale called www.OrOliveSur.com. We will describe the set of phases carried out including data collection, data preprocessing, extraction and analysis of knowledge. The knowledge is extracted using unsupervised and supervised data mining algorithms through descriptive tasks such as clustering, association and subgroup discovery; applying classical and recent approaches. The results obtained will be discussed especially for the interests of the designer team of the website, providing some guidelines for improving its usability and user satisfaction. (C) 2012 Elsevier Ltd. All rights reserved.

DOI

[15]
Yin P Y, Guo Y M.Optimization of multi-criteria website structure based on enhanced tabu search and web usage mining[J]. Applied Mathematics and Computation, 2013,219(24):11082-11095.With the rapid development in World Wide Web (WWW) technology, the number of webpages and the volume of information content have been overwhelming. It becomes increasingly important to help users find relevant webpage and information more easily and quickly. This situation causes widespread attention in constructing adaptive websites which automatically reorganize the structure or content by learning from the users' browsing behaviors, as such the usage of the websites is improved. In this study we propose a new formulation for the website structure optimization (WSO) problem based on a comprehensive survey of existing works and practice considerations. An enhanced tabu search (ETS) algorithm is proposed with advanced search features of multiple neighborhoods, adaptive tabu lists, dynamic tabu tenure, and multi-level aspiration criteria. The experimental result on 24 real-world problem instances shows that the proposed US algorithm can obtain a better value of web usage estimation than a genetic algorithm method. Moreover, ETS is computationally efficient due to the strategy that handles problem constraints on-the-fly when constructing the solution. (c) 2013 Elsevier Inc. All rights reserved.

DOI

[16]
Song Q, Shepperd M J.Mining Web browsing patterns for E-commerce[J]. Computers in Industry, 2006,57(7):622-630.Web user clustering, Web page clustering, and frequent access path recognition are important issues in E-commerce. They can be used for the purposes of marketing strategies and product offerings, mass customization and personalization, and Web site adaptation. In this paper, we view the topology of a Web site as a directed graph, and use a user's access information on all URLs of a Web site as features to characterize the user and use all users鈥 access information on a URL as features to characterize the URL. The user clusters and Web page clusters are discovered by both vector analysis and fuzzy set theory based methods. The frequent access paths are recognized based on Web page clusters and take into account the underlying structure of a Web site. Our method does not require the identification of user sessions from Web server logs, and both a user and a page can be assigned to more than one cluster. Our frequent access path identification algorithm is not based on sequential pattern mining, so it avoids the performance difficulties of the latter. We applied our algorithms to five real world data sets of different sizes. Our results show the effectiveness of the proposed algorithms with the fuzzy set theory based methods being slightly more accurate.

DOI

[17]
Lopes P, Roy B.Dynamic recommendation system using Web usage mining for e-commerce users[J]. Procedia Computer Science, 2015,45:60-69.E-commerce organizations are growing exponentially with time in terms of both business and data. Many organizations rely on these websites to attract new customers and retain the existing ones. In order to achieve this goal web log files can be used that records customer's access patterns. Using traditional web usage mining techniques in an enhanced manner valuable patterns and hidden knowledge can be discovered. This paper focuses on providing real time dynamic recommendation to all the visitors of the website irrespective of been registered or unregistered. Action based rational recommendation technique is proposed that makes use of lexical patterns to generate item recommendation. Effectiveness of the proposed system is evaluated by collecting real time E commerce data and comparing the system with user based and product based techniques. Results prove that the proposed system yield good quality accuracy and minimizes limitations of traditional recommendation system.

[18]
Hung Y S, Chen K L B, Yang C T, et al. Web usage mining for analyzing elder self-care behavior patterns[J]. Expert Systems with Applications, 2013,40(2):775-783. The rapid growth of the elderly population has increased the need to support elders in maintaining independent and healthy lifestyles in their homes rather than through more expensive and isolated care facilities. Self-care can improve the competence of elderly participants in managing their own health conditions without leaving home. The main purpose of this study is to understand the self-care behavior of elderly participants in a developed self-care service system that provides self-care service and to analyze the daily self-care activities and health status of elders who live at home alone. To understand elder self-care patterns, log data from actual cases of elder self-care service were collected and analysed by Web usage mining. This study analysed 3391 sessions of 157 elders for the month of March, 2012. First, self-care use cycle, time, function numbers, and the depth and extent (range) of services were statistically analysed. Association rules were then used for data mining to find relationship between these functions of self-care behavior. Second, data from interest-based representation schemes were used to construct elder sessions. The ART2-enhance K-mean algorithm was then used to mine cluster patterns. Finally, sequential profiles for elder self-care behavior patterns were captured by applying sequence-based representation schemes in association with Markov models and ART2-enhanced K-mean clustering algorithms for sequence behavior mining cluster patterns for the elders. The analysis results can be used for research in medicine, public health, nursing and psychology and for policy-making in the health care domain. (C) 2012 Elsevier Ltd. All rights reserved.

[19]
Munk M, Drlík M. Impact of different pre-processing tasks on effective identification of users' behavioral patterns in Web-based educational system[J]. Procedia Computer Science, 2011,4:1640-1649. Analyzing the unique types of data that come from educational systems can help find the most effective structure of the e-learning courses, optimize the learning content, recommend the most suitable learning path based on student's behavior, or provide more personalized environment. We focus only on the processes involved in the data preparation stage of web usage mining. Our objective is to specify the inevitable steps that are required for obtaining valid data from the stored logs of the web-based educational system. We compare three datasets of different quality obtained from logs of the web-based educational system and pre-processed in different ways: data with identified users' sessions and data with the reconstructed path among course activities. We try to assess the impact of these advanced techniques of data pre-processing on the quantity and quality of the extracted rules that represent the learners' behavioral patterns in a web-based educational system. The results confirm some initial assumptions, but they also show that the path reconstruction among visited activities in e-learning course has not statistically significant effect on quality and quantity of the extracted rules.

[20]
Romero C, Espejo P G, Zafra A, et al. Web usage mining for predicting final marks of students that use Moodle courses[J]. Computer Applications in Engineering Education, 2013,21(1):135-146. This paper shows how web usage mining can be applied in e-learning systems in order to predict the marks that university students will obtain in the final exam of a course. We have also developed a specific Moodle mining tool oriented for the use of not only experts in data mining but also of newcomers like instructors and courseware authors. The performance of different data mining techniques for classifying students are compared, starting with the student's usage data in several Cordoba University Moodle courses in engineering. Several well-known classification methods have been used, such as statistical methods, decision trees, rule and fuzzy rule induction methods, and neural networks. We have carried out several experiments using all available and filtered data to try to obtain more accuracy. Discretization and rebalance pre-processing techniques have also been used on the original numerical data to test again if better classifier models can be obtained. Finally, we show examples of some of the models discovered and explain that a classifier model appropriate for an educational environment has to be both accurate and comprehensible in order for instructors and course administrators to be able to use it for decision making. (C) 2010 Wiley Periodicals, Inc. Comput Appl Eng Educ 21: 135-146, 2013

[21]
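王琨,郭风华,李仁杰,等.基于Tripadvisor的中国旅游地国际关注度及空间格局[J].地理科学进展,2014(11):1462-1473. User-generated content (UGC) has gradually become an important data source for research on tourist behavior and perception. Unlike the usual approach of describing online attention through search engine keyword counts, this paper introduces the hierarchical structure of an online community as a weighting factor and builds a tourism attention model based on community UGC, which can flexibly adjust the emphasis of the model and optimize the computed results. A study of the well-known travel community Tripadvisor finds that overseas community users' attention to tourism in China shows three typical characteristics: (1) attention is concentrated on a few attractions, such as the Great Wall, Mount Tai, Huangshan, Jiuzhaigou, and Zhangjiajie, and on a small number of destination cities, such as Beijing, Hong Kong, Shanghai, and Guilin, while a large number of attractions and destinations receive little attention, showing a "long tail" and polarization; (2) the attention spaces of attractions and destination cities are clearly coupled, with highly watched attractions mostly adjacent to or belonging to highly watched cities, e.g., Yangshuo in Guilin, the Great Wall in Beijing, Dujiangyan and Jiuzhaigou near Chengdu, and West Lake in Hangzhou; (3) the overall attention space shows a high-to-low "east-central-west" pattern that is basically coupled with the "east-central-west" gradient of China's regional economy, and attention centers such as Beijing, Hong Kong, Guangzhou, Shenzhen, Shanghai, and Chengdu also coincide with regional economic centers. Tourism resource endowment, electronic word-of-mouth communication patterns, geographic location, and economic level, together with the cultural background, economic development status, and geographic location of the countries of those paying attention, are the main factors influencing tourist attention and its spatial pattern. The tourism attention model aims to solve the problem of quantitatively calculating internet users' attention to regional tourism and offers a new approach for tourism geography research based on internet UGC.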

[ Wang K, Guo F H, Li R J, et al. Tourism attention degree about China from overseas and its spatial patterns based on Tripadvisor[J]. Progress in Geography, 2014(11):1462-1473. ]

[22]
Arbelaitz O, Gurrutxaga I, Lojo A, et al. Web usage and content mining to extract knowledge for modelling the users of the Bidasoa Turismo website and to adapt it[J]. Expert Systems with Applications, 2013,40(18):7478-7491. The tourism industry has experienced a shift from offline to online travellers and this has made the use of intelligent systems in the tourism sector crucial. These information systems should provide tourism consumers and service providers with the most relevant information, more decision support, greater mobility and the most enjoyable travel experiences. As a consequence, Destination Marketing Organizations (DMOs) not only have to respond by adopting new technologies, but also by interpreting and using the knowledge created by the use of these techniques. This work presents the design of a general and non-invasive web mining system, built using the minimum information stored in a web server (the content of the website and the information from the log files stored in Common Log Format (CLF)) and its application to the Bidasoa Turismo (BTw) website. The proposed system combines web usage and content mining techniques with the three following main objectives: generating user navigation profiles to be used for link prediction; enriching the profiles with semantic information to diversify them, which provides the DMO with a tool to introduce links that will match the users taste; and moreover, obtaining global and language-dependent user interest profiles, which provides the DMO staff with important information for future web designs, and allows them to design future marketing campaigns for specific targets. The system performed successfully, obtaining profiles which fit in more than 60% of cases with the real user navigation sequences and in more than 90% of cases with the user interests. Moreover the automatically extracted semantic structure of the website and the interest profiles were validated by the BTw DMO staff, who found the knowledge provided to be very useful for the future. (C) 2013 Elsevier Ltd. All rights reserved.

[23]
Cooley R, Mobasher B, Srivastava J.Data preparation for mining World Wide Web browsing patterns[J]. Knowledge and Information Systems, 1999,1(1):5-32.

[24]
Kosala R, Blockeel H. Web mining research: a survey[J]. SIGKDD Explorations, 2000,2(1):1-15. With the huge amount of information available online, the World Wide Web is a fertile area for data mining research. The Web mining research is at the cross road of research from several research communities, such as database, information retrieval, and within AI, especially the sub-areas of machine learning and natural language processing. However, there is a lot of confusions when comparing research efforts from different point of views. In this paper, we survey the research in the area of Web mining, point out some confusions regarded the usage of the term Web mining and suggest three Web mining categories. Then we situate some of the research with respect to these three categories. We also explore the connection between the Web mining categories and the related agent paradigm. For the survey, we focus on representation issues, on the process, on the learning algorithm, and on the application of the recent works as the criteria. We conclude the paper with some research issues.

[25]
Berendt B, Mobasher B, Nakagawa M, et al. The impact of site structure and user environment on session reconstruction in web usage analysis[A]. In: Zaïane O, Srivastava J, Spiliopoulou M, et al (eds.). WEBKDD 2002 - mining web data for discovering usage patterns and profiles[M]. Berlin: Springer Berlin Heidelberg, 2003,2703:159-179.

[26]
Levene M, Borges J, Loizou G. Zipf's law for Web surfers[J]. Knowledge and Information Systems, 2001,3(1):120-129. One of the main activities of Web users, known as 'surfing', is to follow links. Lengthy navigation often leads to disorientation when users lose track of the context in which they are navigating and are unsure how to proceed in terms of the goal of their original query. Studying navigation patterns of Web users is thus important, since it can lead us to a better understanding of the problems users face when they are surfing. We derive Zipf's rank frequency law (i.e., an inverse power law) from an absorbing Markov chain model of surfers' behavior assuming that less probable navigation trails are, on average, longer than more probable ones. In our model the probability of a trail is interpreted as the relevance (or 'value') of the trail. We apply our model to two scenarios: in the first the probability of a user terminating the navigation session is independent of the number of links he has followed so far, and in the second the probability of a user terminating the navigation session increases by a constant each time the user follows a link. We analyze these scenarios using two sets of experimental data sets showing that, although the first scenario is only a rough approximation of surfers' behavior, the data is consistent with the second scenario and can thus provide an explanation of surfers' behavior.

[27]
Liu B.Web data mining (second edition)[M]. Chicago: Springer, 2011:540-542.

[28]
Getis A, Ord J K. The analysis of spatial association by use of distance statistics[J]. Geographical Analysis, 1992,24:189-206. Introduced in this paper is a family of statistics, G, that can be used as a measure of spatial association in a number of circumstances. The basic statistic is derived, its properties are identified,

[29]
Peeters A, Zude M, Käthner J, et al. Getis-Ord’s hot- and cold-spot statistics as a basis for multivariate spatial clustering of orchard tree data[J]. Computers and Electronics in Agriculture, 2015,111:140-150. Precision agriculture aims at sustainably optimizing the management of cultivated fields by addressing the spatial variability found in crops and their environment. Spatial variability can be evaluated using spatial cluster analysis, which partitions data into homogeneous groups, considering the geographical location of features and their spatial relationships. Spatial clustering methods evaluate the degree of spatial autocorrelation between features and quantify the statistical significance of identified clusters. Clustering of orchard data calls for an approach which is based on modeling point data, i.e. individual trees, which can be related to site-specific measurements. We present and evaluate a spatial clustering method using the Getis–Ord Gi* statistic to the analysis of tree-based data in an experimental orchard. We examine the robustness of this method for the analysis of “hot-spots” (clusters of high data values) and “cold-spots” (clusters of low data values) in orchards and compare it to the k-means clustering algorithm, a widely-used aspatial method. We then present a novel approach which accounts for the spatial structure of data in a multivariate cluster analysis by combining the spatial Getis–Ord Gi* statistic with k-means multivariate clustering. The combined method improved results by both discriminating among features values as well as representing their spatial structure and therefore represents a superior technique for identifying homogenous spatial clusters in orchards. This approach can be used as a tool for precision management of orchards by partitioning trees into management zones.

[30]
Feske M L, Teeter L D, Musser J M, et al. Including the third dimension: a spatial analysis of TB cases in Houston Harris county[J]. Tuberculosis, 2011,91(Supplement 1):24-33. To reach the tuberculosis (TB) elimination goals established by the Institute of Medicine (IOM) and the Centers for Disease Control and Prevention (CDC), measures must be taken to speed the currently stagnant TB elimination rate and curtail a future peak in TB incidence. Increases in TB incidence have historically coincided with immigration, poverty, and joblessness; all situations that are currently occurring worldwide. Effective TB elimination strategies will require the geographical elucidation of areas within the U.S. that have endemic TB, and systematic surveillance of the locations and location-based risk factors associated with TB transmission. Surveillance data was used to assess the spatial distribution of cases, the yearly TB incidence by census tract, and the statistical significance of case clustering. The analysis revealed that there are neighborhoods within Houston/Harris County that had a heavy TB burden. The maximum yearly incidence varied from 245/100,000–754/100,000 and was not exclusively dependent of the number of cases reported. Geographically weighted regression identified risk factors associated with the spatial distribution of cases such as: poverty, age, Black race, and foreign birth. Public transportation was also associated with the spatial distribution of cases and census tracts identified as high incidence were found to be irregularly clustered within communities of varied SES.

[31]
Luković J, Blagojević D, Kilibarda M, et al. Spatial pattern of north Atlantic oscillation impact on rainfall in Serbia[J]. Spatial Statistics, 2015,14(Part A):39-52. This study examines the spatial pattern of relationships between annual, seasonal and monthly rainfall in Serbia, and the North Atlantic Oscillation (NAO) for the period of 1961–2009. The first correlation analysis between rainfall and the NAO was performed using a Pearson product-moment test. Results suggested negative, mainly statistically significant correlations at annual and winter scales as was expected. However, the highest percentage of stations showed significant result in October suggesting a strong impact of a large scale atmospheric mode throughout a wet season in Serbia. Further spatial analysis that incorporated a spatial autocorrelation statistic of correlation coefficients showed significant clustering at all temporal scales.

[32]
Chopin P, Blazy J-M. Assessment of regional variability in crop yields with spatial autocorrelation: banana farms and policy implications in Martinique[J]. Agriculture, Ecosystems & Environment, 2013,181:12-21. Agricultural research can support farmers and policy makers' decisions by identifying the causes of spatial variability in crop yield at a regional level. In this paper, we propose a method that combines spatial autocorrelation measures and a farm network survey. This method is intended to describe the causes of spatial variability in crop yields, along with key crop management practices for reaching the best yields and the physical and socio-economic constraints of adopting these practices. This causal and hierarchical analysis of cropping system performance has the advantage of (1) preventing bias in the correlation between variables from the yield gap analysis and (2) formulating spatially targeted policies that are aimed at relaxing adoption constraints at the territorial level. After introducing the method and its different steps, we present the results of the assessment of the spatial variability in banana yields in Martinique (Caribbean). Our study has clearly shown that the planting stage is one of the most important aspects of banana production: allowing a long fallow period, plowing for soil preparation and using seedlings that are produced by tissue culture were associated with the best yields. However, several constraints limit their adoption by farmers at the regional level. The limiting factors were steep slopes, small farm size and low cash flow. We observed no relationship between pesticide use and yields. These study results finally permit the elaboration of spatially targeted policy recommendations to improve crop yields in a sustainable manner. It mainly consists in promoting and facilitating the adoption of good plantation practices for smallholders.

[33]
Agrawal R, Srikant R. Fast algorithms for mining association rules in large databases[C]. Proceedings of the 20th International Conference on Very Large Data Bases, 1994.

[34]
Zaki M J, Parthasarathy S, Ogihara M, et al. Parallel algorithms for discovery of association rules[J]. Data Mining and Knowledge Discovery, 1997,1(4):343-373. Discovery of association rules is an important data mining task. Several parallel and sequential algorithms have been proposed in the literature to solve this problem. Almost all of these algorithms make repeated passes over the database to determine the set of frequent itemsets (a subset of database items), thus incurring high I/O overhead. In the parallel case, most algorithms perform a sum-reduction at the end of each pass to construct the global counts, also incurring high synchronization cost. In this paper we describe new parallel association mining algorithms. The algorithms use novel itemset clustering techniques to approximate the set of potentially maximal frequent itemsets. Once this set has been identified, the algorithms make use of efficient traversal techniques to generate the frequent itemsets contained in each cluster. We propose two clustering schemes based on equivalence classes and maximal hypergraph cliques, and study two lattice traversal techniques based on bottom-up and hybrid search. We use a vertical database layout to cluster related transactions together. The database is also selectively replicated so that the portion of the database needed for the computation of associations is local to each processor. After the initial set-up phase, the algorithms do not need any further communication or synchronization. The algorithms minimize I/O overheads by scanning the local database portion only twice. Once in the set-up phase, and once when processing the itemset clusters. Unlike previous parallel approaches, the algorithms use simple intersection operations to compute frequent itemsets and do not have to maintain or search complex hash structures. Our experimental testbed is a 32-processor DEC Alpha cluster inter-connected by the Memory Channel network. We present results on the performance of our algorithms on various databases, and compare it against a well known parallel algorithm. The best new algorithm outperforms it by an order of magnitude.

[35]
Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation[J]. SIGMOD Record, 2000,29(2):1-12. Mining frequent patterns in transaction databases, timeseries databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist prolific patterns and/or long patterns. In this study, we propose a novel frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a highly condensed, much smaller data structure, which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern fragment growth method to avoid the costly generation of a large n...
