An Approach for Prediction of Web User Behavior and Data Recommendation for Geoscience Data Sharing Portals

WANG Mo; WANG Juanle; HE Yuntao

doi:10.3724/SP.J.1047.2017.00595

Journal of Geo-information Science, 2017, Vol. 19, Issue 5: 595-604


Original Article


WANG Mo ^1,2^, WANG Juanle ^1,4,*^, HE Yuntao ^3^

1. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, CAS, Beijing 100101, China
2. University of Chinese Academy of Sciences, Beijing 100049, China
3. National Science & Technology Infrastructure Center, Beijing 100862, China
4. Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing 210023, China

*Corresponding author: WANG Juanle, E-mail:wangjl@igsnrr.ac.cn

Received date: 2016-11-02

Request revised date: 2017-01-22

Online published: 2017-05-20

Copyright: Editorial Office of Journal of Geo-information Science

Abstract

Efficient and precise discovery of geoscience data on data sharing websites has been a challenge for years. This study applied Web mining techniques to the National Earth Science Data Sharing Platform to derive users' search and visit behaviors with a clustering algorithm, and proposed cluster-based approaches for search recommendation and visit recommendation. In the data preprocessing stage, data cleaning, user identification, session identification, and search term extraction were performed. In the user behavior mining stage, the DBSCAN algorithm was used for session clustering with the Jaccard distance metric, given the binary nature of session vectors. To mine user search patterns, the collection of search terms in each cluster was regarded as a document and the collection of all historical search terms as the corpus, and the TF-IDF value of each search term in each cluster was computed. For online search recommendation, the real-time search term is used to look up the TF-IDF values in the clusters, the cluster with the highest TF-IDF value is returned, and its most frequent items form the recommendation list. For online visit recommendation, the real-time visit vector is used to query the clusters by its distance to the cluster centroids, and the most frequent items in the nearest cluster are recommended. The experimental results revealed hot research topics in geoscience in recent years. The proposed search recommendation achieved fair precision and recall, and visit recommendation improved considerably on the frequency-based approach. It can be concluded that: (1) Web users of geoscience data sharing are more professional and more predictable than ordinary Web users; (2) DBSCAN is a density-based clustering algorithm, so it is vital to define user behavior specifically and to choose a proper distance metric; (3) a TF-IDF-based approach to predicting users' search needs is feasible, and the resulting search recommendation can complement keyword-based searching. The outcome of this study could contribute to the development of the National Earth Science Data Sharing Platform and of other science data sharing platforms.

Key words: Web Usage Mining; spatial data mining; user behavior mining; science data sharing; Earth System Science data

Cite this article

WANG Mo, WANG Juanle, HE Yuntao. An Approach for Prediction of Web User Behavior and Data Recommendation for Geoscience Data Sharing Portals[J]. Journal of Geo-information Science, 2017, 19(5): 595-604. DOI: 10.3724/SP.J.1047.2017.00595

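1 Introduction

Supported by Earth observation, ground-based detection, station monitoring, statistical analysis, and other techniques, the geosciences have accumulated massive scientific data resources over time. Scientific data sharing is the most effective way to make full use of these data. At present, the main model of scientific data sharing in China is that data sharing institutions build web platforms to publish shared data and provide the data through download links or by application^[1-2]. In the web environment, fast and precise discovery of data resources is one of the long-standing challenges for scientific data sharing platforms. Geoscience data have complex spatiotemporal attributes and come from widely distributed sources, so traditional web information retrieval methods often cannot precisely find data that match the time, space, and theme a user needs. Personalized recommendation is an effective way to deal with information overload on the web and an important means of making scientific data sharing services more precise and efficient. Mining users' historical behavior reveals the regularities and patterns of their behavior and provides a basis for personalized recommendation.

Research on Web user behavior mining emerged with the spread of the Internet. Commonly used mining methods include statistical analysis, clustering or classification, association rule or frequent itemset mining, and time series analysis^[3]. Because clustering can reveal similar user or page characteristics, user clustering and page clustering are the most widely used mining methods^[4]. Researchers have applied clustering to mine Web user behavior patterns. For example, Mobasher et al.^[5] clustered users with the K-means algorithm; Zhang et al.^[6] clustered user sessions with the self-organizing map (SOM) algorithm and used the result for online user behavior prediction; Deng et al.^[7] combined clustering with collaborative filtering to improve the response speed of recommender systems. However, these clustering algorithms require parameter selection: K-means needs the number of clusters to be set in advance, and SOM also needs initial parameters and takes too long to train on large datasets.

In personalized recommendation, the most common methods are content-based filtering and collaborative filtering^[8]. Content-based filtering makes recommendations from a user's historical interests and the attributes of objects^[9], whereas collaborative filtering relies on the interests of a user community in items, computes the similarity between users or between items, and recommends according to that similarity^[10]. Based on these methods, recommender systems are widely used in e-commerce for products such as books^[11], movies^[12-13], and music^[14-15]. In public service domains such as scientific data sharing, however, personalized recommender systems are rarely applied. The Web user behavior mining described above can supply pattern input for personalized recommendation algorithms. In practice, however, association rules, frequent itemsets, and sequential patterns require manually chosen parameters for the output patterns, and these methods have difficulty finding behavior patterns that are infrequent but valuable^[16]. Clustering, with a suitable algorithm, avoids the subjective choice of pattern mining parameters. Within a single session, users of geoscience data sharing usually visit geospatial data with similar themes and spatiotemporal attributes, so session clustering patterns can reflect the intrinsic associations among data. This study therefore uses a clustering algorithm to mine the behavior patterns of users of a geoscience data sharing website. Targeting geoscience data sharing platforms, it explores user behavior patterns and develops search and visit recommendation algorithms for geoscience data sharing services. The study object is the Earth System Science Data Sharing Platform (geodata.cn), a major geoscience data sharing network in China. Its shared data resources cover a full range of types, including the atmosphere, land surface, terrestrial hydrosphere, natural resources, and oceans^[17-18], which makes it representative of data sharing in the geosciences.

2 Data and Methods

Figure 1 shows the technical workflow designed in this study for mining user behavior patterns and recommending data on the geoscience data sharing website. The workflow is a complete process from preprocessing of the raw data to generation of recommendations. The offline part comprises data preprocessing, session clustering, and search pattern mining; the online part comprises real-time visit recommendation and real-time search recommendation. Search patterns are mined on the basis of the session clusters. The real-time visit recommendation algorithm and the real-time search recommendation algorithm take the visit patterns and the search patterns as input, respectively. User behavior patterns (session clustering patterns and search patterns) are mined offline. In the application scenario, the two real-time algorithms process the user's visits and searches, match them against the offline visit and search patterns using a similarity measure, and produce visit and search recommendations from the matching results.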

Fig. 1 Workflow for search and visit recommendation

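2.1 Data

The data for this study are the Web server logs of the National Earth System Science Data Sharing Platform from 2011 to 2015. Web server logs record visitors' navigation behavior and are the primary data source in Web usage mining. Each access to the server corresponds to an HTTP request and produces one record in the server access log. Each log record contains several fields (determined by the log format), usually including the date and time of the request, the client IP address, the requested resource, the parameters used by the invoked Web application, the request status, the HTTP method, the user agent, and the referring resource. The Web server logs are in Apache's NCSA ECLF format (Fig. 2).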

Fig. 2 An example of Web server log entries

The five years of Web server logs contain about 54.37 million raw entries, and the website receives about 11 million visits per year. Taking one log entry as an example, the information in Table 1 can be extracted from the log data.

Tab. 1 Contents of a Web server log entry

Field	Detail
Host IP	128.227.49.92
Time	05/Aug/2014:10:26:42 +0800
Method	GET
URL	/extra/res/libs/kendo/extensions/kendo.extension.ui.js
Protocol	HTTP/1.1
Status	200
File size	15 072 bytes
Referrer	http://www.geodata.cn/extra/TopicsWin2/pro3.jsp
Client	Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0

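2.2 Data preprocessing

Raw server access logs contain a large amount of redundant content that is irrelevant to the mining task, and their format does not meet the requirements of the mining task. Preprocessing Web server logs is a challenging task in Web mining. Multiple preprocessing algorithms and heuristics have been developed for data cleaning, user identification, and session identification^[19]. In this study, preprocessing extracts the required data from the raw data and converts them into the format required by the mining task. Based on the characteristics of the Earth System Science Data Sharing website, Wang et al. developed a systematic approach to data cleaning, user identification, and session identification in earlier work^[20] and showed in practice that its preprocessing results are reliable^[21]. This study adopts the same preprocessing approach, which consists of four steps: data cleaning, user identification, session identification, and search term extraction.

2.2.1 Data cleaning

Data cleaning removes content in the raw data that is irrelevant to the mining task. The raw logs contain many records unrelated to this study, such as browser requests for audio files, images, and CSS style sheets, requests from web crawlers, and requests for pages unrelated to the themes of interest.

The data cleaning algorithm used in this study has three steps: ① Remove useless request records, such as requests for audio, image, and style sheet files, by checking the extension of the requested file in the URL. ② Remove web crawler requests, which are recognized by the search engine name (e.g., Baidu, Google) in the client field of the log entry. ③ Remove failed requests using the request status code: every log entry with a status code below 200 or above 400 is an unsuccessful request and should be removed.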
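The three cleaning rules can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `keep_entry`, the extension list, and the crawler keyword list are assumptions (the text names only audio, image, and style sheet files, and Baidu and Google as examples).

```python
import posixpath
from urllib.parse import urlsplit

# Assumed lists for illustration; the paper does not enumerate them.
STATIC_EXTENSIONS = {".css", ".png", ".jpg", ".jpeg", ".gif", ".ico", ".mp3", ".wav"}
CRAWLER_KEYWORDS = ("baidu", "google")


def keep_entry(url, status, user_agent):
    """Return True if a log entry survives the three cleaning steps."""
    # Step 1: drop requests for audio, image, and style sheet files.
    extension = posixpath.splitext(urlsplit(url).path)[1].lower()
    if extension in STATIC_EXTENSIONS:
        return False
    # Step 2: drop requests whose client field names a search engine crawler.
    agent = user_agent.lower()
    if any(keyword in agent for keyword in CRAWLER_KEYWORDS):
        return False
    # Step 3: drop failed requests (status code below 200 or above 400).
    return 200 <= status <= 400


entries = [
    ("/Portal/mdsearch/regionresult.jsp?sw=x&sm=1", 200, "Mozilla/5.0 Firefox/31.0"),
    ("/images/logo.png", 200, "Mozilla/5.0 Firefox/31.0"),
    ("/Portal/index.jsp", 200, "Mozilla/5.0 (compatible; Baiduspider/2.0)"),
    ("/Portal/missing.jsp", 404, "Mozilla/5.0 Firefox/31.0"),
]
cleaned = [entry for entry in entries if keep_entry(*entry)]
print(len(cleaned))  # 1: only the first entry is kept
```

In a real pipeline the extension and crawler lists would be tuned to the site, since both directly decide which records reach the later mining steps.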

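2.2.2 User identification

User identification distinguishes the different users who visit the website. The log data in this study contain no user authentication information, so the IP address is the most basic information for distinguishing users. In local area networks and behind proxy servers, however, the IP address alone is not enough to identify individual users accurately. When a proxy server shares one IP address among several computers, the referrer field of the log and the site topology are used to check whether the user could have reached the currently requested page through links from recently visited pages. The heuristic used in this study has three steps; see reference [21] for details. Practical application shows that, for data sharing websites organized around category-based navigation, this method distinguishes users within a local area network with relatively high accuracy.

2.2.3 Session identification

After user identification, each user's clickstream on the website must be split into individual visits, a process called session identification. The common approach is the time window method: a time threshold (e.g., 30 min) is set to delimit user sessions, and a new session starts when a user's visit exceeds that threshold. In a comparative study, Berendt et al.^[22] found that the referrer-based heuristic algorithm achieves a good identification rate. On top of a time window, this method considers whether the referrer page appears among the most recent visit records, so it can be regarded as an improvement on the time window method. This study uses it for session identification.

Because this study mines users' search and visit patterns from clusters of user sessions, the sessions need further filtering. Keeping active sessions and discarding inactive ones increases confidence in the resulting patterns. A session is regarded as active if its length reaches half of the average session length; such sessions are considered valuable for predicting user behavior and are stored in the database.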
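As a rough illustration of the basic time window method and the activity filter, the sketch below splits one user's request times at gaps longer than 30 min and keeps sessions that reach half of the average length. It does not implement the referrer-based heuristic the study actually uses. Treating the threshold as an inactivity gap and measuring session length as the number of requests are assumptions, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def split_sessions(timestamps, timeout=timedelta(minutes=30)):
    """Split one user's request times into sessions at gaps longer than `timeout`."""
    sessions = []
    for moment in sorted(timestamps):
        if sessions and moment - sessions[-1][-1] <= timeout:
            sessions[-1].append(moment)   # still inside the current session
        else:
            sessions.append([moment])     # gap too long: start a new session
    return sessions


def active_sessions(sessions):
    """Keep sessions whose length is at least half of the average session length."""
    if not sessions:
        return []
    average = sum(len(s) for s in sessions) / len(sessions)
    return [s for s in sessions if len(s) >= average / 2]


start = datetime(2014, 8, 5, 10, 0)
clicks = [start + timedelta(minutes=m) for m in (0, 5, 12, 20, 70, 200, 204, 210, 215, 221)]
sessions = split_sessions(clicks)
print([len(s) for s in sessions])                   # [4, 1, 5]
print([len(s) for s in active_sessions(sessions)])  # [4, 5]
```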

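2.2.4 Search term extraction

After the preprocessing steps above, most URL records in the user sessions reflect browsing and downloading of data, and the rest record search behavior through the site's search function. The Earth System Science Data Sharing Platform provides two ways to search for data: ① keyword search, in which Chinese text is encoded in GB2312; ② spatial location search, in which Chinese text is encoded in UTF-8. When a user enters a Chinese search term, the URL field of the server log records the term in its encoded form, so the term can be recovered by decoding. Table 2 gives an example of search term extraction. This study filtered and decoded the URL fields of the logs to obtain the search terms entered by users, associated the results with user IDs and session IDs, and stored them in the database.

Tab. 2 Example of search word parsing

Field	Detail
URL	131.111.250.153,200,HTTP/1.1,-,-,"""-""", 2014.7.1, /Portal/mdsearch/regionresult.jsp?sw=%E5%8C%97%E4%BA%AC&sm=1&ps=10,…
Encoded search term	%E5%8C%97%E4%BA%AC
Encoding	UTF-8
Decoded result	北京 (Beijing)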
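The decoding step in Table 2 can be reproduced with Python's standard `urllib.parse` module. This is a minimal sketch: the helper name `extract_search_term` is hypothetical, and the assumption that the keyword search also carries its term in the `sw` parameter is not confirmed by the text, which shows `sw` only for the spatial location search.

```python
from urllib.parse import parse_qs, urlsplit


def extract_search_term(url, encoding, param="sw"):
    """Decode the percent-encoded search term carried by query parameter `param`."""
    query = urlsplit(url).query
    # errors="strict" raises UnicodeDecodeError when the wrong encoding is tried.
    values = parse_qs(query, encoding=encoding, errors="strict").get(param)
    return values[0] if values else None


url = "/Portal/mdsearch/regionresult.jsp?sw=%E5%8C%97%E4%BA%AC&sm=1&ps=10"
print(extract_search_term(url, "utf-8"))  # 北京
```

For keyword searches the same call would be made with `encoding="gb2312"`. Because strict decoding fails on a mismatched encoding, trying one encoding and falling back to the other is one possible way to handle URLs whose search mode is unknown.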

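The session identification results produced by preprocessing include each session's clickstream and the search terms used in the session. All user search terms are then stored in the database as the corpus for the subsequent prediction of user search patterns, and the active sessions are stored as units of user behavior for the subsequent cluster analysis that mines user search and visit patterns.

2.3 Session clustering

2.3.1 Clustering algorithm

This study uses the DBSCAN clustering algorithm to mine user visit patterns. DBSCAN is a density-based clustering algorithm originally proposed by Ester et al.^[23] for cluster analysis of spatial data. It separates high-density regions of the sample space by low-density regions and thereby obtains clusters. An important feature of the algorithm is that the number of clusters need not be specified in advance and that noise points can be excluded, which reduces the influence of human choices on the mining results. The algorithm distinguishes core objects, density-reachable objects, and noise objects.

(1) Core object (core point). If the ε-neighborhood of object p contains at least minPts objects, p is a core object. The objects within the ε-neighborhood are directly reachable from p (directly reachable points).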

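(2) Density-reachable objects (density-reachable points). If there is a chain of objects $p_1, p_2, \ldots, p_n$ with $p_1 = p$ and $p_n = q$ such that each $p_{i+1}$ is directly reachable from $p_i$ with respect to ε and minPts, that is, $p_{i+1}$ lies within the ε-neighborhood of $p_i$, then q is density-reachable from p.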

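(3) Noise objects (outliers). All objects that are not density-reachable from any other object are noise objects.

The procedure of the algorithm is as follows: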
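The authors' algorithm listing is not reproduced in this text, and the study itself relies on the scikit-learn implementation. The dependency-free sketch below only illustrates how the three object types defined above lead to clusters. The function name `dbscan`, the generic `dist` argument, and the toy one-dimensional data are assumptions; a point counts as a member of its own ε-neighborhood here.

```python
def dbscan(points, eps, min_pts, dist):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    n = len(points)
    # eps-neighbourhood of every point (each point is its own neighbour).
    neighbours = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n                      # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1                   # provisional noise, may become a border point
            continue
        cluster += 1                         # i is a core object: start a new cluster
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster          # border point reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # j is also core: keep expanding
    return labels


values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0]
labels = dbscan(values, eps=0.15, min_pts=2, dist=lambda a, b: abs(a - b))
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The neighbourhood table makes this sketch quadratic in memory and time, which is consistent with the memory cost on large datasets noted in the experimental design.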

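Running the algorithm requires two parameters: the ε-neighborhood radius and the neighborhood density threshold minPts. The best choice depends on the spatial distribution of the sample points. The basic approach is to examine the distance from each sample point to its k-th nearest neighbor^[24], called the k-dist. If a sample point belongs to a cluster, its k-dist is small. Because cluster densities differ and sample points are random, the k-dist values of all points show some randomness, but if the cluster density is fairly uniform this variation should not fluctuate markedly. If a sample point does not belong to any cluster, its k-dist is large. The k-dist values of all sample points are computed and plotted in ascending order. If the growth rate of the curve changes sharply at some point, the distance at that point is a suitable ε-neighborhood radius, and k can be used as minPts. The value of ε depends on the choice of k, but it does not change greatly as k varies.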
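The k-dist procedure can be sketched as follows. The helper names are hypothetical, the point itself is not counted among its k nearest neighbors (a convention the text does not fix), and picking the largest jump between consecutive sorted values is only a crude stand-in for reading the bend off the plotted curve.

```python
def k_dist(points, k, dist):
    """Sorted distances from every point to its k-th nearest other point."""
    values = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        values.append(others[k - 1])
    return sorted(values)


def knee_index(values):
    """Index just before the largest jump in the sorted k-dist curve."""
    jumps = [later - earlier for earlier, later in zip(values, values[1:])]
    return jumps.index(max(jumps))


samples = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0]
curve = k_dist(samples, k=1, dist=lambda a, b: abs(a - b))
eps = curve[knee_index(curve)]
print(round(eps, 2))  # 0.1
```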

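2.3.2 Distance calculation

Measuring the similarity or distance between two samples is an important step in cluster analysis. For a given clustering algorithm, the choice of distance function determines the clustering result. Common distance functions include Euclidean distance, Manhattan distance, cosine distance, Chebyshev distance, and Jaccard distance^[25]. For vectors with binary values (0 or 1), dozens of distance measures besides the Jaccard distance have been proposed in the literature of different disciplines^[26]. Considering the high dimensionality and sparsity of the data, this study uses the Jaccard distance, which is common in ecology. The Jaccard distance between sets A and B is given by Eq. (1).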

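$$d_J(A, B) = 1 - J(A, B) \qquad (1)$$

where J(A, B) is the Jaccard similarity coefficient (Eq. (2)), the ratio of the size of the intersection of A and B to the size of their union. Eq. (1) can therefore be written as Eq. (3).

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} \qquad (2)$$

$$d_J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|} \qquad (3)$$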
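Eq. (3) translates directly into a few lines of Python when each session is represented as the set of items it visited, which is equivalent to a binary vector. The function name and the convention that two empty sessions have distance 0 are assumptions.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two collections of visited item ids, as in Eq. (3)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sessions are treated as identical
    return (len(union) - len(a & b)) / len(union)


session_1 = {"d1", "d2", "d3"}
session_2 = {"d2", "d3", "d4"}
print(jaccard_distance(session_1, session_2))  # 0.5
```

Items that neither session visited do not enter the formula, which is why the measure suits sparse, high-dimensional session vectors.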

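2.4 Real-time search prediction

According to the types of user behavior on the Earth System Science Data Sharing Platform, this study divides user behavior prediction into search prediction and visit prediction and recommends data, respectively, when the user searches for data and while the user visits data. For user searches, recommendations are produced by combining the well-established TF-IDF statistic from information retrieval^[27] with the frequent item patterns in the clusters. The TF-IDF value of a term measures how important the term is to one document in a document collection or corpus. The importance of a term increases with the number of times it appears in the document but decreases with its frequency across the corpus, so terms with high TF-IDF values can serve as keywords of the document. The TF-IDF value of a term is computed as: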

$$\mathrm{TF}(d, t) = \log\left(1 + \frac{n(d, t)}{n(d)}\right) \qquad (4)$$

$$\mathrm{TF\text{-}IDF}(d, t) = \frac{\mathrm{TF}(d, t)}{n(t)} \qquad (5)$$

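where n(d, t) is the number of times term t appears in document d; n(d) is the number of terms in document d; and n(t) is the number of documents that contain term t. By analogy with text search, this study treats the set of user search terms in each session cluster as a document and the search terms of all sample sessions as the corpus, and computes the TF-IDF value of each search term. The TF-IDF value of every search term in every cluster is computed offline and stored in the database. When a new user enters a search term on the website and the term appears in the historical records, the clusters are ranked by the term's TF-IDF value, the top-ranked cluster is selected, and the three most frequent items in that cluster are recommended to the user. The algorithm proceeds as follows: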
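The authors' listing is not reproduced in this text. The sketch below follows Eqs. (4) and (5) and the description above. The function names, the in-memory dictionaries standing in for the database, and the toy data are assumptions, and the natural logarithm is used because the base is not stated (it does not affect the ranking).

```python
import math
from collections import Counter


def tfidf_by_cluster(cluster_terms):
    """Map {cluster: [search terms, with repeats]} to {cluster: {term: TF-IDF}}."""
    doc_freq = Counter()                       # n(t): clusters containing term t
    for terms in cluster_terms.values():
        doc_freq.update(set(terms))
    scores = {}
    for cid, terms in cluster_terms.items():
        counts = Counter(terms)                # n(d, t)
        total = len(terms)                     # n(d)
        scores[cid] = {t: math.log(1 + c / total) / doc_freq[t]
                       for t, c in counts.items()}
    return scores


def recommend_for_search(term, scores, cluster_items, top_n=3):
    """Pick the cluster where `term` scores highest and return its top items."""
    candidates = [(s[term], cid) for cid, s in scores.items() if term in s]
    if not candidates:
        return []                              # term never seen in the history
    _, best = max(candidates)
    return [item for item, _ in cluster_items[best].most_common(top_n)]


cluster_terms = {0: ["soil", "soil", "glacier"],
                 1: ["glacier", "glacier", "glacier", "land use"]}
cluster_items = {0: Counter({"soil_map": 5, "soil_ph": 3}),
                 1: Counter({"glacier_inv": 4, "permafrost": 3, "snow": 2, "ice": 1})}
scores = tfidf_by_cluster(cluster_terms)
print(recommend_for_search("glacier", scores, cluster_items))
# ['glacier_inv', 'permafrost', 'snow']
```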

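2.5 Real-time visit prediction

The basic idea of visit prediction is to treat the user's visit clickstream as a query vector and match it against the clusters. The behavioral characteristics of a cluster are represented by its centroid. Session vectors in this study are binary: 1 means an item was visited and 0 means it was not, as shown in Fig. 3. The cluster centroid is the mean of the session vectors in the cluster, that is, the sum of the vectors divided by their number. The centroid should reflect the main characteristics of the cluster, so elements of the centroid vector with very small values should be removed. Following Mobasher et al.^[5], this study normalizes the centroid vector by scaling all elements proportionally so that the largest value is 1, and then removes the elements whose normalized values are below 0.5.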
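A minimal sketch of the centroid computation described above, assuming each session is stored as the set of item ids it visited (the sparse form of a binary vector). The function name `cluster_profile` and the dictionary output are illustrative choices.

```python
from collections import Counter


def cluster_profile(sessions, threshold=0.5):
    """Normalised centroid of a cluster, keeping elements with weight >= threshold."""
    counts = Counter(item for session in sessions for item in set(session))
    if not counts:
        return {}
    size = len(sessions)
    mean = {item: c / size for item, c in counts.items()}  # mean of the binary vectors
    peak = max(mean.values())                              # scale so the maximum is 1
    return {item: w / peak for item, w in mean.items() if w / peak >= threshold}


cluster = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"a", "d"}]
print(cluster_profile(cluster) == {"a": 1.0, "b": 0.5})  # True
```

Items `c` and `d` each occur in only one of four sessions, so their normalized weight of 0.25 falls below the 0.5 cut-off and they drop out of the profile.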

Fig. 3 User session vector

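During an online visit, the user's visit stream is recorded in real time and converted into a real-time visit vector. A suitable threshold is chosen from the historical session lengths; when the number of items visited reaches the threshold, the Jaccard distance between the visit vector and each cluster centroid is computed. The cluster with the smallest distance is the one to which the visit vector most likely belongs. Statistics show that the average session length (the number of datasets visited) from 2011 to 2015 was 6.77. To compute recommendations quickly while capturing the user's interests as accurately as possible, this study takes half of the average session length, i.e., 3, as the threshold. Once a user has visited three datasets online, the session is matched to the nearest cluster and the three most frequent datasets in that cluster are recommended. Note that recommendations should exclude datasets the user has already visited in the session: the datasets in the cluster are sorted by frequency in descending order, those already visited are excluded, and the top three are recommended. The algorithm proceeds as follows: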
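The authors' listing is not reproduced in this text; the steps above might be sketched as below. The names are hypothetical, and treating each filtered centroid as the set of items it retains when computing the Jaccard distance is an assumption, since the text does not say how the real-valued centroid enters the distance.

```python
from collections import Counter


def recommend_for_visit(visited, profiles, cluster_items, min_items=3, top_n=3):
    """Match the live visit set to the nearest cluster and return unseen top items."""
    visited = set(visited)
    if len(visited) < min_items or not profiles:
        return []                               # wait until the threshold is reached

    def distance(cid):
        items = set(profiles[cid])              # items retained in the centroid
        return 1 - len(visited & items) / len(visited | items)

    best = min(profiles, key=distance)
    ranked = [item for item, _ in cluster_items[best].most_common()]
    return [item for item in ranked if item not in visited][:top_n]


profiles = {0: {"a": 1.0, "b": 0.8, "c": 0.6}, 1: {"x": 1.0, "y": 0.7}}
cluster_items = {0: Counter({"a": 9, "b": 7, "c": 5, "d": 4, "e": 3, "f": 2}),
                 1: Counter({"x": 8, "y": 5, "z": 1})}
print(recommend_for_visit(["a", "b", "q"], profiles, cluster_items))  # ['c', 'd', 'e']
```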

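2.6 Evaluation of recommendation performance

As in information retrieval, commonly used metrics for recommender systems are precision, recall, and F1^[28]. Precision is the ratio of the number of recommendations that hit to the total number of recommendations; recall is the ratio of the number of recommendations that hit to the number of items the user visited; F1 combines the two into a single metric. This study evaluates the recommendations with precision and recall. Let R(u) be the N items recommended to user u and T(u) the set of items user u visited in the test set. Precision and recall are computed as: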
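The equations themselves are missing from this text. The commonly used form that matches the verbal definitions above, which may differ in detail from the authors' equations, is $\mathrm{Precision} = \sum_u |R(u) \cap T(u)| / \sum_u |R(u)|$ and $\mathrm{Recall} = \sum_u |R(u) \cap T(u)| / \sum_u |T(u)|$. A small sketch under that assumption, with a hypothetical function name and toy data:

```python
def precision_recall(recommended, visited):
    """Aggregate precision and recall over users.

    recommended: {user: items recommended, R(u)}
    visited:     {user: items actually visited in the test set, T(u)}
    """
    hits = n_recommended = n_visited = 0
    for user, items in recommended.items():
        truth = set(visited.get(user, ()))
        hits += len(set(items) & truth)
        n_recommended += len(items)
        n_visited += len(truth)
    precision = hits / n_recommended if n_recommended else 0.0
    recall = hits / n_visited if n_visited else 0.0
    return precision, recall


recommended = {"u1": {"a", "b", "c"}, "u2": {"d", "e", "f"}}
visited = {"u1": {"a", "x"}, "u2": {"d", "e", "y", "z"}}
print(precision_recall(recommended, visited))  # (0.5, 0.5)
```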

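3 Experiments and Results

3.1 Experimental design

This study uses the log data from 2011 to 2014 as the training set and the log data from 2015 as the test set. First, the five years of log data were preprocessed with the methods in Section 2.2, and the search terms were extracted and stored in the database. Then, session clustering and frequent item statistics were carried out on the training set, and the results were stored for later queries. Running a clustering algorithm on a large dataset consumes a great deal of memory. Given the limits of the experimental environment and time, the strategy adopted was to cluster part of the data first and then assign the remaining data to clusters according to the cluster centroids. Considering the timeliness of the data, the most recent data, from 2014, were used as the base data and clustered with DBSCAN. The session vectors from 2011 to 2013 were then assigned to clusters according to their distances to the cluster centroids. On this basis, the final cluster centroids were computed and stored as attributes of the corresponding clusters. The TF-IDF values of the search terms in each cluster were also computed and stored, linked to the corresponding clusters, in the corpus statistics database. These steps correspond to the offline computation in the application scenario.

The experiment on prediction performance has two parts, one for search behavior and one for visit behavior: ① Search pattern prediction. The 2015 user sessions were further divided into "search + visit" behavior units: within a session, a search term together with the page visits that follow it, up to the next search term, forms one unit. In testing, the search term is fed to the prediction algorithm to obtain recommendations, and the recommended items are compared with the actual page visits in the unit to measure recommendation performance. ② Dataset visit pattern prediction. To verify visit prediction, half of the dataset visits in each test session were removed and the first half of each session was kept. Sessions whose remaining length was at least 3 were selected as test data. The test vectors were fed to the visit prediction algorithm, recommendations were produced, and recommendation performance was computed. For comparison, precision and recall were also computed for the recommendation currently used on the website, which is based on simple frequency statistics, and compared with the results of this study.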
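The construction of a visit prediction test case (part ②) might look like the sketch below. The function name is hypothetical, and rounding the half down for sessions of odd length is an assumption the text does not settle.

```python
def split_test_session(items, min_input=3):
    """Split a test session into (input half, held-out half), or None if too short."""
    half = len(items) // 2                 # assumption: round down for odd lengths
    observed, held_out = items[:half], items[half:]
    if len(observed) < min_input:
        return None                        # retained part shorter than 3: not used
    return observed, held_out


print(split_test_session(["d1", "d2", "d3", "d4", "d5", "d6"]))
# (['d1', 'd2', 'd3'], ['d4', 'd5', 'd6'])
print(split_test_session(["d1", "d2", "d3", "d4"]))  # None
```

The observed half plays the role of the real-time visit vector, and the held-out half serves as the ground truth against which the recommendations are scored.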

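Because the dataset is large, the experiment places high demands on the computing environment. The experiments ran on 64-bit Windows 10 with an Intel Core i7 processor and 16 GB of memory. The programming language was Python and the database was MySQL. The DBSCAN clustering algorithm used the open-source implementation in scikit-learn^[29].

3.2 Data preprocessing results

The raw Web server logs from 2011 to 2015 contain about 53.49 million records. After data cleaning, about 13.53 million valid records remained, roughly one quarter of the raw data. Table 3 gives the preprocessing results for each year. The results show that user visits and searches on the Earth System Science Data Sharing Platform have been growing. The number of users increased every year except 2014.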

Tab. 3 Statistics for data preprocessing

Year	Raw log records	Records after cleaning	Users	Sessions	Active sessions	Searches	Search terms
2011	10 062 608	2 664 473	62 557	219 918	54 121	76 793	4589
2012	9 546 068	2 394 507	76 098	234 585	55 726	82 914	3883
2013	10 584 125	2 708 978	82 302	264 906	58 237	110 056	5426
2014	11 062 608	2 845 150	78 111	348 495	68 562	111 913	6243
2015	12 236 056	2 914 507	89 937	365 752	70 969	122 868	6761

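3.3 User search hotspots

From 2011 to 2015, users performed about 500 000 searches, and the resulting vocabulary contains 8624 distinct search terms. Aggregating the search terms of the five years reveals the search hotspots, which reflect hot research topics in Earth system science. Figure 4 is a word cloud of the search terms that were searched more than 100 times; font size reflects the frequency of a term. The statistics show that data on "soil", "Qinghai-Tibet", "glacier", "land use", "Loess Plateau", "desert", and "delta" attract the most attention from users.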

Fig. 4 Word cloud of search terms

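3.4 Clustering results

After the clustering matrix for the 2014 active sessions was computed, the k-dist values were obtained with the method described in Section 2.3.1. The default k of the algorithm is 4; given the large amount of data, k was set to 10. Figure 5 shows the k-dist curve. The growth rate of the curve changes markedly at a y value of 0.26. The DBSCAN parameters in this study were therefore an ε-neighborhood radius of 0.26 and minPts of 10. The final number of clusters was 237.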

Fig. 5 K-dist plot of sample data (sorted by distance to 10th nearest neighbor)

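Analysis of the clustering results shows that the clusters are clearly theme-related: the data in a session cluster tend to come from the same theme. For example, in the largest cluster (the one with the most sessions), 83.6% of the data belong to the platform's "Land cover and land use" category. Table 4 lists the data themes of the five largest clusters. This result indicates that, within a session, users usually visit and download data of a single theme. In addition, the data themes of the top-ranked clusters correlate clearly with the hot search terms in Fig. 4: hot terms such as "land use", "soil", "mountain hazards", and "glacier" closely match the themes of the top five clusters. A likely reason is that the site's search is based on text matching, so users usually start with a keyword search and then visit data of the corresponding theme.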

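Tab. 4 Statistics for cluster theme (top 5)

Cluster ID	Sessions in cluster	Data theme	Share of theme/%
01	8355	Land cover and land use	83.6
02	6712	Soil data	78.4
03	6248	Environment and hazards	84.3
04	5471	Topography and geomorphology	74.5
05	4676	Glaciers and permafrost	69.8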

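3.5 User search and visit prediction

With the 2015 active user sessions as the test set, precision and recall were computed for search recommendation and visit recommendation (Fig. 6). For comparison with the visit recommendation method, the existing visit recommendation, based on monthly frequency statistics within data categories, was also evaluated. The results show that search recommendation reached a precision of 26.4% and a recall of 31.7%; dataset visit recommendation reached a precision of 31.5% and a recall of 37.8%; the existing frequency-based top-3 recommendation had a precision of 14.8% and a recall of 17.6%. Visit recommendation based on session clustering therefore performs considerably better than simple frequency-based recommendation. On average, about one of every three recommended datasets is of interest to the user.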

Fig. 6 Comparisons of precision and recall

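4 Conclusions and Outlook

Taking the Earth System Science Data Sharing Platform as an example, this study applied the TF-IDF method for search recommendation and the DBSCAN clustering algorithm for session clustering. The user behavior patterns obtained are a useful reference for mining the sharing and use of geoscience data, and the search and visit recommendation methods developed are of practical value. The approach relies on the historical sessions of other users and thus falls within collaborative filtering. Unlike traditional collaborative filtering, it takes the user session, rather than the individual user, as the unit of behavior. Because most of the computation is done offline, the online scenario only needs the user's search term or visit vector to query the precomputed search and visit patterns, so responses can be produced in real time, which suits online search and visit recommendation.

The experiment used data from 2011 to 2014 as the training set and data from 2015 as the test set to validate the recommendations. The following conclusions can be drawn from the results:

(1) User session clusters are theme-related, and cluster themes are clearly consistent with hot search terms. Compared with users of ordinary websites, users of geoscience data sharing websites are more professional in their visit and search behavior and usually have clear data needs when they visit, so their behavior is more predictable.

(2) DBSCAN is based on density statistics, and the choice of distance function strongly affects the clustering result. To predict the behavior of geoscience data sharing users, user behavior must therefore be clearly defined and a suitable distance function must be used to describe behavioral similarity.

(3) Predicting users' data needs from the TF-IDF values of search terms is feasible. The resulting recommendations can supplement search results and make data search more intelligent. Cluster-based visit recommendation achieves good accuracy, a considerable improvement over the original recommendation based on visit frequency statistics.

Beyond geoscience data sharing services, the methods and conclusions of this paper can also be applied to scientific data sharing services in other fields. Owing to limits of data and experimental environment, the study still has limitations: ① the corpus of user search terms comes from historical records, and historical search terms may not cover all themes of the data shared on the platform; ② the validation data come from historical logs. Although users' data interests can be derived from historical data to evaluate the recommendations, the user experience in a real environment remains unknown. Future work will extend the search term corpus with dataset metadata and introduce thesauri of synonyms and near-synonyms to broaden the applicability of search recommendation. The method will also be deployed on the web servers of data sharing platforms in geoscience and related fields to evaluate user experience in practice.

Acknowledgments: We thank the Earth System Science Data Sharing Platform of the National Science & Technology Infrastructure for providing the data for this study.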

The authors have declared that no competing interests exist.

References

[1]	李娟,刘德洪,江洪. 国际科学数据共享现状研究[J]. 图书馆建设,2009(2):19-22. [Li J, Liu D H, Jiang H. Research on international scientific data sharing[J]. Library Development, 2009,2:19-22.]

[2]	王卷乐,诸云强,谢传节. 地球系统科学数据共享网络平台的设计和开发[J]. 地学前缘,2006,13(3):54-59. [Wang J L, Zhu Y Q, Xie C J. Network platform design and development for Earth System Science data sharing[J]. Earth Science Frontiers, 2006,13(3):54-59.]

[3]	Liu B. Web usage mining[J]. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 2007:449-483.

[4]	Liu B. Web data mining: exploring hyperlinks, contents, and usage data[M]. New York: Springer Science & Business Media, 2007.

[5]	Mobasher, Dai, Luo, et al. Discovery and evaluation of aggregate usage profiles for web personalization[J]. Data Mining and Knowledge Discovery, 2002,6(1):61-82.

[6]	Zhang, Edwards, Harding J. Personalised online sales using web usage data mining[J]. Computers in Industry, 2007,58:772-782.

[7]	邓爱林,左子叶,朱扬勇. 基于项目聚类的协同过滤推荐算法[J]. 小型微型计算机系统,2004,25(9):1665-1670. [Deng A, Zuo Z, Zhu Y. Collaborative filtering recommendation algorithm based on item clustering[J]. Mini-micro Systems, 2004,25(9):1665-1670.]

[8]	王国霞,刘贺平. 个性化推荐系统综述[J]. 计算机工程与应用,2012,48(7):66-76. [Wang G, Liu H. A survey on personalised recommender systems[J]. Computer Engineering and Applications, 2012,48(7):66-76.]

[9]	Van Meteren R, Van Someren M. Using content-based filtering for recommendation[C]. Proceedings of the Machine Learning in the New Information Age: MLnet/ECML2000 Workshop, F, 2000.

[10]	Herlocker J L, Konstan J A, Terveen L G, et al. Evaluating collaborative filtering recommender systems[J]. ACM Transactions on Information Systems (TOIS), 2004,22(1):5-53.

[11]	Vaz P C, Martins de Matos D, Martins B, et al. Improving a hybrid literary book recommendation system through author ranking[C]. Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries, F, 2012.

[12]	Azaria A, Hassidim A, Kraus S, et al. Movie recommender system for profit maximization[C]. Proceedings of the 7th ACM conference on Recommender systems, 2013.

[13]	Wei, Zheng, Chen, et al. A hybrid approach for movie recommendation via tags and ratings[J]. Electronic Commerce Research and Applications, 2016,18:83-94.

[14]	Domingues M, Gouyon, Jorge A, et al. Combining usage and content in an online recommendation system for music in the long tail[J]. International Journal of Multimedia Information Retrieval, 2013,2(1):3-13.

[15]	Wang X, Wang Y. Improving content-based and hybrid music recommendation using deep learning[C]. Proceedings of the 22nd ACM international conference on Multimedia, 2014.

[16]	Guerbas, Addam, Zaarour, et al. Effective web log mining and online navigational pattern prediction[J]. Knowledge-Based Systems, 2013,49:50-62.

[17]	王卷乐,孙九林. 地球系统科学数据共享标准规范体系研究与应用[J]. 地理科学进展,2010,28(6):839-847. [Wang J, Sun J. Study on scientific data sharing standards and specifications systems for Earth System Science and its application[J]. Progress in Geography, 2010,28(6):839-847.]

[18]	诸云强,孙九林,廖顺宝,等. 地球系统科学数据共享研究与实践[J]. 地球信息科学学报,2010,12(1):1-8. [Zhu Y, Sun J, Liao S, et al. Earth system scientific data sharing research and practice[J]. Journal of Geo-information Science, 2010,12(1):1-8.]

[19]	Chitraa, Davamani, Selvdoss. A survey on preprocessing methods for web usage data[J]. arXiv preprint arXiv:10041257, 2010,7(3):78-83.

[20]	Wang M, Wang J. A data preprocessing framework of geoscience data sharing portal for user behavior mining[C]. Proceedings of the Geoinformatics, 2015 23rd International Conference on, F 19-21 June, 2015.

[21]	王末,王卷乐. Web 环境下地学数据共享用户行为模式分析[J]. 地球信息科学学报,2016,18(9):1174-1183. [Wang, Wang J. A study on user behavior of geoscience data sharing based on web usage mining[J]. Journal of Geo-information Science, 2016,18(9):1174-1183.]

[22]	Berendt B, Mobasher B, Nakagawa M, et al. The impact of site structure and user environment on session reconstruction in web usage analysis[C]. Proceedings of the International Workshop on Mining Web Data for Discovering Usage Patterns and Profiles, 2002.

[23]	Ester M, Kriegel H P, Sander J, et al. A density-based algorithm for discovering clusters in large spatial databases with noise[C]. Proceedings of the Kdd, 1996.

[24]	Tan P N. Introduction to data mining[M]. Pearson Education India, 2006.

[25]	Jaccard P. The distribution of the flora in the alpine zone[J]. New phytologist, 1912,11(2):37-50.

[26]	Choi S, Cha S, Tappert C. A survey of binary similarity and distance measures[J]. Journal of Systemics, Cybernetics and Informatics, 2010,8(1):43-48.

[27]	Sparck Jones K. A statistical interpretation of term specificity and its application in retrieval[J]. Journal of documentation, 1972,28(1):11-21.

[28]	Konstan J. Introduction to recommender systems: Algorithms and evaluation[J]. ACM Transactions on Information Systems (TOIS), 2004,22(1):1-4.

[29]	Pedregosa, Varoquaux, Gramfort, et al. Scikit-learn: Machine learning in Python[J]. Journal of Machine Learning Research, 2011,12(10):2825-2830.
