Journal of Geo-information Science ›› 2017, Vol. 19 ›› Issue (5): 595-604. doi: 10.3724/SP.J.1047.2017.00595

• Theory and Methods of Geo-information Science •

A Method for Predicting Web User Behavior and Recommending Data on Geoscience Data Sharing Networks

WANG Mo (王末)1,2, WANG Juanle (王卷乐)1,4,*, HE Yuntao (赫运涛)3

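  1. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101
    2. University of Chinese Academy of Sciences, Beijing 100049
    3. National Science & Technology Infrastructure Center, Ministry of Science and Technology, Beijing 100862
    4. Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing 210023
  • Received: 2016-11-02  Revised: 2017-01-22  Online: 2017-05-20  Published: 2017-05-20
  • Corresponding author: WANG Juanle  E-mail: wangm.13b@igsnrr.ac.cn; wangjl@igsnrr.ac.cn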
  • About the author:

    WANG Mo (1987-), male, PhD candidate. Research interests: geoscience data sharing and spatial data mining. E-mail: wangm.13b@igsnrr.ac.cn

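  • Funding:
    National Science & Technology Infrastructure Platform: Earth System Science Data Sharing Platform (2005DKA32300); Cultivation, Construction and Service Project for Featured Institutes of the Chinese Academy of Sciences (TSYJS03); Construction Project of the China Knowledge Center for Engineering Sciences and Technology (CKCEST-2016-3-7)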

An Approach for Prediction of Web User Behavior and Data Recommendation for Geoscience Data Sharing Portals

WANG Mo1,2, WANG Juanle1,4,*, HE Yuntao3

  1. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, CAS, Beijing 100101, China
    2. University of Chinese Academy of Sciences, Beijing 100049, China
    3. National Science & Technology Infrastructure Center, Beijing 100862, China
    4. Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing 210023, China
  • Received:2016-11-02 Revised:2017-01-22 Online:2017-05-20 Published:2017-05-20
  • Contact: WANG Juanle E-mail:wangm.13b@igsnrr.ac.cn;wangjl@igsnrr.ac.cn

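Abstract (from the Chinese version):

Helping users quickly find the data they need in a networked environment is a long-standing challenge for geoscience data sharing platforms. Using Web server logs from the National Earth System Science Data Sharing Platform, this paper extracts users' search behavior and dataset visit behavior, mines user behavior patterns with a clustering algorithm, and develops online search and visit prediction algorithms based on the session cluster patterns. In the data preprocessing stage, the raw server logs undergo data cleaning, user identification, session identification, and search term extraction. In the pattern mining stage, sessions are clustered with the DBSCAN algorithm. Because session vectors are binary, distances in the clustering algorithm are computed with the Jaccard distance function. The set of search terms in each session cluster is treated as a document and the set of all users' historical search terms as the corpus, and the TF-IDF value of each search term in each cluster is computed. For online search recommendation, the search term is looked up against the TF-IDF values in each cluster, the cluster in which the term has the highest TF-IDF value is returned, and that cluster's high-frequency items are given as recommendations. For online visit recommendation, the user's real-time visit vector serves as the query vector, and its distance to each cluster centroid is computed. The clusters are ranked by distance, the nearest cluster is selected, and that cluster's high-frequency items are produced as recommendations. Experimental results show that search recommendation based on TF-IDF and clustering achieves relatively high precision and recall, and that visit recommendation improves considerably over recommendation based on high-frequency statistics. The study draws the following conclusions: (1) the visit and search behavior of geoscience data sharing users reflects their professional focus, and is more predictable than that of ordinary website users; (2) predicting the behavior of geoscience data sharing users requires a clear definition of user behavior and a suitable distance function to describe behavioral similarity; (3) predicting users' data needs from the TF-IDF values of search terms is feasible, and the resulting recommendations can supplement search results. This research can support the construction of geoscience data sharing platforms and improve the quality of sharing services, and can also serve as a technical reference for scientific data sharing in other fields.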
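
The pattern mining stage described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each session is represented as a set of visited dataset identifiers (equivalent to a binary session vector), the dataset names and the `eps` and `min_pts` values are hypothetical, and DBSCAN is written out in plain Python so that the example is self-contained.

```python
from typing import FrozenSet, List, Optional

NOISE = -1  # label given to sessions that belong to no cluster


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sessions count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def dbscan(sessions: List[FrozenSet[str]], eps: float, min_pts: int) -> List[int]:
    """Cluster sessions with DBSCAN; returns one label per session."""
    n = len(sessions)
    # eps-neighbourhood of every session (a session is its own neighbour).
    neighbours = [
        [j for j in range(n) if jaccard_distance(sessions[i], sessions[j]) <= eps]
        for i in range(n)
    ]
    labels: List[Optional[int]] = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        if len(neighbours[i]) < min_pts:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # core point: keep expanding
    return [label if label is not None else NOISE for label in labels]


# Hypothetical sessions: the datasets visited in each session.
sessions = [
    frozenset({"dem", "landuse", "soil"}),
    frozenset({"dem", "landuse"}),
    frozenset({"dem", "landuse", "soil", "slope"}),
    frozenset({"precipitation", "temperature"}),
    frozenset({"precipitation", "temperature", "evaporation"}),
    frozenset({"population"}),
]
labels = dbscan(sessions, eps=0.6, min_pts=2)
```

With these illustrative parameters the terrain-related sessions and the climate-related sessions form two clusters, and the single population session is left as noise. The Jaccard distance ignores items that both sessions lack, which is why it suits sparse binary session vectors better than a Euclidean distance would.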

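Keywords (from the Chinese version): Web data mining, user behavior prediction, user behavior patterns, scientific data sharing, Earth system science data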

Abstract:

Efficient and precise discovery of geoscience data on data sharing websites has been a challenge for years. This study applied Web mining techniques to the National Earth Science Data Sharing Platform to derive user search and visit behaviors with a clustering algorithm, and proposes cluster-based approaches for search recommendation and visit recommendation. In the data preprocessing stage, data cleaning, user identification, session identification, and search term extraction were performed. In the user behavior mining stage, the DBSCAN algorithm was employed for session clustering with the Jaccard distance metric, given the binary nature of session vectors. To mine user search patterns, we regard the collection of search terms in each cluster as a document and the collection of all historical search terms as the corpus, and compute the TF-IDF value of each search term in each cluster. For online search recommendation, the real-time search term is used to look up the TF-IDF values in the clusters, and the cluster in which the term has the highest TF-IDF value is returned. The most frequent items in that cluster form the recommendation list. For online visit recommendation, the real-time visit vector is used to query the clusters by its distance to the cluster centroids. The nearest cluster is selected, and its most frequent items are returned as recommendations. The experimental results revealed the hot research topics in geoscience in recent years. The proposed search recommendation achieves fair precision and recall, and the visit recommendation improves considerably on a frequency-based approach. It can be concluded that: (1) Web users of geoscience data sharing are more professional and more predictable than ordinary Web users; (2) DBSCAN is a density-based clustering algorithm, so it is vital to define user behavior specifically and to choose a proper distance metric; (3) a TF-IDF-based approach to predicting users' search needs is feasible, and the resulting search recommendations can complement keyword-based searching. The outcome of this study could contribute to the development of the National Earth Science Data Sharing Platform, and of other science data sharing platforms.
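
The two online recommendation scenarios in the abstract can be sketched as follows. This is a hedged illustration, not the authors' implementation. The cluster contents and dataset names are hypothetical. The unsmoothed formula tf × ln(N / df) is one common TF-IDF variant chosen here as an assumption. The abstract does not say how the distance between a binary visit vector and a cluster centroid is computed, so the sketch assumes a weighted Jaccard distance against a mean-vector centroid.

```python
import math
from collections import Counter
from typing import Dict, List, Set

# Hypothetical mined clusters: search terms issued and datasets visited per cluster.
cluster_terms: Dict[int, List[str]] = {
    0: ["dem", "dem", "landuse", "soil"],
    1: ["precipitation", "temperature", "precipitation", "dem"],
    2: ["population", "gdp"],
}
cluster_sessions: Dict[int, List[Set[str]]] = {
    0: [{"dem_30m", "landuse_2010"}, {"dem_30m", "soil_map"},
        {"dem_30m", "landuse_2010", "soil_map"}],
    1: [{"precip_daily", "temp_daily"}, {"precip_daily"}],
    2: [{"population_grid"}, {"population_grid", "gdp_grid"}],
}


def tfidf_by_cluster(terms_by_cluster: Dict[int, List[str]]) -> Dict[int, Dict[str, float]]:
    """Treat each cluster's search terms as one document; return TF-IDF per term."""
    n = len(terms_by_cluster)
    df = Counter(t for terms in terms_by_cluster.values() for t in set(terms))
    return {
        cid: {t: (c / len(terms)) * math.log(n / df[t]) for t, c in Counter(terms).items()}
        for cid, terms in terms_by_cluster.items()
    }


def top_items(sessions: List[Set[str]], k: int) -> List[str]:
    """Most frequently visited items in a cluster; ties broken alphabetically."""
    counts = Counter(item for s in sessions for item in s)
    return [item for item, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]


def recommend_for_search(term: str, k: int = 2) -> List[str]:
    """Pick the cluster where `term` has the highest TF-IDF; return its top items."""
    scores = tfidf_by_cluster(cluster_terms)
    candidates = [(s[term], -cid) for cid, s in scores.items() if term in s]
    if not candidates:
        return []  # unseen term: no cluster to recommend from
    best = -max(candidates)[1]
    return top_items(cluster_sessions[best], k)


def centroid(sessions: List[Set[str]]) -> Dict[str, float]:
    """Mean of the binary session vectors: share of sessions containing each item."""
    counts = Counter(item for s in sessions for item in s)
    return {item: c / len(sessions) for item, c in counts.items()}


def weighted_jaccard_distance(visit: Set[str], cen: Dict[str, float]) -> float:
    """Assumed metric: 1 - sum(min) / sum(max) over all items."""
    items = visit | set(cen)
    lo = sum(min(float(i in visit), cen.get(i, 0.0)) for i in items)
    hi = sum(max(float(i in visit), cen.get(i, 0.0)) for i in items)
    return 1.0 - lo / hi if hi else 0.0


def recommend_for_visit(visit: Set[str], k: int = 2) -> List[str]:
    """Pick the nearest cluster; return its top items not yet visited."""
    best = min(
        cluster_sessions,
        key=lambda cid: (weighted_jaccard_distance(visit, centroid(cluster_sessions[cid])), cid),
    )
    ranked = top_items(cluster_sessions[best], k + len(visit))
    return [item for item in ranked if item not in visit][:k]
```

For example, `recommend_for_search("dem")` selects cluster 0, because "dem" accounts for a larger share of that cluster's search terms than of cluster 1's, and `recommend_for_visit({"dem_30m"})` returns the remaining frequent items of the same cluster. Filtering out already visited items is a design choice of this sketch, not something stated in the abstract.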

Key words: Web Usage Mining, spatial data mining, user behavior mining, science data sharing, Earth System Science data