Journal of Geo-information Science ›› 2017, Vol. 19 ›› Issue (5): 595-604. doi: 10.3724/SP.J.1047.2017.00595

• Original Article

An Approach for Prediction of Web User Behavior and Data Recommendation for Geoscience Data Sharing Portals

WANG Mo1,2, WANG Juanle1,4,*, HE Yuntao3

  1. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, CAS, Beijing 100101, China
    2. University of Chinese Academy of Sciences, Beijing 100049, China
    3. National Science & Technology Infrastructure Center, Beijing 100862, China
    4. Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing 210023, China
  • Received:2016-11-02 Revised:2017-01-22 Online:2017-05-20 Published:2017-05-20
  • Contact: WANG Juanle E-mail:wangm.13b@igsnrr.ac.cn;wangjl@igsnrr.ac.cn

Abstract:

Efficient and precise discovery of geoscience data on data sharing websites has been a challenge for years. This study applied Web mining techniques to the National Earth Science Data Sharing Platform to derive user search and visit behaviors with a clustering algorithm, and proposed cluster-based approaches for search recommendation and visit recommendation. At the data preprocessing stage, data cleaning, user identification, session identification, and search term extraction were performed. At the user behavior mining stage, the DBSCAN algorithm was employed for session clustering with the Jaccard distance metric, given the binary nature of session vectors. To mine user search patterns, we regarded the collection of search terms in each cluster as a text document and the collection of all historical search terms as the corpus; the TF-IDF value of each search term in each cluster was then computed. In the online search recommendation scenario, the real-time search term is used to look up the TF-IDF values in the clusters, and the cluster with the highest TF-IDF value is returned. The most frequent items in that cluster form the recommendation list. In the online visit recommendation scenario, the real-time visit vector is used to query the clusters by the distance between the visit vector and the cluster centroids. The nearest cluster is selected, and its most frequent items are returned as recommendations. The experimental results revealed the hot research topics in geoscience in recent years. The proposed search recommendation achieved fair precision and recall, and the visit recommendation improved considerably over the frequency-based approach. It can be concluded that: (1) Web users of geoscience data sharing are more professional and predictable than general Web users; (2) DBSCAN is a density-based clustering algorithm, so it is vital to define user behavior specifically and to choose a proper distance metric; (3) a TF-IDF-based approach to predicting users' search needs is feasible, and the resulting search recommendation could complement keyword-based searching. The outcomes of this study could contribute to the development of the National Earth Science Data Sharing Platform and of other science data sharing platforms.
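
The two recommendation steps described in the abstract can be illustrated with a minimal Python sketch. It assumes sessions have already been grouped into clusters (for example by DBSCAN over Jaccard distances) and is not the authors' implementation. The function names, the sample dataset names, the smoothed IDF variant, the Euclidean distance to the centroid, and the exclusion of already visited items are all illustrative assumptions; the abstract does not specify them.

```python
import math
from collections import Counter


def jaccard_distance(a, b):
    """Jaccard distance between two sessions given as sets of visited item ids."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sessions are treated as identical
    return 1.0 - len(a & b) / len(union)


def cluster_tfidf(cluster_terms):
    """Treat each cluster's search terms as one document, all clusters as the corpus.

    cluster_terms: {cluster_id: [search term, ...]}
    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1 (an assumed variant).
    """
    n = len(cluster_terms)
    df = Counter()  # number of clusters in which each term occurs
    for terms in cluster_terms.values():
        df.update(set(terms))
    scores = {}
    for cid, terms in cluster_terms.items():
        counts = Counter(terms)
        total = len(terms)
        scores[cid] = {
            t: (c / total) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in counts.items()
        }
    return scores


def item_frequencies(sessions):
    """Count in how many sessions of a cluster each item was visited."""
    freq = Counter()
    for s in sessions:
        freq.update(set(s))
    return freq


def recommend_for_search(term, tfidf, cluster_sessions, k=3):
    """Pick the cluster with the highest TF-IDF for `term`; return its top-k items."""
    candidates = [cid for cid, scores in tfidf.items() if term in scores]
    if not candidates:
        return []
    best = max(candidates, key=lambda cid: tfidf[cid][term])
    return [item for item, _ in item_frequencies(cluster_sessions[best]).most_common(k)]


def recommend_for_visit(visited, cluster_sessions, k=3):
    """Pick the cluster whose centroid is nearest to the binary visit vector."""
    visited = set(visited)
    best_dist, best_freq = math.inf, None
    for sessions in cluster_sessions.values():
        if not sessions:
            continue
        freq = item_frequencies(sessions)
        # Centroid coordinate = share of the cluster's sessions that visited the item.
        centroid = {item: c / len(sessions) for item, c in freq.items()}
        dims = visited | set(centroid)
        # Euclidean distance between the visit vector and the centroid (assumption).
        dist = math.sqrt(sum(
            ((1.0 if d in visited else 0.0) - centroid.get(d, 0.0)) ** 2 for d in dims
        ))
        if dist < best_dist:
            best_dist, best_freq = dist, freq
    if best_freq is None:
        return []
    # Skip items the user has already seen (assumption).
    return [item for item, _ in best_freq.most_common() if item not in visited][:k]


if __name__ == "__main__":
    # Hypothetical clustered sessions and search terms, for illustration only.
    sessions = {
        0: [{"soil_map", "landuse_2010"}, {"soil_map", "landuse_2010", "dem_30m"}, {"soil_map"}],
        1: [{"precip_daily", "temp_monthly"}, {"precip_daily"},
            {"precip_daily", "temp_monthly", "ndvi"}],
    }
    terms = {0: ["soil", "soil", "land use", "dem"],
             1: ["precipitation", "climate", "precipitation", "soil"]}
    print(jaccard_distance(sessions[0][0], sessions[0][1]))
    print(recommend_for_search("soil", cluster_tfidf(terms), sessions, k=2))
    print(recommend_for_visit({"precip_daily"}, sessions, k=2))
```

Keeping sessions as sets mirrors the binary session vectors mentioned in the abstract: an item is either visited in a session or not, which is why a set-overlap measure such as Jaccard distance is a natural fit for the clustering step.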

Key words: Web usage mining, spatial data mining, user behavior mining, science data sharing, Earth System Science data